Step 1: Data Preprocessing
Dataset specifics
Above is a portion of how the dataset is set up, with the class column being the dependent variable vector.
- Dataset taken from UCI Machine Learning Repository
- 8124 instances of mushroom data
- 48% poisonous, 52% edible (roughly balanced dataset)
- 22 attributes (features), all categorical
- 23 columns in total: the 22 features plus the class column
Preprocessing steps
- Handle missing values in the stalk-root feature by treating "missing" as its own category.
- Encode the categorical data using one-hot encoding.
- Use the chi-square test of independence for feature selection.
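The three preprocessing steps can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code: the helper names (`fill_missing`, `one_hot`, `chi_square`) and the toy rows are made up for the example. The UCI file marks missing stalk-root values with "?". The sketch ranks features by their chi-square statistic; the cutoff for keeping a feature is not specified here and is left open.

```python
from collections import Counter

MISSING = "?"  # the UCI file marks missing stalk-root values with "?"


def fill_missing(values, token="missing"):
    """Turn missing entries into their own category."""
    return [token if v == MISSING else v for v in values]


def one_hot(values):
    """Return (categories, rows) with one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


def chi_square(feature, target):
    """Pearson chi-square statistic of the feature-by-class contingency table."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for t, t_count in target_totals.items():
            # Expected count if feature and class were independent.
            expected = f_count * t_count / n
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat


# Hypothetical toy rows, not real records from the dataset.
labels = ["p", "p", "e", "e", "p", "e"]
stalk_root = ["b", "?", "e", "?", "b", "c"]
odor = ["f", "f", "n", "n", "f", "n"]       # lines up exactly with the class
cap_shape = ["x", "b", "x", "b", "x", "b"]  # only weakly related to the class

categories, encoded = one_hot(fill_missing(stalk_root))
scores = {
    "odor": chi_square(odor, labels),
    "cap_shape": chi_square(cap_shape, labels),
}
# Features with a larger statistic depend more strongly on the class.
ranked = sorted(scores, key=scores.get, reverse=True)
```

To turn the ranking into a selection, each statistic would be compared against a chi-square critical value with (r - 1)(c - 1) degrees of freedom, where r and c are the number of feature categories and classes.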
Step 2: Overfitting Mitigation
Stratified k-fold cross-validation was used. Training and evaluating the model once per fold takes longer than a single train/test split, but with only 8124 instances the extra cost should be manageable.
Why stratified? Each fold keeps the same percentage of samples for each class as the full dataset. It would not be good to have one fold with too many poisonous mushrooms, for example.
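To make the idea concrete, here is a small sketch of how stratified fold assignment can work: samples are grouped by class and dealt out to the folds in turn, so every fold ends up with about the same class mix. The function name `stratified_folds` and the toy labels are assumptions for illustration. A library version such as scikit-learn's `StratifiedKFold` would normally be used, and it can also shuffle, which this sketch does not.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Toy labels with the same 48% / 52% split as the dataset (hypothetical data).
labels = ["p"] * 48 + ["e"] * 52
folds = stratified_folds(labels, k=4)

# Share of poisonous samples in each fold; each should match the overall 48%.
poisonous_share = [
    sum(labels[i] == "p" for i in fold) / len(fold) for fold in folds
]
```

Each fold then serves once as the validation set while the remaining folds are used for training.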
Step 3: Train binary classifiers
- Logistic Regression
- Naive Bayes
- Support Vector Machine
The evaluation metrics will be:
- Accuracy: percentage of correct predictions
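- Precision: of the mushrooms predicted poisonous, the fraction that are actually poisonous (taking poisonous as the positive class)
- Recall: of the mushrooms that are actually poisonous, the fraction that were predicted poisonous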
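As a worked example, the three metrics can be computed from the confusion counts of any of the classifiers' predictions. The sketch assumes poisonous ("p") is the positive class, so precision is TP / (TP + FP) and recall is TP / (TP + FN). The function name `evaluate` and the toy predictions are hypothetical.

```python
def evaluate(y_true, y_pred, positive="p"):
    """Return accuracy, precision and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or labelled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labels: 4 poisonous and 6 edible mushrooms.
y_true = ["p", "p", "p", "p", "e", "e", "e", "e", "e", "e"]
# One poisonous mushroom is missed and one edible mushroom is flagged.
y_pred = ["p", "p", "p", "e", "p", "e", "e", "e", "e", "e"]

accuracy, precision, recall = evaluate(y_true, y_pred)
```

For this problem recall on the poisonous class matters most, since a missed poisonous mushroom (a false negative) is the costly mistake.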
Step 4: Conduct clustering analysis
The current plan is to use k-modes clustering, which is designed for categorical data such as the mushroom attributes.
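For reference, k-modes works like k-means but replaces Euclidean distance with a count of mismatched attributes and replaces the mean with the per-column mode. The sketch below is a bare-bones version under simple assumptions: the initial modes are passed in by hand, ties are broken by first occurrence, and the function names and toy rows are hypothetical. A maintained library implementation would be the practical choice for the real analysis.

```python
from collections import Counter


def mismatch(a, b):
    """Number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent value of each column, used as the cluster centre."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, initial_modes, max_iter=10):
    """Alternate between assigning rows to the nearest mode and updating modes."""
    modes = list(initial_modes)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(modes)), key=lambda j: mismatch(row, modes[j]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the modes will not change
        assignment = new_assignment
        for j in range(len(modes)):
            members = [row for row, a in zip(rows, assignment) if a == j]
            if members:
                modes[j] = column_modes(members)
    return assignment, modes


# Hypothetical records with three categorical attributes each.
rows = [
    ("x", "n", "w"),
    ("x", "n", "g"),
    ("x", "n", "w"),
    ("b", "f", "k"),
    ("b", "f", "k"),
    ("f", "f", "k"),
]
assignment, modes = k_modes(rows, initial_modes=[rows[0], rows[3]])
```

The resulting cluster assignments can then be compared with the class column to see whether the clusters separate poisonous from edible mushrooms.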