The code for my research project is now complete up through classification of mushroom instances and overfitting mitigation.

Here are the key components and outputs of my code:

Encoding categorical data:

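The encoding scheme is not stated here, so the following is only a minimal sketch of two common options, label encoding and one-hot encoding, written in plain Python. The `cap_color` values and both helper functions are hypothetical and are not taken from the project code.

```python
def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Turn each value into a 0/1 row with one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories


# Hypothetical categorical column, for illustration only.
cap_color = ["red", "brown", "red", "white"]

codes, mapping = label_encode(cap_color)
one_hot, columns = one_hot_encode(cap_color)

print(mapping)  # integer code assigned to each category
print(codes)    # the column as integer codes
print(columns)  # column order of the one-hot rows
print(one_hot)  # the column as one-hot rows
```

Label encoding keeps one column per feature but implies an order among categories. One-hot encoding avoids that implied order at the cost of more columns.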

Feature Selection:

  • Null Hypothesis: the feature and the dependent variable (DV) are independent
  • Alternate Hypothesis: the feature and the DV are not independent
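These hypotheses match a chi-square test of independence, although the test is not named here, so treat that as an assumption. The sketch below computes the chi-square statistic by hand for one hypothetical feature (`odor`) against hypothetical class labels. The 3.841 cutoff is the standard critical value for 1 degree of freedom at the 0.05 significance level.

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Return (chi-square statistic, degrees of freedom) for two categorical lists."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)

    chi2 = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            # Expected count in this cell if feature and target were independent.
            expected = f_count * t_count / n
            chi2 += (observed[(f_value, t_value)] - expected) ** 2 / expected

    dof = (len(feature_totals) - 1) * (len(target_totals) - 1)
    return chi2, dof


# Hypothetical data: odor perfectly separates the two classes.
odor = ["foul"] * 4 + ["none"] * 4
label = ["toxic"] * 4 + ["edible"] * 4

chi2, dof = chi_square_statistic(odor, label)

CRITICAL_VALUE_DOF1_ALPHA_05 = 3.841  # chi-square critical value, dof = 1, alpha = 0.05
reject_null = dof == 1 and chi2 > CRITICAL_VALUE_DOF1_ALPHA_05

print(chi2, dof, reject_null)
```

A feature for which the null hypothesis is rejected is a candidate to keep. In practice, a library routine such as `scipy.stats.chi2_contingency` would also return the p-value for tables of any size.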

Define Evaluation Metrics:

  • Accuracy: percentage of correct predictions
  • Precision: of the mushrooms predicted toxic, the fraction that are actually toxic
  • Recall: of the mushrooms that are actually toxic, the fraction that were predicted toxic
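As a small worked example, the sketch below computes all three metrics from raw counts, assuming toxic is treated as the positive class. Under that assumption, precision is TP / (TP + FP) and recall is TP / (TP + FN). The labels and predictions are made up for illustration.

```python
def evaluate(y_true, y_pred, positive="toxic"):
    """Return (accuracy, precision, recall) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical ground truth and predictions.
y_true = ["toxic", "toxic", "toxic", "edible", "edible", "edible", "edible", "toxic"]
y_pred = ["toxic", "toxic", "edible", "edible", "edible", "toxic", "toxic", "toxic"]

accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)
```

For this dataset recall is arguably the metric to watch, since a false negative means a toxic mushroom was labeled edible.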

Overfitting Mitigation: Stratified k-fold cross-validation was used. Because the model is trained and evaluated once per fold, it takes longer to run, but since this dataset is not very large, the extra cost should be fine.

Why stratified? Each fold retains roughly the same percentage of samples from each class as the full dataset. It would not be good to have one fold with too many toxic mushrooms, for example.
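To make the idea concrete, here is a minimal library-free sketch of stratified splitting. It deals the indices of each class round-robin into k folds, so every fold keeps roughly the class proportions of the full dataset. The function name and the labels are hypothetical, and there is no shuffling. scikit-learn's `StratifiedKFold` is the usual library implementation, but whether the project uses it is not stated here.

```python
from collections import Counter, defaultdict


def stratified_k_fold(labels, k):
    """Split sample indices into k folds that preserve class proportions."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in indices_by_class.values():
        # Deal this class's samples round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical labels: 6 edible and 3 toxic samples (a 2:1 ratio).
labels = ["edible"] * 6 + ["toxic"] * 3
folds = stratified_k_fold(labels, k=3)

fold_counts = []
for i, test_indices in enumerate(folds):
    # Fold i is the test set; the remaining folds form the training set.
    train_indices = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    counts = Counter(labels[idx] for idx in test_indices)
    fold_counts.append(dict(counts))
    print(i, sorted(test_indices), len(train_indices), dict(counts))
```

Each test fold ends up with the same 2:1 edible-to-toxic ratio as the full label list, which is the property a plain sequential split would not guarantee.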

Logistic Regression classifier:

Naive Bayes classifier:

Support Vector Machine classifier:

I still have to write code for the clustering analysis on my dataset.