Yesterday, I held a Zoom meeting with Dr. Maitreyee Dutta, head of the computer science department at the National Institute of Technical Teachers Training and Research (NITTTR) in Chandigarh, India.

Her research paper covers the classification of mushrooms using machine learning, specifically the adaptive neuro-fuzzy inference system (ANFIS) algorithm.

Important points discussed were:

Missing values:

I first described to her how I used the SimpleImputer class to impute missing values using the median strategy. I told her I had 2480 missing values, and because there were so many, she said it would be better to make "missing" its own category within the stalk root feature. Her reasoning was that imputation would skew the data and introduce unwanted inaccuracy.
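A minimal sketch of what treating missing values as their own category could look like with pandas. The column name `stalk-root`, the label `"missing"`, and the toy values are assumptions for illustration, not the real dataset:

```python
import numpy as np
import pandas as pd

# Toy data: the column name and category codes are illustrative assumptions.
df = pd.DataFrame({"stalk-root": ["b", np.nan, "e", np.nan, "c", "b"]})

# Instead of imputing, give missing entries an explicit category label.
df["stalk-root"] = df["stalk-root"].fillna("missing")

# "missing" now shows up as an ordinary category alongside the others.
category_counts = df["stalk-root"].value_counts()
print(category_counts)
```

After this step, the feature can be encoded like any other categorical column, and the model can learn whether "missing" itself carries information.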

Feature selection:

She said I should conduct feature selection using the chi-squared test of independence. She also noted that this test can be run in Python using the scipy library.

The chi-squared test checks whether the relationship between each feature and toxicity status is statistically significant. It is run separately for each feature against the dependent variable vector, and she said I should use a for-loop to do this.

Then, she said I should use an if-else statement with an alpha level of 0.05 to decide whether to reject the null hypothesis.

The feature selection works as follows: the null hypothesis states that the feature and the toxicity status are independent. If you reject it, that feature is significant. Otherwise, that feature is not significant, and you can simply drop it.
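The procedure described above can be sketched roughly as follows, using `pandas.crosstab` to build each contingency table and `scipy.stats.chi2_contingency` to run the test. The helper name `select_features`, the column names, and the toy data are hypothetical; this is one possible way to write it, not necessarily what Dr. Dutta had in mind:

```python
import pandas as pd
from scipy.stats import chi2_contingency

ALPHA = 0.05  # significance level mentioned in the meeting


def select_features(df, target, alpha=ALPHA):
    """Return the features whose chi-squared test rejects independence from `target`."""
    kept = []
    for feature in df.columns.drop(target):
        # Contingency table of feature categories versus target classes.
        table = pd.crosstab(df[feature], df[target])
        chi2, p_value, dof, expected = chi2_contingency(table)
        if p_value < alpha:
            # Reject the null hypothesis of independence: keep the feature.
            kept.append(feature)
        else:
            # Fail to reject: the feature is a candidate for dropping.
            print(f"dropping {feature} (p = {p_value:.3f})")
    return kept


# Toy data (assumed): "odor" tracks the class exactly, "cap" is unrelated to it.
toy = pd.DataFrame({
    "odor": ["n"] * 20 + ["f"] * 20,
    "cap": ["x", "y"] * 20,
    "class": ["e"] * 20 + ["p"] * 20,
})

selected = select_features(toy, target="class")
print(selected)
```

On the real data, the returned list could then be used to subset the feature columns before training.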

Overfitting:

She stated that I should use k-fold cross-validation to tackle overfitting. The model will take longer to run, but since this dataset is not especially large, she said it should be fine.

She specifically recommended a stratified k-fold cross-validator, because it preserves the percentage of samples for each class in every split.
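To illustrate what "retaining the percentage of samples for each class" means, here is a small sketch using scikit-learn's `StratifiedKFold`. The toy labels (75% class 0, 25% class 1), the number of folds, and the random seed are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (assumed): 20 samples, 15 of class 0 and 5 of class 1.
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_positive_rates = []
for train_idx, test_idx in skf.split(X, y):
    # Share of class 1 in this test fold; stratification keeps it near the overall 25%.
    fold_positive_rates.append(y[test_idx].mean())

print(fold_positive_rates)
```

The same splitter object can be passed as the `cv` argument to scikit-learn helpers such as `cross_val_score` when evaluating a model.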

My meeting with her can be accessed here.