I have curently been intepreting and acting on the advice that Dr. Asha Gowda gave me in an interview I had with her. Details of our interview can be accessed here.

Instead of working on the k-means algorithm like originally planned, I am now working running the k-modes algorithm. The k-modes algorithm will not generate a graph as previously proposed in my experimental design, but will display centroids that can give me an insight on how my binary classifiers are classifying mushrooms in my dataset.

First, I learned k-modes was not part of scikitlearn and needed to be installed from Python’s pip like so: alt-text-1

Then, I ran the k-modes algorithm on the dataset. I began by running it without the Elbow Method and choosing an arbitrary number of clusters. alt-text-1

  • verbose means the output will describe the procecesses more.
  • n_init means the number of times the k-modes algorithm will be run with different centroid seeds.
  • Huang means the kmodes method described by Huang in 1997 and 1998 will be used.
  • n_clusters means the number of clusters to form.

The verbose output was condensed into the following results:

  • Run 1 cost: 49750.0
  • Run 2 cost: 54324.0
  • Run 3 cost: 56293.0
  • Run 4 cost: 49750.0
  • Run 5 cost: 50146.0

I printed out the centroids generated from the clustering algorithm trained on the dataset. alt-text-1

The best run was number 1, whose four cluster centroids were the following: alt-text-1 Specifics on the mushrooms these centroids represent can be accessed here.

These centroids are an average observation for their respective clusters. Therefore, the two centroids labelled as poisonous show, on average, what groups of traits likely result in a poisonous mushroom.

Likewise, the two centroids labelled as edible , on average, what groups of traits likely result in an edible mushroom.

Conclusion/ TL;DR: A mushroom with attributes that most closely match centroids 1 and 2 is likely to be classified as poisonous by my model, and a mushroom with attributes that most closely match centroids 3 and 4 is likely to be classified as edible by my model.

Note that these results are just ballparks and not exact!

I am currently working on seeing if I could use the Elbow Method to run k-modes clustering on the optimal number of clusters.