
Conversation

ShindeShivam (Contributor) commented on Jul 9, 2025

While going through the semi-supervised learning example in the unsupervised learning section, I initially got a slightly different accuracy when I manually labeled the representative digits based on what I actually saw in the images.

I initially thought it was an issue in my code, but to verify, I ran the original notebook as-is on Google Colab — and to my surprise, the model's accuracy was just 7%.

After digging into the code, I found the problem:

[Screenshot: the hardcoded y_representative_digits array in the notebook]

The hardcoded labels for the 50 representative digits (y_representative_digits) no longer match the current cluster centroids generated by KMeans. This is likely due to internal changes in the dataset order or scikit-learn's clustering behavior (like randomness in centroid initialization or data shuffling).

Because of this mismatch, the model was being trained on incorrect image-label pairs, leading to terrible accuracy.
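For context, here is roughly how the notebook selects its 50 representative digits. This is a minimal, self-contained sketch, not the notebook's exact code: load_digits and the 1,400-image split are stand-ins for the chapter's actual data loading, but the selection logic (one image closest to each KMeans centroid) is the part that matters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Stand-in for the chapter's dataset setup (assumption, not the notebook's exact code).
X_digits, y_digits = load_digits(return_X_y=True)
X_train, y_train = X_digits[:1400], y_digits[:1400]

k = 50
kmeans = KMeans(n_clusters=k, random_state=42)
X_digits_dist = kmeans.fit_transform(X_train)                # distance of each image to each centroid
representative_digit_idx = np.argmin(X_digits_dist, axis=0)  # index of the image closest to each centroid
X_representative_digits = X_train[representative_digit_idx]

# The notebook then hardcodes y_representative_digits after eyeballing these
# 50 images. If KMeans converges to different centroids (different
# scikit-learn version, initialization randomness, or dataset order), the
# hardcoded labels no longer describe the images actually selected here.
```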

Fix:

Replaced the outdated y_representative_digits with correct labels, manually reassigned by inspecting the representative images nearest each centroid.
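This is not the fix in this PR (the point of the example is labeling the 50 images by hand), but a quick way to confirm the mismatch theory: pull the ground-truth labels for the same 50 indices and retrain. If accuracy recovers, the hardcoded array really was stale. The variable names and the 1,400-image split follow the sketch above and are assumptions, not the notebook's exact code.

```python
from sklearn.linear_model import LogisticRegression

# Ground-truth labels of the 50 selected images (uses the variables from the
# sketch above). This bypasses manual labeling purely as a sanity check.
y_representative_digits = y_train[representative_digit_idx]

log_reg = LogisticRegression(max_iter=10_000)
log_reg.fit(X_representative_digits, y_representative_digits)

# Accuracy on the held-out digits; it should be far above 7% if the only
# problem was a stale hardcoded label array.
print(log_reg.score(X_digits[1400:], y_digits[1400:]))
```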

Note:

I also have an earlier PR open, #196. Kindly review that one as well.

ageron (Owner) commented on Aug 10, 2025

Thanks for your feedback. I can't test this right now because openml.org seems to be down (I'm getting a 404 error when downloading MNIST). I'll try again asap.

ShindeShivam changed the title from "Fix wrong manual labels for KMeans representative digits (was causing ~7% accuracy) semi-supervised learning example" to "Fix manual labels for KMeans representative digits (was causing ~7% accuracy) semi-supervised learning example" on Aug 16, 2025.