Conversation

@MeKaustubh07
Contributor

K-Means Clustering Algorithm – Summary Review

Overview

K-Means is a centroid-based clustering algorithm that partitions n data points into k clusters, each represented by the mean (centroid) of its assigned points. It minimizes within-cluster variance and is known for its simplicity and speed.
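Formally, with clusters C₁ … C_k and centroids μ₁ … μ_k, the quantity being minimized is the within-cluster sum of squares (the inertia listed under Quality Metrics below):

$$
J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
$$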


Algorithm Summary

  1. Initialize: Choose k centroids (random or k-means++)
  2. Assign: Each point → nearest centroid
  3. Update: Recalculate centroids as mean of assigned points
  4. Repeat: Until centroids stabilize or max iterations reached
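
A minimal base-R sketch of this loop (Lloyd's algorithm) is given below; the function and variable names are illustrative and empty-cluster handling is omitted, so this is not the PR's actual implementation:

```r
kmeans_lloyd <- function(X, k, max_iter = 100, tol = 1e-4) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]     # 1. Initialize: k random rows
  labels <- integer(nrow(X))
  for (iter in seq_len(max_iter)) {
    # 2. Assign: squared Euclidean distance from every point to every centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
    labels <- max.col(-d2)                               # nearest centroid per point
    # 3. Update: each centroid becomes the mean of its assigned points
    new_centroids <- t(sapply(seq_len(k), function(j)
      colMeans(X[labels == j, , drop = FALSE])))
    # 4. Repeat until the centroids barely move or max_iter is reached
    if (sum((new_centroids - centroids)^2) < tol) { centroids <- new_centroids; break }
    centroids <- new_centroids
  }
  list(centroids = centroids, labels = labels)
}
```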

Key Traits:

  • Uses centroids (means)
  • Minimizes intra-cluster distance
  • Iterative refinement until convergence

Complexity

  • Time: O(n × k × d × i)
  • Space: O(n × d + k × d)
  • Scalability: Excellent for large datasets

Initialization Methods

  • K-Means++ (Recommended): Spreads initial centroids apart for faster, more stable convergence; initialization cost O(n × k × d)
  • Random: Simpler but less stable
  • Custom: User-defined centroids
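
For reference, a compact sketch of k-means++ seeding, assuming `X` is a numeric matrix (illustrative only, not the PR's code):

```r
kmeanspp_init <- function(X, k) {
  X <- as.matrix(X)
  n <- nrow(X)
  centroids <- X[sample(n, 1), , drop = FALSE]   # first centroid chosen uniformly at random
  for (j in seq_len(k - 1)) {
    # squared distance from every point to its nearest already-chosen centroid
    d2 <- apply(X, 1, function(x) min(colSums((t(centroids) - x)^2)))
    # sample the next centroid with probability proportional to that distance
    centroids <- rbind(centroids, X[sample(n, 1, prob = d2), , drop = FALSE])
  }
  centroids
}
```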

Quality Metrics

| Metric | Formula / Concept | Ideal Value |
| --- | --- | --- |
| Inertia | Σᵢ Σ_{x ∈ Cᵢ} ‖x − μᵢ‖² | Lower = better |
| Silhouette Score | (b(i) − a(i)) / max(a(i), b(i)) | Close to 1 |
| Davies-Bouldin Index | Average cluster similarity ratio | Lower = better |
| Calinski-Harabasz Index | Between- / within-cluster variance ratio | Higher = better |
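
These metrics are easy to cross-check against stock R tooling (stats::kmeans plus the cluster package, assumed installed); the iris data below is only an example:

```r
library(cluster)
X <- scale(iris[, 1:4])                  # standardized numeric features
fit <- kmeans(X, centers = 3, nstart = 10)
fit$tot.withinss                         # inertia: lower is better
sil <- silhouette(fit$cluster, dist(X))
mean(sil[, "sil_width"])                 # mean silhouette: closer to 1 is better
```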

Key Methods

| Method | Description |
| --- | --- |
| fit(X) | Train the model |
| predict(X) | Assign clusters |
| fit_predict(X) | Train + predict |
| transform(X) | Distances to centroids |
| silhouette_score() | Cluster quality |
| get_centroids() | Retrieve final centroids |
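
A hypothetical call sequence for these methods is sketched below; the class name `KMeans$new` and its constructor arguments are assumptions rather than the PR's actual interface:

```r
model <- KMeans$new(k = 3, init = "kmeans++", max_iter = 300, tol = 1e-4)  # assumed constructor
model$fit(X_scaled)                   # train on standardized data
labels <- model$predict(X_scaled)     # assign each point to a cluster
centers <- model$get_centroids()      # retrieve the final centroids
quality <- model$silhouette_score()   # overall cluster quality
```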

Advantages ✅

  1. Fast & Scalable — Efficient for large datasets
  2. Simple & Intuitive — Easy to interpret geometrically
  3. Guaranteed Convergence — Always reaches a local minimum
  4. Flexible — Works in any number of dimensions
  5. Memory Efficient — Linear space complexity

Disadvantages ❌

  1. Requires predefined k
  2. Sensitive to initialization (mitigated by k-means++)
  3. Assumes spherical clusters
  4. Sensitive to outliers
  5. Can converge to local minima
  6. Scale-dependent — Requires feature standardization

Best Practices

  • Standardize data: X_scaled <- scale(X)
  • Find optimal k: Use elbow or silhouette method
  • Use k-means++: Better initialization
  • Set tolerance: tol = 1e-4, limit iterations with max_iter = 300
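
For example, the elbow and silhouette checks can be run with stock R, assuming `X` is a numeric matrix and the cluster package is installed:

```r
library(cluster)
X_scaled <- scale(X)
ks <- 2:10
wss <- sapply(ks, function(k) kmeans(X_scaled, centers = k, nstart = 10)$tot.withinss)
sil <- sapply(ks, function(k) {
  cl <- kmeans(X_scaled, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, dist(X_scaled))[, "sil_width"])
})
plot(ks, wss, type = "b", xlab = "k", ylab = "Within-cluster SS")  # look for the elbow
ks[which.max(sil)]                                                 # k with the best mean silhouette
```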

Use Cases

✅ Customer segmentation
✅ Image compression (color quantization)
✅ Document clustering
✅ Anomaly detection
✅ Exploratory data analysis

❌ Avoid for: non-spherical clusters, mixed data, or many outliers


Performance Tips

  • Large datasets: Use Mini-Batch K-Means for faster results
  • High dimensions: Apply PCA or feature selection
  • Early stopping: Use relaxed tolerance (e.g., tol = 1e-3)
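
For the high-dimensional case, a base-R PCA step before clustering might look like this; the number of retained components (10) is an arbitrary illustrative choice:

```r
pcs <- prcomp(X_scaled)                    # X_scaled: standardized numeric matrix
X_reduced <- pcs$x[, 1:10]                 # keep the first 10 principal components (illustrative)
fit <- kmeans(X_reduced, centers = 3, nstart = 10)
```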

Comparison with Other Algorithms

| Algorithm | Time | Cluster Shape | Outlier Robustness | k Required |
| --- | --- | --- | --- | --- |
| K-Means | O(nkdi) | Spherical | Low | Yes |
| K-Medoids | O(k(n − k)² i) | Any | High | Yes |
| DBSCAN | O(n log n) | Arbitrary | High | No |
| GMM | O(nkdi) | Elliptical | Medium | Yes |
| Hierarchical | O(n³) | Any | Medium | No |

When to Use

✅ Large datasets
✅ Well-separated spherical clusters
✅ Continuous numeric data
✅ Need quick, interpretable results

When to Avoid

❌ Unknown k
❌ Non-spherical or overlapping clusters
❌ Heavy noise or outliers
❌ Categorical or mixed data


Verdict ⭐⭐⭐⭐☆ (4/5)

K-Means is a fast, scalable, and easy-to-use clustering algorithm ideal for large datasets and exploratory tasks.
However, it’s sensitive to initialization, scale, and outliers.

Strengths: Speed, simplicity, scalability
Weaknesses: Sensitive to initialization/outliers
Best For: Customer segmentation, image compression, initial data exploration

Copilot AI review requested due to automatic review settings October 18, 2025 17:02

Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive implementation of the K-Means clustering algorithm in R, including k-means++ initialization, multiple quality metrics, and extensive examples.

Key changes:

  • Complete K-Means clustering implementation with R6 class structure
  • Support for multiple initialization methods (random, k-means++, custom)
  • Four clustering quality metrics (silhouette, Davies-Bouldin, Calinski-Harabasz, inertia)
  • Comprehensive examples demonstrating usage patterns and best practices

Copilot AI review requested due to automatic review settings October 20, 2025 06:20

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings October 20, 2025 06:21

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 6 comments.

Member

@siriak left a comment

How is it different from the already implemented versions?
