Conversation

@MeKaustubh07
Contributor

K-Means Clustering Algorithm – Summary Review

Overview

K-Means is a centroid-based clustering algorithm that partitions n data points into k clusters, each represented by the mean (centroid) of its assigned points. It minimizes within-cluster variance and is known for its simplicity and speed.
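Formally, with clusters C₁ … C_k and centroids μ₁ … μ_k, the quantity being minimized is the within-cluster sum of squares (the inertia listed under Quality Metrics below):

$$
J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
$$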


Algorithm Summary

  1. Initialize: Choose k centroids (random or k-means++)
  2. Assign: Each point → nearest centroid
  3. Update: Recalculate centroids as mean of assigned points
  4. Repeat: Until centroids stabilize or max iterations reached
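
A minimal base-R sketch of this loop (Lloyd's algorithm) is given below; the function and variable names are illustrative and empty-cluster handling is omitted, so this is not the PR's actual implementation:

```r
kmeans_lloyd <- function(X, k, max_iter = 100, tol = 1e-4) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]     # 1. Initialize: k random rows
  labels <- integer(nrow(X))
  for (iter in seq_len(max_iter)) {
    # 2. Assign: squared Euclidean distance from every point to every centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
    labels <- max.col(-d2)                               # nearest centroid per point
    # 3. Update: each centroid becomes the mean of its assigned points
    new_centroids <- t(sapply(seq_len(k), function(j)
      colMeans(X[labels == j, , drop = FALSE])))
    # 4. Repeat until the centroids barely move or max_iter is reached
    if (sum((new_centroids - centroids)^2) < tol) { centroids <- new_centroids; break }
    centroids <- new_centroids
  }
  list(centroids = centroids, labels = labels)
}
```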

Key Traits:

  • Uses centroids (means)
  • Minimizes intra-cluster distance
  • Iterative refinement until convergence

Complexity

  • Time: O(n × k × d × i)
  • Space: O(n × d + k × d)
  • Scalability: Excellent for large datasets

Initialization Methods

  • K-Means++ (Recommended): Spreads initial centroids apart for faster, more stable convergence; initialization cost O(n × k × d)
  • Random: Simpler but less stable
  • Custom: User-defined centroids
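
For reference, a compact sketch of k-means++ seeding, assuming `X` is a numeric matrix (illustrative only, not the PR's code):

```r
kmeanspp_init <- function(X, k) {
  X <- as.matrix(X)
  n <- nrow(X)
  centroids <- X[sample(n, 1), , drop = FALSE]   # first centroid chosen uniformly at random
  for (j in seq_len(k - 1)) {
    # squared distance from every point to its nearest already-chosen centroid
    d2 <- apply(X, 1, function(x) min(colSums((t(centroids) - x)^2)))
    # sample the next centroid with probability proportional to that distance
    centroids <- rbind(centroids, X[sample(n, 1, prob = d2), , drop = FALSE])
  }
  centroids
}
```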

Quality Metrics

| Metric | Formula / Concept | Ideal Value |
| --- | --- | --- |
| Inertia | Σᵢ Σ_{x ∈ Cᵢ} ‖x − μᵢ‖² | Lower = better |
| Silhouette Score | (b(i) − a(i)) / max(a(i), b(i)) | Close to 1 |
| Davies-Bouldin Index | Average cluster similarity ratio | Lower = better |
| Calinski-Harabasz Index | Between- / within-cluster variance ratio | Higher = better |
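
These metrics are easy to cross-check against stock R tooling (stats::kmeans plus the cluster package, assumed installed); the iris data below is only an example:

```r
library(cluster)
X <- scale(iris[, 1:4])                  # standardized numeric features
fit <- kmeans(X, centers = 3, nstart = 10)
fit$tot.withinss                         # inertia: lower is better
sil <- silhouette(fit$cluster, dist(X))
mean(sil[, "sil_width"])                 # mean silhouette: closer to 1 is better
```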

Key Methods

| Method | Description |
| --- | --- |
| fit(X) | Train the model |
| predict(X) | Assign clusters |
| fit_predict(X) | Train + predict |
| transform(X) | Distances to centroids |
| silhouette_score() | Cluster quality |
| get_centroids() | Retrieve final centroids |
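
A hypothetical call sequence for these methods is sketched below; the class name `KMeans$new` and its constructor arguments are assumptions rather than the PR's actual interface:

```r
model <- KMeans$new(k = 3, init = "kmeans++", max_iter = 300, tol = 1e-4)  # assumed constructor
model$fit(X_scaled)                   # train on standardized data
labels <- model$predict(X_scaled)     # assign each point to a cluster
centers <- model$get_centroids()      # retrieve the final centroids
quality <- model$silhouette_score()   # overall cluster quality
```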

Advantages ✅

  1. Fast & Scalable — Efficient for large datasets
  2. Simple & Intuitive — Easy to interpret geometrically
  3. Guaranteed Convergence — Always reaches a local minimum
  4. Flexible — Works in any number of dimensions
  5. Memory Efficient — Linear space complexity

Disadvantages ❌

  1. Requires predefined k
  2. Sensitive to initialization (mitigated by k-means++)
  3. Assumes spherical clusters
  4. Sensitive to outliers
  5. Can converge to local minima
  6. Scale-dependent — Requires feature standardization

Best Practices

  • Standardize data: X_scaled <- scale(X)
  • Find optimal k: Use elbow or silhouette method
  • Use k-means++: Better initialization
  • Set tolerance: tol = 1e-4, limit iterations with max_iter = 300
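
For example, the elbow and silhouette checks can be run with stock R, assuming `X` is a numeric matrix and the cluster package is installed:

```r
library(cluster)
X_scaled <- scale(X)
ks <- 2:10
wss <- sapply(ks, function(k) kmeans(X_scaled, centers = k, nstart = 10)$tot.withinss)
sil <- sapply(ks, function(k) {
  cl <- kmeans(X_scaled, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, dist(X_scaled))[, "sil_width"])
})
plot(ks, wss, type = "b", xlab = "k", ylab = "Within-cluster SS")  # look for the elbow
ks[which.max(sil)]                                                 # k with the best mean silhouette
```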

Use Cases

✅ Customer segmentation
✅ Image compression (color quantization)
✅ Document clustering
✅ Anomaly detection
✅ Exploratory data analysis

❌ Avoid for: non-spherical clusters, mixed data, or many outliers


Performance Tips

  • Large datasets: Use Mini-Batch K-Means for faster results
  • High dimensions: Apply PCA or feature selection
  • Early stopping: Use relaxed tolerance (e.g., tol = 1e-3)
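
For the high-dimensional case, a base-R PCA step before clustering might look like this; the number of retained components (10) is an arbitrary illustrative choice:

```r
pcs <- prcomp(X_scaled)                    # X_scaled: standardized numeric matrix
X_reduced <- pcs$x[, 1:10]                 # keep the first 10 principal components (illustrative)
fit <- kmeans(X_reduced, centers = 3, nstart = 10)
```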

Comparison with Other Algorithms

| Algorithm | Time | Cluster Shape | Outlier Robustness | k Required |
| --- | --- | --- | --- | --- |
| K-Means | O(nkdi) | Spherical | Low | Yes |
| K-Medoids | O(k(n − k)² i) | Any | High | Yes |
| DBSCAN | O(n log n) | Arbitrary | High | No |
| GMM | O(nkdi) | Elliptical | Medium | Yes |
| Hierarchical | O(n³) | Any | Medium | No |

When to Use

✅ Large datasets
✅ Well-separated spherical clusters
✅ Continuous numeric data
✅ Need quick, interpretable results

When to Avoid

❌ Unknown k
❌ Non-spherical or overlapping clusters
❌ Heavy noise or outliers
❌ Categorical or mixed data


Verdict ⭐⭐⭐⭐☆ (4/5)

K-Means is a fast, scalable, and easy-to-use clustering algorithm ideal for large datasets and exploratory tasks.
However, it’s sensitive to initialization, scale, and outliers.

Strengths: Speed, simplicity, scalability
Weaknesses: Sensitive to initialization/outliers
Best For: Customer segmentation, image compression, initial data exploration

Copilot AI review requested due to automatic review settings October 18, 2025 17:02

Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive implementation of the K-Means clustering algorithm in R, including k-means++ initialization, multiple quality metrics, and extensive examples.

Key changes:

  • Complete K-Means clustering implementation with R6 class structure
  • Support for multiple initialization methods (random, k-means++, custom)
  • Four clustering quality metrics (silhouette, Davies-Bouldin, Calinski-Harabasz, inertia)
  • Comprehensive examples demonstrating usage patterns and best practices

Copilot AI review requested due to automatic review settings October 20, 2025 06:20

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings October 20, 2025 06:21

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 6 comments.

Member

@siriak left a comment

How is it different from the already implemented versions?
