pyclustertend is a Python package specialized in cluster tendency assessment. Assessing cluster tendency consists of determining whether clustering algorithms are relevant for a dataset.
Three methods for assessing cluster tendency are currently implemented, plus one additional method based on metrics obtained with a KMeans estimator:
- Hopkins statistic [1]
- VAT [2]
- iVAT [3]
- Metric-based method (Silhouette [4], Calinski-Harabasz [5], Davies-Bouldin [6])
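
The idea behind the metric-based method in the last item can be sketched with scikit-learn alone. This is a minimal illustration, not pyclustertend's own implementation: it fits KMeans for a few cluster counts and computes the three listed scores. The range of cluster counts and the `n_init` and `random_state` values are assumptions chosen for the example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import scale

# Same scaled iris data as in the other examples.
X = scale(datasets.load_iris().data)

scores = {}
for k in range(2, 7):  # candidate numbers of clusters (arbitrary choice)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {
        # Silhouette lies in [-1, 1]; higher means better-separated clusters.
        "silhouette": silhouette_score(X, labels),
        # Calinski-Harabasz: higher is better.
        "calinski_harabasz": calinski_harabasz_score(X, labels),
        # Davies-Bouldin: lower is better.
        "davies_bouldin": davies_bouldin_score(X, labels),
    }

for k, s in scores.items():
    print(k, {name: round(float(v), 3) for name, v in s.items()})
```

Scores that stay poor for every cluster count suggest that the data has little cluster structure for KMeans to find.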
The package is available on PyPI and can be installed using pip:

pip install pyclustertend

Example with the Hopkins statistic:

>>> from sklearn import datasets
>>> from pyclustertend import hopkins
>>> from sklearn.preprocessing import scale
>>> X = scale(datasets.load_iris().data)
>>> hopkins(X, 150)
0.18950453452838564

Example with VAT:

>>> from sklearn import datasets
>>> from pyclustertend import vat
>>> from sklearn.preprocessing import scale
>>> X = scale(datasets.load_iris().data)
>>> vat(X)

Example with iVAT:

>>> from sklearn import datasets
>>> from pyclustertend import ivat
>>> from sklearn.preprocessing import scale
>>> X = scale(datasets.load_iris().data)
>>> ivat(X)

Scale the data before using hopkins, vat, or ivat, because these algorithms rely on distances between observations. In addition, vat and ivat are not well suited to very large datasets. A first workaround is to sample the data before applying them.
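
The sampling workaround mentioned above could look like the following sketch. The helper `sample_row_indices` is hypothetical (it is not part of pyclustertend), and the sample size of 500 and the toy data are arbitrary choices for illustration.

```python
import random


def sample_row_indices(n_rows, n_samples, seed=0):
    """Return sorted indices of a uniform random sample of rows, drawn without replacement."""
    if n_samples >= n_rows:
        # Nothing to reduce: keep every row.
        return list(range(n_rows))
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return sorted(rng.sample(range(n_rows), n_samples))


# Toy stand-in for a large dataset: 10,000 rows with 2 features each.
rows = [[float(i), float(i % 7)] for i in range(10_000)]

idx = sample_row_indices(len(rows), 500, seed=42)
subset = [rows[i] for i in idx]
print(len(subset))  # 500

# With a NumPy array X, the same indices select rows via X[idx]. The reduced
# array can then be scaled and passed to vat or ivat.
```

Uniform sampling keeps the example simple, but very small clusters can be missed in the sample, so the sample size is worth checking against the expected cluster sizes.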
Footnotes
1. Hopkins, Brian; Skellam, J.G. (1954). "A new method for determining the type of distribution of plant individuals". Annals of Botany. 18 (2). Annals Botany Co: 213–227. doi:10.1093/oxfordjournals.aob.a083391.
2. Bezdek, James C.; Hathaway, Richard J. (2002). "VAT: A Tool for Visual Assessment of (Cluster) Tendency". Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN '02. IEEE Computer Society. pp. 2225–2230. doi:10.1109/IJCNN.2002.1007487.
3. Wang, L.; Nguyen, U. T.; Bezdek, J. C.; Leckie, C. A.; Ramamohanarao, K. (2010). "iVAT and aVAT: enhanced visual analysis for cluster tendency assessment". In Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 16–27. doi:10.1007/978-3-642-13657-3_5.
4. Rousseeuw, Peter J. (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Journal of Computational and Applied Mathematics. 20: 53–65. doi:10.1016/0377-0427(87)90125-7.
5. Caliński, Tadeusz; Harabasz, Jerzy (1974). "A dendrite method for cluster analysis". Communications in Statistics. 3 (1): 1–27. doi:10.1080/03610927408827101.
6. Davies, David L.; Bouldin, Donald W. (1979). "A Cluster Separation Measure". IEEE Transactions on Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224–227. doi:10.1109/TPAMI.1979.4766909.

