lachhebo/pyclustertend

pyclustertend

pyclustertend is a Python package specialized in cluster tendency. Cluster tendency assessment consists of determining whether clustering algorithms are relevant for a dataset.

Three methods for assessing cluster tendency are currently implemented, along with one additional method based on metrics obtained with a KMeans estimator:

  • Hopkins statistic [1]
  • VAT [2]
  • iVAT [3]
  • Metric-based method (Silhouette [4], Calinski-Harabasz [5], Davies-Bouldin [6])
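
As a rough illustration of the metric-based idea, the sketch below fits a KMeans estimator for several cluster counts and records the silhouette score of each fit. It calls scikit-learn directly and is not pyclustertend's own function; the helper name `silhouette_by_k`, the range of cluster counts, and the `random_state` value are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import scale


def silhouette_by_k(X, k_values=range(2, 7), random_state=0):
    """Hypothetical helper: silhouette score of a KMeans fit for each k."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


X = scale(datasets.load_iris().data)
scores = silhouette_by_k(X)
# The k with the highest silhouette score is the most plausible cluster count.
best_k = max(scores, key=scores.get)
```

If no value of k yields a clearly positive score, that can be read as a hint that the data has little cluster structure.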

Installation

The package is available on PyPI and can be installed using pip:

pip install pyclustertend

Usage

Example Hopkins

    >>> from sklearn import datasets
    >>> from pyclustertend import hopkins
    >>> from sklearn.preprocessing import scale
    >>> X = scale(datasets.load_iris().data)
    >>> hopkins(X, 150)
    0.18950453452838564
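
To give an idea of what such a score measures, here is a minimal NumPy sketch of one common formulation of the Hopkins statistic. It is an illustration under stated assumptions, not the package's implementation: it assumes the convention in which values near 0 indicate clustered data and values near 0.5 indicate uniformly spread data, and the function name `hopkins_sketch` and its `seed` parameter are hypothetical.

```python
import numpy as np


def hopkins_sketch(X, sample_size, seed=0):
    """Illustrative Hopkins statistic (assumed convention: near 0 = clustered)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    # w: distance from each sampled real point to its nearest *other* real point.
    idx = rng.choice(n, size=sample_size, replace=False)
    dist_real = np.linalg.norm(X[idx, None, :] - X[None, :, :], axis=2)
    dist_real[np.arange(sample_size), idx] = np.inf  # ignore self-distance
    w = dist_real.min(axis=1)

    # u: distance from uniform random points (in the bounding box of X)
    # to their nearest real point.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(sample_size, d))
    u = np.linalg.norm(uniform[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    return w.sum() / (u.sum() + w.sum())


# Two tight, well-separated blobs: the statistic should be close to 0.
rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
h = hopkins_sketch(clustered, 30)
```

Because the statistic relies on random sampling, repeated calls without a fixed seed return slightly different values.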

Example VAT

    >>> from sklearn import datasets
    >>> from pyclustertend import vat
    >>> from sklearn.preprocessing import scale
    >>> X = scale(datasets.load_iris().data)
    >>> vat(X)
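
For readers curious about what the VAT image shows, the sketch below reorders a dissimilarity matrix so that nearby points become adjacent, which is the core idea behind VAT. It is a simplified NumPy illustration, not the package's code; the function name `vat_order` and the toy data are assumptions for this example.

```python
import numpy as np


def vat_order(D):
    """Return a VAT-style ordering of a symmetric dissimilarity matrix D."""
    n = D.shape[0]
    # Start from one endpoint of the largest dissimilarity.
    start = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order = [start]
    remaining = [i for i in range(n) if i != start]
    while remaining:
        # Pick the unselected point closest to any already selected point.
        sub = D[np.ix_(order, remaining)]
        j = remaining[int(np.argmin(sub.min(axis=0)))]
        order.append(j)
        remaining.remove(j)
    return np.array(order)


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
X = X[rng.permutation(len(X))]  # shuffle so the two clusters are interleaved
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
order = vat_order(D)
# Plotted as an image, the reordered matrix shows dark blocks on the diagonal.
D_reordered = D[np.ix_(order, order)]
```

Each dark diagonal block in the reordered matrix corresponds to a group of mutually close observations.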

Example iVAT

    >>> from sklearn import datasets
    >>> from pyclustertend import ivat
    >>> from sklearn.preprocessing import scale
    >>> X = scale(datasets.load_iris().data)
    >>> ivat(X)

Notes

It is preferable to scale the data before using the Hopkins or VAT algorithms, since they rely on distances between observations. Moreover, the VAT and iVAT algorithms are not well suited to very large datasets. One workaround is to sample the data before applying them.
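
The sampling workaround can be sketched as follows. This is a minimal example using NumPy only; the helper name `sample_and_scale`, the sample size, and the synthetic dataset are assumptions for illustration.

```python
import numpy as np


def sample_and_scale(X, n_samples, seed=0):
    """Hypothetical helper: random row subsample, then per-column standardization."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    size = min(n_samples, X.shape[0])
    idx = rng.choice(X.shape[0], size=size, replace=False)
    sample = X[idx]
    std = sample.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (sample - sample.mean(axis=0)) / std


# Synthetic stand-in for a dataset too large for a full pairwise distance matrix.
big = np.random.default_rng(42).normal(size=(100_000, 4))
small = sample_and_scale(big, 1_000)
# `small` can then be passed to vat or ivat in place of the full dataset.
```

A pairwise distance matrix grows quadratically with the number of rows, so reducing 100,000 rows to 1,000 shrinks it by a factor of 10,000.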

Footnotes

  1. Hopkins, Brian; Skellam, J.G. (1954). "A new method for determining the type of distribution of plant individuals". Annals of Botany. 18 (2). Annals Botany Co: 213–227. doi:10.1093/oxfordjournals.aob.a083391.

  2. Bezdek, James C.; Hathaway, Richard J. (2002). "VAT: A Tool for Visual Assessment of (Cluster) Tendency". Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN '02. IEEE Computer Society. pp. 2225–2230. doi:10.1109/IJCNN.2002.1007487.

  3. Wang, L., Nguyen, U. T., Bezdek, J. C., Leckie, C. A., Ramamohanarao, K. (2010). "iVAT and aVAT: enhanced visual analysis for cluster tendency assessment". In Advances in Knowledge Discovery and Data Mining. PAKDD 2010 (pp. 16-27). Berlin, Heidelberg: Springer Berlin Heidelberg. doi:10.1007/978-3-642-13657-3_5.

  4. Rousseeuw, Peter J. (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Journal of Computational and Applied Mathematics. 20: 53–65. doi:10.1016/0377-0427(87)90125-7.

  5. Caliński, Tadeusz; Harabasz, Jerzy (1974). "A dendrite method for cluster analysis". Communications in Statistics. 3 (1): 1–27. doi:10.1080/03610927408827101.

  6. Davies, David L.; Bouldin, Donald W. (1979). "A Cluster Separation Measure". IEEE Transactions on Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224–227. doi:10.1109/TPAMI.1979.4766909.