Skip to content

HPI-Information-Systems/DendroTime

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DendroTime

Progressive HAC system for time series anomalies.

License: MIT Scala version 3.3 python version 3.8|3.9|3.10


Many effective dissimilarity measures for variable-length time series, such as Dynamic Time Warping (DTW), Move-Split-Merge (MSM), or Time Warp Edit Distance (TWED), are expensive to compute because their runtimes increase quadratically with the time series’ lengths. When used in hierarchical agglomerative clustering (HAC) algorithms that need to compute all pairwise time series dissimilarities, they cause slow runtimes and do not scale to large time series collections. However, there are use cases, where fast, interactive hierarchical clustering is necessary. For these use cases, progressive hierarchical clustering algorithms can improve runtimes and interactivity. Progressive algorithms are incremental algorithms that produce and continuously improve an approximate solution, which eventually converges to the exact solution.

In this paper, we present DendroTime, the first (parallel) progressive hierarchical clustering system for variable-length time series collections. The system incrementally computes the pairwise dissimilarities between the input time series and supports different ordering strategies to achieve progressivity. Our evaluation demonstrates that DendroTime’s progressive strategies are very effective for clustering scenarios with expensive time series dissimilarity computations.

Architecture

DendroTime consists of a reactive client-server architecture that executes the progressive HAC algorithm on a collection of time series (anomalies) in a highly concurrent execution server and visualizes dendrogram updates in a web-based client. The execution server is implemented using the actor programming model (Akka), which is a reactive programming paradigm for concurrent and parallel applications. Its core primitives are actors, which are objects with private state and behavior. The following figures provides an overview of DendroTime's client-server architecture and its actors:

DendroTime architecture

DendroTime supports the following measures to compute the dissimilarities between time series:

  • Minkowsky distances (i.a., Euclidean)
  • Lorentzian distance
  • MSM
  • DTW
  • SBD
  • KDTW

The following linkage functions are supported and compatible:

  • single
  • complete
  • average
  • weighted

Other linkage functions, such as Ward, median, or centroid linkage, are not compatible with non-metric time series dissimilarity measures, but can be enabled in code: de.hpi.fgis.dendrotime.model.ParametersModel.DendroTimeParams#areCompatible. The client is web-based and visualizes the dendrogram as well as the computational and qualitative progress to allow the user to monitor the clustering results over time, and potentially stop the process early:

DendroTime-clustering-ACSF1.mp4

DendroTime’s frontend allows you to select a dissimilarity measure, a linkage method, and a progressive strategy to cluster a specific dataset. For each job, it visualizes the completed processing steps for each phase in % (top), the convergence indicator (middle), and changes to the dendrogram (bottom).

Repository structure

The following table provides an overview about the most important contents in this repository:

Folder Description
Scala modules
bloom-filter Adaptation of Alexandr Nikitin's bloom-filter library for Scala 3
progress-bar Progress bar library similar to tqdm, used in Scala-CLI scripts
dendrotime-io DendroTime's IO module: CSV reader and writer, .ts-file parser, time series data structure
dendrotime-clustering DendroTime's clustering module: Time series dissimilarities, linkage methods, MST and NN-chain algorithms, dissimilarity vector, stepwise dendrogram (hierarchy), and corresponding IO.
dendrotime-frontend DendroTime's frontend (TypeScript, React, TailwindCSS)
dendrotime-backend DendroTime system: Full-featured web application including the frontend
dendrotime-runner DendroTime runner: CLI application without the user-facing features, used for paper experiments.
dendrotime-evaluator DendroTime evaluator: CLI application to compute metrics for dendrograms (hierarchies)
dendrotime-benchmarking Contains jmh micro-benchmarks.
Folders
docs Hosts doc files and resources
experiments Paper experiment scripts for reproducibility
results Aggregated results from the experiments
scripts Collection of various Scala and Python scripts
Files
build.sbt SBT build configuration
requirements.txt Python dependencies required for the scripts and experiments

Experiments and results

DendroTime supports multiple strategies to order the time series dissimilarities. The following three are the most important ones:

  • fcfs: The fcfs baseline strategy simply orders the dissimilarity computations in the order we load the time series from disk.
  • ada: The ada strategy defines an ordering heuristic that is based on the dissimilarities from the approximation phase, from small to large.
  • precl: The precl strategy follows a three-step approach (inspired by JET): The first step (intra-pre-cluster) partitions the approximated dendrogram into 3√𝑛 pre-clusters and then computes the exact dissimilarities for all time series pairs within these pre-clusters. The second step (medoids) determines the medoid for each pre-cluster and computes the dissimilarities of all medoid pairs. The final step (inter-pre-cluster) computes the exact dissimilarities of all inter-pre-cluster pairs. We compute the dissimilarities between time series from close pre-clusters before those between far-apart pre-clusters.

We compare DendroTime with three baseline algorithms:

  • PARALLEL HAC: This baseline is a multithreaded Scala implementation of traditional linkage-based HAC. It computes the pairwise time series dissimilarities in parallel and, then, constructs the final, exact stepwise dendrogram in the end. You can execute the PARALLEL baseline using the DendroTime runner:

    java -jar DendroTime-runner.jar --parallel --dataset <dataset-name>
    

    There is also a SERIAL baseline that does not use any parallelism: --serial.

  • JET: JET is an approximate algorithm for hierarchical clustering of time series anomalies. It is implemented in Python, and we execute it with Python version 3.9.21. After computing a coarse-grained pre-clustering of the time series using the very efficient BIRCH algorithm, it computes the exact dissimilarities just for intra-pre-cluster time series and between the pre-cluster medoids. The dissimilarity computations are executed in parallel. You can execute JET from the experiments/06-jet/ folder:

    cd experiments/06-jet
    python run-jet.py --datafolder ../../data/datasets --dataset <dataset-name>
    
  • HappieClust: HappieClust is another approximate algorithm for hierarchical clustering of time series. It is also implemented in Python, and we execute it with Python version 3.9.21. Our implementation is based on the implementation of the JET paper. HappieClust computes pseudo-distances in a low-dimensional pivot-space to estimate the time series dissimilarities and construct the stepwise dendrogram afterwward. All dissimilarity computations are executed in parallel. You can run HappieClust from the experiments/10-happieclust/ folder:

    cd experiments/10-happieclust
    python run-happieclust.py --datafolder ../../data/datasets --dataset <dataset-name>
    

We perform the experiments on 123 univariate datasets without missing values from the UCR time series classification archive and 12 univariate anomaly datasets extracted from greenhouse telemetry of the EDENISS project. The UCR datasets are hosted publicly and can be downloaded with the download_datasets.py-script.

legend dtw-average convergence

Horizontal axis shows relative runtime w.r.t the PARALLEL baseline runtime and vertical axis shows dendrogram similarity w.r.t to the exact dendrogram (= quality).

DendroTime vs. JET and HappieClust

Although JET and HappieClust are approximate approaches for calculating fast HAC sketches, they are on average slower than DendroTime. For larger datasets and excessively costly dissimilarity measures (i.a. KDTW), their Python overheads diminish and they are indeed faster. Nevertheless, DendroTime achieves for all datasets at least similar quality after the same runtime.

DendroTime vs. PARALLEL

Because DendroTime eventually computes the same result as PARALLEL, but with additional approximation and ordering steps in the process, it takes up to 1.5x time to finish the exact dendrogram, as expected. Despite DendroTime continuously producing approximate results, its progressive strategies can achieve high qualities in a shorter or similar runtime as PARALLEL.

Citation

When using this software please cite our paper:

Sebastian Schmidl, Ferdinand Rewicki, Felix Naumann, Thorsten Papenbrock: DendroTime: Progressive Hierarchical Clustering for Variable-Length Time Series. Proceedings of the International Conference on Extending Database Technology (EDBT), 2026.

Installation

Please make sure that you have a recent Java runtime environment on your machine.

Building

Prerequisites

  • Git
  • Java >= 21.0.0
  • SBT >= 1.10.0
  • (Scala 3 is managed by SBT)
  • Node.js >= v22.12.0
  • npm >= 11.0.0
  • Python >= 3.8

Procedure

  1. Clone the GitHub repository

    git clone [email protected]:HPI-Information-Systems/DendroTime.git
    
  2. Build the whole project and create the runtime artifacts. The web server version (backend + frontend) is built using:

    sbt assembly
    

    If you just want to build the CLI-version use:

    sbt runner/assembly
    

    You can find the fat-jar for the DendroTime runner in the experiments-folder.

Usage of the web application

Run DendroTime from sbt with

run <hostname> <port>

or build a JAR for DendroTime using SBT and then start the application from the JAR-file:

sbt assembly
mv target/scala-3.3.3/DendroTime-assembly*.jar DendroTime-server.jar
java -Dfile.encoding=UTF-8 -Dlogback.configurationFile=<logging-config-file> -Dconfig.file=<config-file> -jar DendroTime.jar <hostname> <port>

You can then access the frontend using a modern browser at http://<hostname>:<port>/.

If no datasets show up in the drop-down menu, please check the path to the datasets-folder in the configuration file!

Usage from the CLI

We also provide a headless version of DendroTime for experiments and scripts (DendroTime-runner). It can be started from the CLI and takes the dataset, distance, linkage, and strategy as arguments. Use --help to show options and arguments.

You build the DendroTime-runner using the runner SBT-subproject:

sbt "runner/assembly"

The JAR-file will be placed at experiments/DendroTime-runner.jar. Afterward, you can execute the DendroTime-runner with the ./run-dendrotime.sh-script. The JAR-file includes the code for the PARALLEL baseline. It can be executed using the ./run-baseline.sh-script.

Algorithm configuration

DendroTime exposes various configuration parameters to control its execution. They can be configured via CLI or a configuration file. DendroTime's configuration options are described in DendroTime configuration

Experiments & Reproducibility

The configuration of DendroTime and the baselines used for the experiments in the paper can be found in the experiments-folder. Note that we do not publish all results of the experiments in this Github-Repository due to their size. The results-folder contains only the aggregated quality and runtimes measuresments of the main experiments. Please contact us directly for the experiments' full result backups.

License

The DendroTime Software and all accompanying scripts are licensed under MIT license.

The edeniss20182020_anomaly datasets in experiments/09-edeniss-case-study/datasets/edeniss20182020_anomalies are licensed under CCY-BY-4.0.