Here, we provide its prototype implementation in Python and in Java, together with an experimental pipeline that evaluates its accuracy on synthetic and real-world datasets, as well as its update and query times, in comparison with t-digest, KLL, and MomentSketch.
Setup: Clone the repository and then run `make` to compile the Java wrappers that run the individual sketches.
There are four experimental pipelines, whose parameters can be adjusted in the individual Python source files:
- Accuracy and running time experiments on synthetic datasets: run with `python run_experiments_IID.py`
- Accuracy and running time experiments on real-world datasets: download the datasets as described below and then run with `python run_experiments_datasets.py` (optionally adjust the datasets in the `load_<dataset>_data` functions)
- Update time experiment: run with `python run_experiments_update_time.py`
- Query time experiment: run with `python run_experiments_query_time.py`
All of these Python programs save a set of plots with the results into the `plots/` directory.
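
To run the four scripts listed above one after another, a small driver along the following lines may be convenient. This is a minimal sketch and not part of the repository: the names `build_commands` and `run_all` are hypothetical, and it assumes the scripts are started from the repository root after `make` has been run and the datasets have been downloaded.

```python
import subprocess
import sys

# Script names as listed above: synthetic, real-world, update time, query time.
SCRIPTS = [
    "run_experiments_IID.py",
    "run_experiments_datasets.py",
    "run_experiments_update_time.py",
    "run_experiments_query_time.py",
]


def build_commands(python=sys.executable, scripts=SCRIPTS):
    """Return one argv list per experiment script (hypothetical helper)."""
    return [[python, script] for script in scripts]


def run_all(dry_run=False):
    """Run each pipeline in turn and stop at the first failure."""
    for cmd in build_commands():
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
```

Calling `run_all(dry_run=True)` only prints the commands, which is a cheap way to check the order before starting the possibly long-running experiments with `run_all()`.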
- HEPMASS dataset from UC Irvine ML Repository: download `all_train.csv.gz` and `all_test.csv.gz` and decompress both files into `datasets/hepmass/`
- Power dataset from UC Irvine ML Repository: download into `datasets/household_power_consumption/household_power_consumption.txt`
- Books dataset from SOSD (a benchmark for learned indexes): download using `download_books_dataset.sh`
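
The dataset layout above can be prepared and checked with a short helper such as the following. It is a sketch under stated assumptions, not code from the repository: the function names `decompress_gz` and `missing_files` are hypothetical, and the decompressed HEPMASS files are assumed to keep their names without the `.gz` suffix (`all_train.csv`, `all_test.csv`). The Books dataset is left out because its target path is not stated here.

```python
import gzip
import shutil
from pathlib import Path

# Paths relative to the repository root. The two HEPMASS names are an
# assumption: the downloaded archives with the ".gz" suffix removed.
EXPECTED = [
    "datasets/hepmass/all_train.csv",
    "datasets/hepmass/all_test.csv",
    "datasets/household_power_consumption/household_power_consumption.txt",
]


def decompress_gz(src, dest_dir):
    """Decompress a .gz file into dest_dir, dropping the .gz suffix."""
    src = Path(src)
    if not src.name.endswith(".gz"):
        raise ValueError(f"not a .gz file: {src}")
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name[: -len(".gz")]
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        # Copy in chunks so large files are not loaded into memory at once.
        shutil.copyfileobj(fin, fout)
    return dest


def missing_files(root="."):
    """Return the expected dataset paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]
```

For example, `decompress_gz("all_train.csv.gz", "datasets/hepmass")` writes `datasets/hepmass/all_train.csv`, and an empty list from `missing_files()` suggests the HEPMASS and Power files are where the instructions above place them.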