
PureStorage-OpenConnect/lakebench-k8s


Lakebench

Python 3.10+ · Apache-2.0 license

A/B testing for lakehouse architectures on Kubernetes.

Deploy a complete lakehouse stack from a single YAML, run a medallion pipeline at any scale, and get a scorecard you can compare across configurations.

Why Lakebench?

  • Compare stacks. Swap catalogs (Hive, Polaris), query engines (Trino, Spark Thrift, DuckDB), and table formats -- same data, same queries, different architecture. Side-by-side scorecard comparison.
  • Test at scale. Run the same workload at 10 GB, 100 GB, and 1 TB to find where throughput plateaus or resources saturate on your hardware.
  • Measure freshness. Sustained mode streams data continuously through the pipeline and benchmarks query performance while ingest is still running.
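Sustained mode is selected through the same config file as everything else. A minimal sketch, assuming `sustained` is the counterpart to the documented `mode: batch` (the exact value is not confirmed here, so check the CLI Reference):

```yaml
# lakebench.yaml -- sustained-ingest sketch
# NOTE: `mode: sustained` is an assumption; only `mode: batch` appears in this README
endpoint: http://s3.example.com:80
access_key: YOUR_KEY
secret_key: YOUR_SECRET
scale: 10
mode: sustained
```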

Quick Start

Pre-built binaries (no Python required) are available on GitHub Releases.

pip install lakebench-k8s
lakebench init                           # quick setup (4 questions)
lakebench run lakebench.yaml --generate  # deploy + generate + pipeline + benchmark
lakebench results                        # view scorecard
lakebench destroy lakebench.yaml         # tear down everything

Minimum config -- 4 lines:

# lakebench.yaml
endpoint: http://s3.example.com:80
access_key: YOUR_KEY
secret_key: YOUR_SECRET
scale: 10                          # 1 = ~10 GB, 10 = ~100 GB, 100 = ~1 TB

Name is auto-generated. Recipe defaults to hive-iceberg-spark-trino. Override anything with flat fields or nested YAML:

# lakebench.yaml (with overrides)
name: flashblade-polaris
recipe: polaris-iceberg-spark-trino
endpoint: http://10.21.227.93:80
access_key: ${S3_ACCESS_KEY}       # env var substitution
secret_key: ${S3_SECRET_KEY}
scale: 50
mode: batch
spark_image: apache/spark:4.1.1-python3

Eleven recipes are available -- see Recipes for the full list.

Compare two configurations side-by-side:

lakebench compare config-hive.yaml config-polaris.yaml

For all recipes, see examples/ or run lakebench init --advanced for the full interactive wizard.

What You Get

After lakebench run completes, the terminal prints a scorecard:

 ─ Pipeline Complete ──────────────────────────────
  bronze-verify         142.0 s
  silver-build          891.0 s
  gold-finalize         234.0 s
  benchmark              87.0 s

  Scores
    Time to Value:        1354.0 s
    Throughput:           0.782 GB/s
    Efficiency:           3.41 GB/core-hr
    Scale:                100.0% verified
    QpH:                  2847.3

  Full report: lakebench report
 ──────────────────────────────────────────────────

lakebench report generates an HTML report with per-query latencies, bottleneck analysis, and optional platform metrics (CPU, memory, S3 I/O per pod).

How It Works

                    ┌──────────────────────────────────┐
                    │         lakebench.yaml           │
                    └────────────┬─────────────────────┘
                                 │
                    ┌────────────▼─────────────────────┐
                    │  deploy (Kubernetes namespace,   │
                    │ S3 secrets, PostgreSQL, catalog, │
                    │  query engine, observability)    │
                    └────────────┬─────────────────────┘
                                 │
     Raw Parquet ──► Bronze (validate) ──► Silver (enrich) ──► Gold (aggregate)
         S3              Spark                Spark               Spark
                                                                    │
                                                        ┌───────────▼──────────┐
                                                        │  8-query benchmark   │
                                                        │  (Trino / DuckDB /   │
                                                        │   Spark Thrift)      │
                                                        └──────────────────────┘

Prerequisites

  • kubectl and helm on PATH
  • Kubernetes 1.26+ (minimum 8 CPU / 32 GB RAM for scale 1)
  • S3-compatible object storage (FlashBlade, MinIO, AWS S3, etc.)
  • Kubeflow Spark Operator 2.4.0+ (or set spark.operator.install: true)
  • Stackable Hive Operator for Hive recipes (not needed for Polaris)
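If the Spark Operator is not already installed, the `spark.operator.install: true` setting mentioned above maps to nested YAML in the config file:

```yaml
# lakebench.yaml excerpt -- ask Lakebench to install the Spark Operator itself
spark:
  operator:
    install: true
```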

Commands

Command      Description
init         Generate a starter config file
validate     Check config and cluster connectivity
info         Show deployment configuration summary
deploy       Deploy all infrastructure components
generate     Generate synthetic data at the configured scale
run          Execute the medallion pipeline and benchmark
benchmark    Run the 8-query benchmark standalone
query        Execute ad-hoc SQL against the active engine
status       Show deployment status
report       Generate HTML scorecard report
recommend    Recommend cluster sizing for a scale factor
destroy      Tear down all deployed resources

See CLI Reference for flags and options.

Component Versions

Component         Version
Apache Spark      3.5.4, 4.0.2, 4.1.1
Spark Operator    2.4.0 (Kubeflow)
Apache Iceberg    1.10.1
Delta Lake        4.0.0
Hive Metastore    3.1.3 (Stackable 25.7.0)
Apache Polaris    1.3.0-incubating
Trino             479
DuckDB            bundled (Python 3.11)
PostgreSQL        16, 17, 18

All versions are overridable in the YAML config. See Supported Components.
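As an illustration of a version override, `spark_image` is the flat field shown in the Quick Start; the remaining keys below are hypothetical placeholders, so consult Supported Components for the actual field names:

```yaml
# lakebench.yaml excerpt -- pinning component versions (sketch)
spark_image: apache/spark:3.5.4-python3   # documented flat override
trino_version: 479                        # hypothetical key, for illustration
postgres_version: "17"                    # hypothetical key, for illustration
```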

License

Apache 2.0

About

Deploy, benchmark, and compare lakehouse architectures on Kubernetes with a single YAML file. Lakebench provisions Spark, Iceberg, and Trino, generates synthetic data at any scale, runs medallion pipelines (Bronze→Silver→Gold), and scores performance with an 8-query benchmark across pluggable catalogs, table formats, and query engines.
