A scalable, end-to-end Streaming Data Lakehouse built to capture, store, and visualize real-time cryptocurrency trade data from Binance.
This project implements the "Lakehouse" architecture, combining the flexibility of a Data Lake with the management features of a Data Warehouse.
**Ingestion**
- Role: Ingests raw trade data from the Binance WebSocket.
- Tech: Python, Kafka

**Processing**
- Role: Streams data into Iceberg tables on MinIO object storage.
- Tech: Apache Spark, Iceberg, Nessie, MinIO

**Warehouse**
- Role: Provides a SQL interface to query data in the lake.
- Tech: Trino, Nessie Catalog

**Visualization**
- Role: Visualizes trends and historical data.
- Tech: Apache Superset
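To make the ingestion component concrete, here is a minimal sketch of how a Binance trade event could be flattened into the columns queried later. The field names (`s`, `p`, `q`, `T`) follow Binance's public trade-stream payload; the topic name and the commented Kafka wiring are assumptions, not taken from this repo.

```python
import json

KAFKA_TOPIC = "binance-trades"  # assumed topic name, not from this repo

def parse_trade(msg: dict) -> dict:
    """Flatten a Binance trade event into the columns used downstream."""
    return {
        "symbol": msg["s"],
        "price": float(msg["p"]),      # Binance sends prices as strings
        "volume": float(msg["q"]),
        "timestamp": msg["T"],         # epoch milliseconds, hence timestamp / 1000 in SQL
    }

# Wiring sketch (requires kafka-python and a websocket client; shown for shape only):
# producer = KafkaProducer(bootstrap_servers="localhost:9092",
#                          value_serializer=lambda v: json.dumps(v).encode())
# producer.send(KAFKA_TOPIC, parse_trade(event))
```

The millisecond `timestamp` field is kept as-is so the SQL layer can do the conversion, matching the sample query below.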
- Docker & Docker Compose
- Python 3.10+ (recommended: `uv` or `venv`)
- Git
Spin up the container cluster (Kafka, MinIO, Nessie, Trino, Superset):
```bash
docker-compose up -d --build
```

**Terminal A: The Producer**
(Connects to Binance and pushes trades to Kafka)
```bash
python ingestion/producer.py
```

**Terminal B: The Spark Stream**
(Reads from Kafka and commits parquet files to the Lake)
```bash
python processing/spark-job.py
```

**Superset UI**
- URL: http://localhost:8088
- Login: admin / admin
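Before Superset or Trino can see any data, the Spark job in Terminal B has to register an Iceberg catalog backed by Nessie and MinIO. A minimal sketch of the relevant session settings follows; the URIs, branch, and warehouse bucket are assumptions based on the service names in the compose file, not copied from this repo.

```python
# Settings a Spark job would pass via SparkSession.builder.config(key, value).
# All values below are assumptions: adjust endpoints, credentials, and bucket
# names to match your docker-compose.yml.
SPARK_CONF = {
    "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
    "spark.sql.catalog.nessie.uri": "http://nessie:19120/api/v1",
    "spark.sql.catalog.nessie.ref": "main",                    # Nessie branch to commit to
    "spark.sql.catalog.nessie.warehouse": "s3a://warehouse/",  # MinIO bucket
    "spark.hadoop.fs.s3a.endpoint": "http://minio:9000",
    "spark.hadoop.fs.s3a.path.style.access": "true",           # required for MinIO
}
```

With a catalog named `nessie`, tables are addressed as `nessie.<schema>.<table>`, which is why the queries below reference `nessie.crypto.binance_trades`.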
You can query data using Trino (via Superset SQL Lab or CLI).
Connection String (Superset):
```
trino://admin@trino:8080/nessie/crypto
```
Sample Query:
```sql
SELECT
  from_unixtime(timestamp / 1000) AS event_time,
  symbol,
  price,
  volume
FROM binance_trades
ORDER BY timestamp DESC
LIMIT 10;
```

Since we use Nessie and Iceberg, you can query the table as it looked in the past.
Query the table state as of 5 minutes ago:
```sql
SELECT count(*)
FROM nessie.crypto.binance_trades
FOR TIMESTAMP AS OF (current_timestamp - interval '5' minute);
```

**Project Structure**
- `ingestion/` - Python scripts for fetching WebSocket data from Binance.
- `processing/` - Spark structured streaming jobs.
- `warehouse/` - Configuration for Trino and Nessie.
- `visualization/` - Custom Dockerfile for Superset (includes Trino drivers).
- `docker-compose.yml` - Infrastructure definition.
- `requirements.txt` - Python dependencies.
- `checkpoint_dir/` - Checkpoint data for streaming jobs.
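When templating time-travel queries from application code, the cutoff can also be computed client-side and embedded as a timestamp literal. A hypothetical helper (the function name and formatting choices are assumptions, not part of this repo):

```python
from datetime import datetime, timedelta, timezone

def time_travel_query(table: str, minutes_ago: int) -> str:
    """Build a Trino time-travel query against an Iceberg table (sketch)."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=minutes_ago)
    ts = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    # Trino accepts a zoned timestamp literal after FOR TIMESTAMP AS OF.
    return (
        f"SELECT count(*) FROM {table} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{ts} UTC'"
    )
```

This mirrors the 5-minute example above, but pins the snapshot to an explicit literal rather than evaluating `current_timestamp` server-side.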