A scalable, end-to-end Streaming Data Lakehouse built to capture, store, and visualize real-time cryptocurrency trade data from Binance.
This project implements the "Lakehouse" architecture, combining the flexibility of a Data Lake with the management features of a Data Warehouse.
**Ingestion**
- Role: Ingests raw trade data from the Binance WebSocket.
- Tech: Python, Kafka

**Processing**
- Role: Streams data into Iceberg tables on MinIO object storage.
- Tech: Apache Spark, Iceberg, Nessie, MinIO

**Warehouse**
- Role: Provides a SQL interface to query data in the lake.
- Tech: Trino, Nessie Catalog

**Visualization**
- Role: Visualizes trends and historical data.
- Tech: Apache Superset
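To make the ingestion component concrete, here is a minimal sketch of how a Binance trade event could be flattened into the columns queried later. The field names (`s`, `p`, `q`, `T`) follow Binance's public trade-stream payload; the topic name and the commented Kafka wiring are assumptions, not taken from this repo.

```python
import json

KAFKA_TOPIC = "binance-trades"  # assumed topic name, not from this repo

def parse_trade(msg: dict) -> dict:
    """Flatten a Binance trade event into the columns used downstream."""
    return {
        "symbol": msg["s"],
        "price": float(msg["p"]),      # Binance sends prices as strings
        "volume": float(msg["q"]),
        "timestamp": msg["T"],         # epoch milliseconds, hence timestamp / 1000 in SQL
    }

# Wiring sketch (requires kafka-python and a websocket client; shown for shape only):
# producer = KafkaProducer(bootstrap_servers="localhost:9092",
#                          value_serializer=lambda v: json.dumps(v).encode())
# producer.send(KAFKA_TOPIC, parse_trade(event))
```

The millisecond `timestamp` field is kept as-is so the SQL layer can do the conversion, matching the sample query below.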
- Docker & Docker Compose
- Python 3.10+ (recommended: `uv` or `venv`)
- Git
Spin up the container cluster (Kafka, MinIO, Nessie, Trino, Superset):
```bash
docker-compose up -d --build
```

**Terminal A: The Producer**
(Connects to Binance and pushes trades to Kafka)
```bash
python ingestion/producer.py
```

**Terminal B: The Spark Stream**
(Reads from Kafka and commits parquet files to the Lake)
```bash
python processing/spark-job.py
```

**Superset UI**
- URL: http://localhost:8088
- Login: admin / admin
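Before Superset or Trino can see any data, the Spark job in Terminal B has to register an Iceberg catalog backed by Nessie and MinIO. A minimal sketch of the relevant session settings follows; the URIs, branch, and warehouse bucket are assumptions based on the service names in the compose file, not copied from this repo.

```python
# Settings a Spark job would pass via SparkSession.builder.config(key, value).
# All values below are assumptions: adjust endpoints, credentials, and bucket
# names to match your docker-compose.yml.
SPARK_CONF = {
    "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
    "spark.sql.catalog.nessie.uri": "http://nessie:19120/api/v1",
    "spark.sql.catalog.nessie.ref": "main",                    # Nessie branch to commit to
    "spark.sql.catalog.nessie.warehouse": "s3a://warehouse/",  # MinIO bucket
    "spark.hadoop.fs.s3a.endpoint": "http://minio:9000",
    "spark.hadoop.fs.s3a.path.style.access": "true",           # required for MinIO
}
```

With a catalog named `nessie`, tables are addressed as `nessie.<schema>.<table>`, which is why the queries below reference `nessie.crypto.binance_trades`.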
You can query data using Trino (via Superset SQL Lab or CLI).
Connection String (Superset):
```
trino://admin@trino:8080/nessie/crypto
```
Sample Query:
```sql
SELECT
  from_unixtime(timestamp / 1000) AS event_time,
  symbol,
  price,
  volume
FROM binance_trades
ORDER BY timestamp DESC
LIMIT 10;
```

Since we use Nessie and Iceberg, you can query the table as it looked in the past.
Query the table state as of 5 minutes ago:
```sql
SELECT count(*)
FROM nessie.crypto.binance_trades
FOR TIMESTAMP AS OF (current_timestamp - interval '5' minute);
```

**Project Structure**
- `ingestion/` - Python scripts for fetching WebSocket data from Binance.
- `processing/` - Spark structured streaming jobs.
- `warehouse/` - Configuration for Trino and Nessie.
- `visualization/` - Custom Dockerfile for Superset (includes Trino drivers).
- `docker-compose.yml` - Infrastructure definition.
- `requirements.txt` - Python dependencies.
- `checkpoint_dir/` - Checkpoint data for streaming jobs.
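When templating time-travel queries from application code, the cutoff can also be computed client-side and embedded as a timestamp literal. A hypothetical helper (the function name and formatting choices are assumptions, not part of this repo):

```python
from datetime import datetime, timedelta, timezone

def time_travel_query(table: str, minutes_ago: int) -> str:
    """Build a Trino time-travel query against an Iceberg table (sketch)."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=minutes_ago)
    ts = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    # Trino accepts a zoned timestamp literal after FOR TIMESTAMP AS OF.
    return (
        f"SELECT count(*) FROM {table} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{ts} UTC'"
    )
```

This mirrors the 5-minute example above, but pins the snapshot to an explicit literal rather than evaluating `current_timestamp` server-side.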