A comprehensive real-time data streaming solution that ingests API data and processes it through a robust data engineering pipeline.
This project demonstrates an end-to-end data engineering pipeline that ingests data from APIs, processes it in real-time, and stores it for analysis. The architecture implements a scalable, fault-tolerant system using modern data engineering tools and technologies.
The pipeline follows these key steps:
- Fetch data from external APIs
- Stream the data through Kafka
- Process the streams with Spark
- Store processed data in Cassandra
- Orchestrate workflows with Airflow
- Docker: Containerization platform for consistent deployment across environments
- Apache Airflow: Workflow orchestration tool for scheduling and monitoring data pipelines
- Apache Kafka: Distributed streaming platform for high-throughput, fault-tolerant data ingestion
- Apache Spark: Unified analytics engine for large-scale data processing
- Apache Cassandra: NoSQL database for handling high-velocity data with no single point of failure
- PostgreSQL: Relational database for structured data storage and Airflow metadata
Data Flow:
- Ingestion Layer: External APIs → Kafka topics
- Processing Layer: Kafka → Spark Streaming
- Storage Layer: Processed data → Cassandra (NoSQL) and PostgreSQL (SQL)
- Orchestration Layer: Airflow manages the entire workflow
- Docker and Docker Compose
- Python 3.8+
- Internet connection (for API access)
- Basic understanding of data engineering concepts
├── dags/
│ └── kafka_stream.py # Airflow DAG for API data ingestion
├── scripts/
│ └── entrypoint.sh # Container initialization script
├── docker-compose.yml # Container orchestration
├── requirements.txt # Python dependencies
├── spark_stream.py # Spark streaming job
├── .env # Environment variables (git-ignored)
└── README.md # Project documentation
git clone https://github.com/shreyashreddyk/api-data-streaming.git
cd api-data-streamingpip install -r requirements.txtchmod +x scripts/entrypoint.shAPI_KEY=your_api_key_here
API_SECRET=your_api_secret_here
docker-compose up -dThis command spins up the following containers:
- Zookeeper (configuration service for Kafka)
- Kafka Broker (message broker)
- Schema Registry (metadata service)
- Control Center (Kafka monitoring UI)
- Spark Master & Worker (distributed processing)
- Cassandra (NoSQL database)
- Airflow Webserver & Scheduler
- PostgreSQL (relational database)
Navigate to http://localhost:8080 in your browser to access the Airflow web interface.
In the Airflow UI, locate the api_data_pipeline DAG and click the "Play" button to trigger it. This will:
- Call external APIs to fetch data
- Stream the data into Kafka topics
- Schedule the data processing workflow
Access the Kafka Control Center at http://localhost:9021 to monitor topics, consumers, and data flow through the system.
spark-submit --master spark://localhost:7077 spark_stream.pyThis starts the Spark streaming application that consumes data from Kafka, processes it according to business logic, and writes results to Cassandra.
Connect to the Cassandra cluster and check that data is being stored correctly:
docker exec -it cassandra cqlsh -u cassandra -p cassandra localhost 9042Then run queries to view your data:
DESCRIBE keyspaces;
USE api_data;
SELECT * FROM processed_data LIMIT 10;- Spark UI: Available at
http://localhost:9090 - Kafka Control Center: Available at
http://localhost:9021 - Airflow Dashboard: Available at
http://localhost:8080
- Container startup failures
- Check container logs:
docker-compose logs [service_name] - Verify that ports are not already in use
- Check container logs:
- API rate limiting
- Implement appropriate backoff strategies in your API calls
- Check API response codes and handle accordingly
- Kafka connection issues
- Ensure Zookeeper is running properly
- Check network connectivity between containers
- Spark job failures
- Review Spark UI for detailed error messages
- Verify JAR dependencies are correctly specified
- Ganiyu, Y. (2023). "Realtime Data Engineering Project With Airflow, Kafka, Spark, Cassandra and Postgres."
- Apache Kafka Documentation: https://kafka.apache.org/documentation/
- Apache Spark Documentation: https://spark.apache.org/docs/latest/
- Apache Airflow Documentation: https://airflow.apache.org/docs/
- Apache Cassandra Documentation: https://cassandra.apache.org/doc/latest/
