WARNING: This project only runs on ARM64 chips.
- Project Objective
- Datasets Selection
- System Architecture
- Technologies Used
- Installation and Deployment
- Troubleshooting
- Results and Analysis
- Future Work
- References
- Authors
The Covid Data Process project aims to design and implement a comprehensive, real-time data processing pipeline specifically tailored to manage the continuous influx of COVID-19 data. This project seeks to build a scalable and efficient system capable of ingesting, processing, storing, and visualizing COVID-19 data in real-time, enabling stakeholders to make informed decisions based on the most current information available.
The dataset, sourced from the COVID-19 API, offers comprehensive and regularly updated reports on the global spread and impact of COVID-19. It encompasses a wide range of data points that track the pandemic's progression across various regions, including countries, states, and provinces. This dataset captures essential metrics such as the number of confirmed cases, deaths, recoveries, and active cases, providing a granular view of the pandemic's evolution over time.
Data from API format example:
{
"data":
[
0:{
"date":"2023-03-09"
"confirmed":209451
"deaths":7896
"recovered":0
"confirmed_diff":0
"deaths_diff":0
"recovered_diff":0
"last_update":"2023-03-10 04:21:03"
"active":201555
"active_diff":0
"fatality_rate":0.0377
"region":{
"iso":"AFG"
"name":"Afghanistan"
"province":""
"lat":"33.9391"
"long":"67.7100"
"cities":[]
}
...
]
}
The system architecture for this COVID-19 data processing pipeline is designed to ensure efficient data ingestion, processing, storage, and visualization. It leverages a combination of open-source technologies and cloud services to provide a scalable, robust, and flexible framework for managing and analyzing large volumes of real-time data.
The system is divided into several components, each responsible for specific tasks within the data process:
-
Data Source:
COVID-19data is retrieved from the APIhttps://covid-api.com/api/using HTTP requests.NiFihandles data ingestion, performing initial cleaning and transformation to prepare the data for further processing. -
Producer/Consumer:
NiFiacts as both producer and consumer, forwarding processed data toApache Kafkafor streaming.
-
Message Brokering:
Kafkaserves as the message broker, streaming data between system components in real-time. -
Monitoring:
RedpandamonitorsKafka’s performance, ensuring system stability. -
Streaming Analytics:
Spark Streamingprocesses the data in real-time, performing computations like aggregations and filtering as data flows throughKafka.
-
Distributed Storage: Data is stored in
Hadoop HDFS, providing scalable, reliable storage. -
Data Warehousing:
Apache HiveonHDFSenables efficient querying of large datasets.
-
Job Scheduling:
Airfloworchestrates and schedules the system’s workflows, ensuring smooth execution of data ingestion, processing, and storage tasks. -
Batch processing:
Apache Sparkprocessing on data stored in HDFS, facilitating complex data analysis tasks.
- Consistency & Deployment:
Dockercontainers ensure consistent environments across development, testing, and production, and are deployed onAWS EC2for scalability.
- Interactive Dashboards:
Amazon QuickSightvisualizes the processed data, allowing for the creation of interactive dashboards and reports.
Amazon EC2: Hosts the system in a scalable and flexible cloud environment.Docker: Containerizes the system components, ensuring consistency and easy deployment.
Apache NiFi: Handles data ingestion and initial processing from the COVID-19 API.Apache Kafka: Enables real-time data streaming between system components.Redpanda: Monitors Kafka to ensure stable data flow and system performance.Apache Spark: For both real-time and batch data processing.Hadoop HDFS: Provides distributed storage for large volumes of processed data.Apache Hive: Allows SQL-like querying and analysis of data stored in HDFS.Apache Airflow: rchestrates and schedules the workflow of the entire system.
Amazon QuickSight: Provides business intelligence and data visualization capabilities for insightful reporting and analysis.
1.1. Log in to AWS Management Console:
- Visit the AWS Management Console and log in with your credentials.
1.2. Launch a New EC2 Instance:
-
Navigate to the EC2 Dashboard.
-
Click on Launch Instance.
-
Choose an Amazon Machine Image (AMI):
-
Choose an Architecture
-
Choose an Instance Type:
-
Configure Instance Details:
- Ensure that Auto-assign Public IP is set to "Enable".
-
Configure Storage:
-
Configure Security Group:
-
Review and Launch the instance.
-
Download the key pair (
.pemfile) and keep it safe; it’s needed for SSH access.
1.3. Access the EC2 Instance
-
Open your terminal.
-
Navigate to the directory where the
.pemfile is stored. -
Run the following command to connect to your instance:
ssh -i "your-key-file.pem" ec2-user@your-ec2-public-ip
2.1: Update the Package Repository
-
Run the following commands to ensure your package repository is up to date:
sudo yum update -y
2.2. Install Docker
-
Install
Dockerby running the following commands:sudo yum install -y docker
-
Start
Dockerand enable it to start at boot:sudo systemctl start docker sudo systemctl enable docker -
Verify
Dockerinstallation:docker --version
3.1. Install Docker Compose
-
Docker Composeis not available in the defaultAmazon Linux 2repositories, so you will need to download it manually:sudo curl -L "https://github.com/docker/compose/releases/download/v2.19.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
3.2. Apply Executable Permissions
-
Apply executable permissions to the binary:
sudo chmod +x /usr/local/bin/docker-compose
3.3. Verify Docker Compose Installation
-
Verify the installation by checking the version:
docker-compose --version
1.1. Install Git (if not already installed)
-
Install Git to clone the repository:
sudo yum install -y git
1.2. Clone the Repository
-
Run the following command to clone the project repository:
git clone https://github.com/your-username/covid-data-process.git
-
Navigate to the project directory:
cd covid-data-process
2.1. Port forwarding to the AWS EC2 Instance
- Use the SSH command provided:
ssh -i "your-key-file.pem" \ -L 6060:localhost:6060 \ -L 8088:localhost:8088 \ -L 8080:localhost:8080 \ -L 1014:localhost:1010 \ -L 9999:localhost:9999 \ -L 8443:localhost:8443 \ ec2-user@eec2-user@your-ec2-public-ip- This command establishes a secure SSH connection to your EC2 instance and sets up port forwarding, allowing you to access various services running on your instance from your local machine.
2.2. Grant Execution Permissions for Shell Scripts
-
Once connected, ensure all shell scripts have the correct execution permissions:
chmod +x ./*- This command grants execute permissions to all .sh files in the project directory, allowing them to be run without any issues.
2.3. Initialize the Project Environment
-
Run the
init-pro.shscript to initialize the environment:./init-pro.sh
-
This script performs the following tasks:
- Sets the
AIRFLOW_UIDenvironment variable to match the current user's ID. - Creates necessary directories for
SparkandNiFi. - Prepares the environment for running
Airflowby initializing it withDocker.
- Sets the
2.4. Start the Docker Containers
-
Next, bring up all the services defined in the
docker-composefile:docker-compose --profile all up
- This command starts all the necessary containers for the project, including those for NiFi, Kafka, Spark, Hadoop, Hive, and Airflow.
2.5. Post-Deployment Configuration
-
After the
Docker containersare running, execute theafter-compose.shscript to perform additional setup:./after-compose.sh
-
This script:
- Sets up
HDFS directoriesand assigns the appropriate permissions. - Creates
Kafka topics(covidin and covidout). - Submits a
Spark jobto process streaming data.
- Sets up
2.6. Configure and Run Apache NiFi
- Now, you need to configure and start your data workflows in
Apache NiFi:-
Open the
NiFi Web UIby navigating tohttp://localhost:8080/nifiin your browser. -
Add the template for the
NiFiworkflow:
-
Start the
NiFiworkflow to begin ingesting and processingCOVID-19data.
-
2.7. Monitor Kafka Using Redpanda
- To monitor
Kafka, you can use Redpanda, which provides an easy-to-use interface forKafkamonitoring:
2.7. Finalize Setup and Execute SQL Commands
-
Once the
NiFi workflowis running, execute the final setup scriptafter-nifi.sh:./after-nifi.sh
-
This script:
- Ensures that all
HDFS directorieshave the correct permissions. - Starts the
SSH servicein the Spark master container. - Runs the
SQLscript inSparkto set up the necessary databases and tables.
- Ensures that all
2.8. Run Apache Airflow DAGs
-
Open the
Airflow Web UIby navigating tohttp://localhost:8080in your browser -
Add a new connection and fill in the following details:
-
Activate the two
DAGsthat were set up for this project.
2.9. Visualize Data in AWS QuickSight
-
Log in to your
AWS QuickSightaccount. -
Connect to the
Hivedatabase to access the processedCOVID-19data. -
Create dashboards and visualizations to analyze and present the data insights.
1. Issue: Docker Containers Fail to Start
- Symptom: When running
docker-compose --profile all up, one or more containers fail to start. - Solution:
-
Check if Docker is running properly on your
EC2 instanceby usingdocker ps.
-
Review the
docker-compose logsusingdocker-compose logsto identify the cause of the failure. -
Ensure that the
.envfile is correctly set up, especially theAIRFLOW_UIDvariable. -
Verify that the necessary directories (
nifi,spark, etc.) have been created with the correct permissions usingchmod -R 777.
-
2. Issue: Airflow DAGs Fail to Trigger
- Symptom: DAGs in
Airflowdo not execute when triggered. - Solution:
- Ensure the connection to
Spark(spark_conn) is correctly configured inAirflowby checking it under Admin > Connections. - Review the
Airflow logsfor errors related toSparkorKafkaintegration. - Restart the
Airflow servicesusingdocker-compose restart airflow.
- Ensure the connection to
3. Issue: Data Not Appearing in HDFS
- Symptom: Data is not visible in the
HDFSdirectories even after theNiFiworkflow is running. - Solution:
- Check the
NiFilogs to ensure that the processors are correctly writing data toHDFS. - Verify that HDFS directories
/data/covid_dataand/data/kafka_dataexist and have the correct permissions. - Manually inspect the
HDFSfile system using the commanddocker exec -it namenode hdfs dfs -ls /data/to check for the presence of data files.
- Check the
4. Issue: Spark Jobs Fail or Hang
- Symptom:
Spark jobsdo not complete or get stuck in a certain stage. - Solution:
- Monitor the
Spark Web UIfor any signs of resource exhaustion (e.g., memory or CPU limits). - Check the
Spark logsfor errors that might indicate misconfiguration or issues with input data. - Restart the
Sparkservices and re-submit the jobs if necessary.
- Monitor the
1. Viewing Docker Container Logs
-
Command: To view logs from a specific container, use:
docker logs <container_name>
2. Airflow Logs
-
Accessing Logs: In the
Airflow Web UI, navigate to the specific task instance under theDAGruns.
-
Direct Logs: Logs are also available directly within the container:
docker exec -it airflow-worker tail -f /opt/airflow/logs/<dag_id>/<task_id>/log.txt
3. NiFi Logs
-
View
NiFilogs by accessing theNiFicontainer:docker exec -it nifi tail -f /nifi/logs/nifi-app.log -
View folder mounted:
nifi/logs/ -
Web UI: You can also monitor processor status and logs directly within the
NiFi Web UIby clicking on individual processors.
4. Kafka Logs
-
Command: Access
Kafkalogs by connecting to theKafkacontainer:docker exec -it kafka tail -f /var/log/kafka/kafkaServer.log
5. Monitoring Kafka with Redpanda
-
Access: Navigate to
http://localhost:1010to access theRedpanda Console. -
Usage: Use the
Redpanda Consoleto monitorKafka brokers, topics, and consumer groups.
6. Spark Job Monitoring with Spark Web UI
-
Access: Navigate to
http://localhost:8088to open theSpark Web UI. -
Usage: The
Spark Web UIprovides detailed information about the running and completedSparkjobs, including stages, tasks, and executor logs.
7. Running Jupyter Lab for Testing
-
Open your web browser and go to
http://localhost:9999
-
Run the following script to retrieve the token required to log into
Jupyter Lab:./token_notebook.sh -
Paste the token obtained from the
token_notebook.shscript to log in.
By connecting the processed data to AWS QuickSight, the project provides a powerful visualization platform that enables stakeholders to derive meaningful insights from the data.
- As data volumes grow, optimizing the pipeline for scalability will be crucial. Implementing advanced partitioning strategies in
Kafkaand improving the resource allocation forSparkjobs could help the system handle larger datasets more efficiently. - Exploring container orchestration tools like
Kubernetescould also enable better management and scaling ofDockercontainers across multiple nodes.
- Integrating more advanced analytics, such as
machine learningmodels for predicting future COVID-19 trends, could provide deeper insights. This would involve extending theSparkjobs to include model training and inference within the pipeline. - Implementing a dashboard that provides
real-time visualizationsof predictive analytics alongside the current trends would add significant value.
- Expanding the pipeline to ingest and process data from additional sources, such as vaccination data or mobility reports, could provide a more comprehensive view of the pandemic's impact.
- Incorporating external data sources like weather or demographic data could also allow for more granular analyses and correlations.
- Enhancing the monitoring setup by integrating automated alerting mechanisms, such as those provided by Prometheus and Grafana, could help quickly identify and respond to any issues within the pipeline.
- Implementing detailed logging and tracking of data lineage would improve transparency and make troubleshooting easier.
- Fully migrating the project to a cloud-native architecture using managed services like
AWS MSK(Managed Streaming for Kafka),AWS EMR(Elastic MapReduce), andAWS Gluecould simplify maintenance and improve overall performance. - Leveraging
AWS Lambdafor serverless processing tasks could reduce operational costs and improve the responsiveness of the system.
- Adding layers of data privacy and security, such as encryption at rest and in transit, as well as implementing fine-grained access controls, would ensure the data pipeline meets stringent compliance requirements.
- Implementing audit trails and data masking techniques could further protect sensitive information.
Nguyen Trung Nghia
- Contact: [email protected]
- GitHub: Ren294
- Linkedln: tnghia294








