A data pipeline that ingests a dataset of tweets about the Russian/Ukrainian war, performs data cleaning and preprocessing, and finally runs sentiment analysis to build a general understanding of the sentiment of the tweets.
This is an end-to-end project to gain insight into the overall sentiment about the Russian/Ukrainian war on Twitter. The project covers both data engineering (building the data pipeline) and data analytics (building a visualization dashboard). I used Terraform to provision the cloud infrastructure, Apache Airflow to orchestrate the pipeline workflow, Apache Spark for data ingestion, preprocessing and extracting insights, and Google Data Studio to build the visualization dashboard.
- Perform sentiment analysis on 15M+ tweets about the Russian/Ukrainian war to gain a rough understanding of the public sentiment on Twitter about the ongoing conflict.
- Build a visualization dashboard to present the extracted insights.
The dataset contains tweets monitoring the ongoing Ukraine-Russia conflict. Context and history of the conflict can be found here. The dataset is available on Kaggle.
- Terraform (IaC)
- Cloud: Google Cloud Platform (GCP)
- Managed Cloud Scheduling: Google Composer
- Managed Processing Cluster: Google Dataproc
- Data Lake: Google Cloud Storage (GCS)
- Data Warehouse: Google BigQuery (GBQ)
- Data Visualization: Google Data Studio (GDS)
- Orchestration: Apache Airflow
- Data Transformation: Apache Spark (Pyspark)
- Scripting Language: Python
I created an Airflow DAG that creates a Dataproc cluster and runs the Spark jobs that perform the entire ETL process (minimal sketches of the DAG and the Spark jobs follow the list below). The DAG consists of the following steps:
- create_dataproc_cluster:
  - Create a Dataproc cluster using the "DataprocCreateClusterOperator" Airflow operator.
- ingest_data:
  - Read the data from the GCS bucket and ingest it into a Spark dataframe.
  - Perform data cleaning and preprocessing.
  - Run the pretrained model to extract the sentiment of each tweet.
  - Upload the dataframe to the data lake, Google Cloud Storage (GCS), in Parquet format.
- create_bigquery_table:
  - gcs_bigquery: create the table in the data warehouse, Google BigQuery (GBQ), from the Parquet files in the data lake.
- create_insights:
  - Read the data from the BigQuery table and ingest it into a Spark dataframe.
  - Perform analysis on the dataframe to extract useful insights.
  - bigquery: create 4 tables in the data warehouse containing the insights extracted from the dataframe.
- delete_dataproc_cluster:
  - Delete the Dataproc cluster after all submitted jobs finish, to avoid unnecessary costs.
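Below is a minimal sketch of what such a DAG could look like, assuming Airflow 2 with the Google provider package installed. The project ID, region, bucket name, cluster configuration and job file names are placeholders for illustration, not the project's actual values.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.utils.trigger_rule import TriggerRule

PROJECT_ID = "my-gcp-project"   # placeholder
REGION = "europe-west1"         # placeholder
CLUSTER_NAME = "tweets-cluster" # placeholder
BUCKET = "GS_Bucket"            # placeholder

def pyspark_job(main_file_uri):
    # Dataproc job spec for a PySpark script stored in the GCS bucket.
    return {
        "reference": {"project_id": PROJECT_ID},
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {"main_python_file_uri": main_file_uri},
    }

with DAG("pipeline_dag", start_date=datetime(2022, 4, 1),
         schedule_interval=None, catchup=False) as dag:

    create_dataproc_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        project_id=PROJECT_ID, region=REGION, cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    )

    ingest_data = DataprocSubmitJobOperator(
        task_id="ingest_data", project_id=PROJECT_ID, region=REGION,
        job=pyspark_job(f"gs://{BUCKET}/pipeline_jobs/ingest_data.py"),
    )

    create_bigquery_table = GCSToBigQueryOperator(
        task_id="create_bigquery_table",
        bucket=BUCKET,
        source_objects=["processed/tweets/*.parquet"],
        destination_project_dataset_table=f"{PROJECT_ID}.tweets_dataset.tweets",
        source_format="PARQUET",
        write_disposition="WRITE_TRUNCATE",
    )

    create_insights = DataprocSubmitJobOperator(
        task_id="create_insights", project_id=PROJECT_ID, region=REGION,
        job=pyspark_job(f"gs://{BUCKET}/pipeline_jobs/create_insights.py"),
    )

    delete_dataproc_cluster = DataprocDeleteClusterOperator(
        task_id="delete_dataproc_cluster",
        project_id=PROJECT_ID, region=REGION, cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,  # delete the cluster even if a job fails
    )

    create_dataproc_cluster >> ingest_data >> create_bigquery_table >> create_insights >> delete_dataproc_cluster
```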
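And a rough sketch of the ingest_data PySpark job, assuming the raw tweets sit in GCS as CSV files with a `text` column, and using NLTK's VADER analyser as a stand-in for the pretrained sentiment model; the bucket paths and column names are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("ingest_tweets").getOrCreate()

# Read the raw tweet CSV files from the data lake (path is a placeholder).
raw = (spark.read
       .option("header", True)
       .option("multiLine", True)
       .option("escape", '"')
       .csv("gs://GS_Bucket/raw/*.csv.gz"))

# Basic cleaning: drop empty tweets, strip URLs and mentions, normalise whitespace.
clean = (raw
         .filter(F.col("text").isNotNull())
         .withColumn("text", F.regexp_replace("text", r"http\S+|@\w+", ""))
         .withColumn("text", F.trim(F.regexp_replace("text", r"\s+", " "))))

# Stand-in for the project's pretrained sentiment model: VADER, created
# lazily once per executor process instead of once per row.
_sia = None

def vader_sentiment(text):
    global _sia
    if _sia is None:
        from nltk.sentiment import SentimentIntensityAnalyzer
        _sia = SentimentIntensityAnalyzer()
    score = _sia.polarity_scores(text)["compound"]
    return "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"

sentiment_udf = F.udf(vader_sentiment, StringType())
scored = clean.withColumn("sentiment", sentiment_udf(F.col("text")))

# Write the enriched dataframe back to the data lake as Parquet.
scored.write.mode("overwrite").parquet("gs://GS_Bucket/processed/tweets/")
```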
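Finally, a sketch of the create_insights job, assuming the spark-bigquery connector is available on the cluster and that the ingested table has `text`, `sentiment` and `tweetcreatedts` columns; the dataset/table names and the hashtag-extraction logic are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("create_insights").getOrCreate()
# Staging bucket used by the spark-bigquery connector for writes (placeholder).
spark.conf.set("temporaryGcsBucket", "GS_Bucket")

tweets = (spark.read.format("bigquery")
          .option("table", "my-gcp-project.tweets_dataset.tweets")  # placeholder
          .load())

# Top 10 hashtags across all tweets.
top10_hashtags = (tweets
                  .withColumn("hashtag", F.explode(F.split(F.lower("text"), r"\s+")))
                  .filter(F.col("hashtag").startswith("#"))
                  .groupBy("hashtag").count()
                  .orderBy(F.desc("count"))
                  .limit(10))

# Share of positive / neutral / negative tweets overall.
total = tweets.count()
overall_sentiment_ratio = (tweets.groupBy("sentiment").count()
                           .withColumn("ratio", F.col("count") / F.lit(total)))

# Number of tweets per day.
count_tweets_over_time = (tweets
                          .withColumn("date", F.to_date("tweetcreatedts"))
                          .groupBy("date").count())

# Persist each insight as its own table in the data warehouse (dataset is a placeholder).
for name, df in [("top10Hashtags", top10_hashtags),
                 ("overall_sentiment_ratio", overall_sentiment_ratio),
                 ("count_tweets_over_time", count_tweets_over_time)]:
    (df.write.format("bigquery")
       .option("table", f"tweets_dataset.{name}")
       .mode("overwrite")
       .save())
```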
I created a visualization dashboard consisting of 4 tiles showing the end results of the project.
Tweets Sentiment Analysis Dashboard: url
Required Services:
- Google Cloud Platform - register an account with a credit card; GCP gives you $300 in free credits for 3 months.
- Create a service account & download its keys as a JSON file. The JSON file will be needed in later steps.
- Enable the APIs for the services used (Google Compute Engine, Google Cloud Storage, Cloud Composer, Cloud Dataproc & Google BigQuery).
- Follow the Local Setup for Terraform and GCP here.
- Copy your GCP project name into the "project" variable in variable.tf.
- run the following commands:
```bash
terraform init
terraform apply -auto-approve
```
Provisioning the cloud infrastructure may take up to 20 minutes.
- Copy the "pip-install.sh" file from your local storage into GS bucket scripts folder.
```bash
gsutil cp path_to_the_project/tweets_pipeline/airflow/scripts/pip-install.sh gs://GS_Bucket/scripts/
```
- Copy the Airflow DAG file into the airflow/dags folder.
```bash
gsutil cp path_to_the_project/tweets_pipeline/airflow/dags/pipeline_dag.py gs://CLOUD_COMPOSER_BUCKET/dags/
```
You can find the Airflow DAGs folder by going to the Cloud Composer page and clicking on the "DAGs" tab, and you can access the Airflow web UI by clicking on the "Airflow" tab.
- Copy the Airflow DAG tasks into the GCS bucket.
```bash
gsutil cp path_to_the_project/tweets_pipeline/airflow/tasks/* gs://GS_Bucket/pipeline_jobs/
```
- From the Airflow UI, trigger the pipeline_dag DAG and wait for it to finish.
This process may take up to 2.5 hours.
- Open Google Data Studio, create a blank report, then connect it to BigQuery.
It will ask you to authorise access to BigQuery - do as instructed.
- Choose the project and add the following tables to the report:
- top10Hashtags
- count_tweets_over_time
- averall_sentiment_ratio_over_time
- overall_sentiment_ratio
- Add the 4 charts as shown on the dashboard shared here, or simply make a copy of the shared dashboard and connect it to your own tables.
- Add a streaming pipeline (Apache Kafka) to this project to extract live tweets and process them in real time.
- Train a more accurate model to extract the sentiment of the tweets.
Huge thanks to DataTalks.Club for putting so much effort into these courses to contribute to the community. Can't thank you enough.
Also, thanks to bwandowando for providing the tweets dataset. You're very kind to share your work with the public for academic purposes.


