The Buenos Aires subway, also known as the "Subte", was the first underground metro system in Latin America and the Southern Hemisphere. The goal of this project is to apply Data Engineering concepts to extract, process, and visualize the passenger distribution for 2018 and 2019 as recorded by the station turnstiles. The steps required to accomplish this goal are the following:
- Upload all source data into a Data Lake.
- Format data and upload it into a Data Warehouse.
- Make a model with high-level transformations.
- Visualize data.
The dataset is obtained from the Buenos Aires Government Data Site. The site provides CSV files with records of the turnstiles. Each record contains information about the turnstile and the number of passengers that passed through it within a 15-minute interval.
Data from 2018 and 2019 were used for this project because the datasets from subsequent years differ significantly in structure.
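To make the record grain concrete, here is a minimal pandas sketch of rolling such 15-minute counts up to daily totals per line and station. The column names (`fecha`, `linea`, `estacion`, `pax_total`) are illustrative assumptions, not necessarily the exact schema of the source CSVs:

```python
import io

import pandas as pd

# Toy sample mimicking the turnstile records: one row per turnstile per
# 15-minute interval. Column names are assumptions, not the exact schema.
sample_csv = """fecha,desde,hasta,linea,estacion,pax_total
2018-01-02,08:00:00,08:15:00,LineaA,Piedras,41
2018-01-02,08:15:00,08:30:00,LineaA,Piedras,57
2018-01-02,08:00:00,08:15:00,LineaB,Florida,33
"""

df = pd.read_csv(io.StringIO(sample_csv), parse_dates=["fecha"])

# Roll the 15-minute counts up to daily totals per line and station.
daily = df.groupby(["fecha", "linea", "estacion"], as_index=False)["pax_total"].sum()
print(daily)
```

In the project itself the heavy aggregation is left to BigQuery and dbt; this snippet only illustrates the grain of the raw records.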
- Cloud: Google Cloud Platform (GCP)
- Infrastructure as code (IaC): Terraform
- Workflow orchestration: Prefect
- Data Warehouse: BigQuery
- Batch processing: dbt
- Data Visualization: Google Looker Studio
Here is a dashboard made in Google Looker Studio based on the data model transformed with dbt. You can look at it in more detail here.
And here is one made with Microsoft Power BI:
Reproduction of this project has been tested on an Ubuntu 20.04 LTS VM in Google Cloud. You will need a similar VM, a Google Cloud account, a project on Google Cloud, and a service account with appropriate permissions for the project. Credits to MikeColemn for creating such clear steps for the VM setup!
- If you need to set up a VM, an account, project, or service account on Google Cloud, see the Setup Readme for more detailed instructions.
- On your VM, clone the repo with `git clone https://github.com/EzicStar/BA-turnstiles-pipeline.git`, and then `cd` into the repo folder.
- If you need to install the Google Cloud CLI, Anaconda, and Terraform, you can run a bash script with `bash ./setup/setup.sh`, which will perform the following actions:
  - Apply initial updates and install prerequisites
  - Install the Google Cloud CLI application
  - Set up Anaconda and Terraform

  (Note) This may take a little time to process. If you see any prompts from updates, you can hit OK on the prompts, and press `f` at the MORE prompt during the Anaconda setup.
- Set up your conda virtual environment with the following commands:

  ```shell
  source ~/.bashrc   # only if you just installed Anaconda above and haven't restarted your shell session
  conda create -n baturnstiles python=3.9 -y
  conda activate baturnstiles
  pip install -r ./setup/requirements.txt
  ```
- Save your Google Cloud service account .json file to the ./creds folder. You can sftp the file from your local computer to that location, or simply open the file locally, copy its contents, run `nano ./creds/[filename].json` on the VM, paste the contents into the new blank file, and then press CTRL + X, then `Y`, then ENTER to save and exit.
- Set an environment variable for the service account file that you just saved:

  ```shell
  export GOOGLE_APPLICATION_CREDENTIALS="<absolute path to the json file in the ./creds folder>"
  ```

- Update the GOOGLE_APPLICATION_CREDENTIALS environment variable in the ./.env file, using the same absolute path to the .json service account file.
- Run `gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS` to authenticate with Google Cloud using your service account .json file.
- Run Terraform to deploy your infrastructure to your Google Cloud project, replacing `<project id here>` with your Google project ID:

  ```shell
  terraform -chdir="./Terraform" init                                    # initialize Terraform
  terraform -chdir="./Terraform" plan -var="project=<project id here>"   # build a deployment plan you can review
  terraform -chdir="./Terraform" apply -var="project=<project id here>"  # apply the plan and deploy the infrastructure
  ```
- Run the following commands to set up a local Prefect profile:

  ```shell
  prefect profile create baturnstiles
  prefect profile use baturnstiles
  prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api
  ```
- Open two new terminal windows and ssh into your VM. These additional terminals will be used to launch the Prefect Orion server and a work queue, which will process deployed pipelines.

  Additional window 1:

  ```shell
  conda activate baturnstiles
  prefect orion start
  ```

  Additional window 2:

  ```shell
  conda activate baturnstiles
  prefect agent start --work-queue "default"
  ```
- From your original terminal session, run `python ./setup/prefect_setup_blocks.py` to set up blocks for your GCP credentials in Prefect. If a KeyError is raised, you may need to repeat the service account authentication steps above.
- Add your project ID in the `write_bq()` function from `etl_gcs_to_bq.py`. You can do this by running `nano ./Pipelines/etl_gcs_to_bq.py` and changing the string where it says `'your-project-id'`.
- From your original terminal session, run the following three commands to deploy the pipeline to Prefect and then run it for all years of data:

  ```shell
  prefect deployment build Pipelines/parent_flow.py:parent_flow -n "Baturnstiles-ETL"
  prefect deployment apply parent_flow-deployment.yaml
  prefect deployment run parent-flow/Baturnstiles-ETL -p "years=[2018, 2019]"
  ```
The full pipeline may take 10-20 minutes to run. You can switch to the work-queue terminal session to watch the progress if you like. Once the run finishes, you can shut down the VM to avoid costs.
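Conceptually, that deployment run fans the `years` parameter out into many smaller ETL runs. The sketch below approximates that structure in plain Python, without the actual Prefect decorators; the helper name `build_runs` and the per-month grain are assumptions for illustration only:

```python
from itertools import product

# Plain-Python sketch of the fan-out a parameterized parent flow performs.
# In the repo this logic lives in a Prefect flow (parent_flow); build_runs
# is a hypothetical helper name used here for illustration.
def build_runs(years):
    """Expand the `years` parameter into one (year, month) run each."""
    return list(product(years, range(1, 13)))

def parent_flow(years):
    runs = build_runs(years)
    for year, month in runs:
        # The real pipeline would extract the month's CSV, load it to the
        # GCS data lake, and then write it to BigQuery.
        pass
    return len(runs)

print(parent_flow([2018, 2019]))  # → 24 monthly runs
```

Each (year, month) unit failing or retrying independently is what makes the work-queue terminal useful for watching progress.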
- In Google Cloud, go to the BigQuery workspace and create a new query tab. Copy the contents of queries.txt and run the first SQL statement to create a partitioned and clustered table. Below you can see some queries that demonstrate the performance improvement from using a partitioned and clustered table.
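The improvement comes from partition pruning: a query filtered on the partition column only scans the matching partitions instead of the whole table. This toy in-memory analogy (plain Python, not BigQuery) illustrates the difference in rows scanned:

```python
from collections import defaultdict

# Toy rows shaped like (date, line, passengers); dates play the role of
# BigQuery's partitioning column.
rows = [
    ("2018-01-02", "LineaA", 41),
    ("2018-01-02", "LineaB", 33),
    ("2018-01-03", "LineaA", 57),
    ("2019-06-10", "LineaC", 80),
]

def full_scan(rows, date):
    """Unpartitioned table: every row must be examined."""
    scanned = len(rows)
    total = sum(p for d, _, p in rows if d == date)
    return total, scanned

# "Partition" the rows by date, analogous to partitioning on a date column.
partitions = defaultdict(list)
for row in rows:
    partitions[row[0]].append(row)

def pruned_scan(partitions, date):
    """Partitioned table: only the matching partition is examined."""
    part = partitions.get(date, [])
    scanned = len(part)
    total = sum(p for _, _, p in part)
    return total, scanned
```

Both scans return the same total, but the pruned scan touches only the rows in one partition, which in BigQuery translates directly into less data billed and faster queries.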
- Make high-level transformations of the data using dbt. Create a dbt account here.
- Connect to BigQuery and add a repository following these instructions. Creating a new repo is recommended.
- Add the seeds and model in the folder named dbt to your new repo. Then load the seeds using `dbt seed`.
- Run the model using `dbt run`.
Important note: once you're done evaluating this project, make sure to stop and remove any cloud resources. If you're using a cloud VM, stop it in the VM Instances screen of the Google Cloud Console, and delete it if you no longer need it, so it isn't left running and using up your credits. In addition, you can use Terraform to destroy your buckets and datasets:

```shell
terraform -chdir="./Terraform" destroy -var="project=<project id here>"
```
- Add tests
- Add a CI/CD Pipeline
- Dockerise the Prefect Orion server and agent
- Adapt ingestion to the different CSV formats published on BA Open Data


