Developed by
This project automates the extraction, transformation and analysis of music data to identify patterns between:
- Critical acclaim (Grammy Awards)
- Commercial popularity (Spotify)
- Community engagement (Last.fm)
ETL:
The pipeline performs the following steps:
- Extract:
  - Reads raw Spotify data from a CSV file.
  - Extracts Grammy nomination data from a PostgreSQL database.
  - Requests community engagement metrics from the Last.fm API.
- Transform:
  - Cleans and preprocesses the Spotify, Grammy and Last.fm datasets.
  - Merges the datasets to align Spotify artists with Grammy nominations and Last.fm API metrics.
- Load:
  - Stores the final enriched dataset in a PostgreSQL database.
  - Uploads the results to Google Drive.
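The merge in the Transform step can be illustrated with standard-library Python. This is only a sketch under stated assumptions, not the project's implementation: the column names (`artist`, `track_name`, `popularity`, `nominee`, `listeners`), the placeholder sample rows, and the helpers `normalize_artist` and `merge_sources` are all hypothetical, and the real pipeline may use pandas and different join keys.

```python
import csv
import io

# Hypothetical placeholder inputs; real column names in the project may differ.
SPOTIFY_CSV = """artist,track_name,popularity
Artist A,Song One,80
artist b ,Song Two,65
"""

GRAMMY_ROWS = [  # stands in for rows read from PostgreSQL
    {"nominee": "ARTIST A", "category": "Record of the Year"},
]

LASTFM_METRICS = {  # stands in for Last.fm API responses, keyed by artist
    "artist a": {"listeners": 1000},
    "artist b": {"listeners": 250},
}


def normalize_artist(name):
    """Lowercase and collapse whitespace so all sources share one join key."""
    return " ".join(name.lower().split())


def merge_sources(spotify_csv, grammy_rows, lastfm_metrics):
    """Left-join Grammy nominations and Last.fm metrics onto Spotify rows."""
    nominated = {normalize_artist(row["nominee"]) for row in grammy_rows}
    merged = []
    for row in csv.DictReader(io.StringIO(spotify_csv)):
        key = normalize_artist(row["artist"])
        merged.append({
            "artist": key,
            "track_name": row["track_name"],
            "popularity": int(row["popularity"]),
            "grammy_nominated": key in nominated,
            # None when there is no Last.fm entry for this artist
            "listeners": lastfm_metrics.get(key, {}).get("listeners"),
        })
    return merged


if __name__ == "__main__":
    for record in merge_sources(SPOTIFY_CSV, GRAMMY_ROWS, LASTFM_METRICS):
        print(record)
```

Normalizing the artist name before joining matters because the three sources are unlikely to agree on casing and spacing; a left join keeps every Spotify row even when an artist has no nomination or no Last.fm entry.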
Requirements:
- Python 3.8+
- Apache Airflow
- PostgreSQL
- Google Drive API credentials
Clone the repository and create a virtual environment:

    git clone https://github.com/valentinabc19/etl_workshop002
    python -m venv venv
    source venv/bin/activate  # Linux/Mac
    venv\Scripts\activate     # Windows

Create a credentials.json file in the project root:
    {
        "db_host": "your_host",
        "db_name": "your_db",
        "db_user": "your_user",
        "db_password": "your_password",
        "db_port": "5432"
    }

Ensure this file is included in .gitignore.
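As an illustration of how a script might consume this file, the sketch below reads credentials.json and builds a PostgreSQL connection URL. It is an assumption about usage, not code from the repository: the helper names `load_credentials` and `build_postgres_url` are hypothetical, and the project may connect to the database in a different way.

```python
import json
from pathlib import Path
from urllib.parse import quote

# Keys taken from the credentials.json example above.
REQUIRED_KEYS = ("db_host", "db_name", "db_user", "db_password", "db_port")


def load_credentials(path="credentials.json"):
    """Read the credentials file and fail early if a key is missing."""
    creds = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError(f"credentials file is missing: {', '.join(missing)}")
    return creds


def build_postgres_url(creds):
    """Build a postgresql:// URL, percent-encoding the user and password."""
    user = quote(creds["db_user"], safe="")
    password = quote(creds["db_password"], safe="")
    return (
        f"postgresql://{user}:{password}"
        f"@{creds['db_host']}:{creds['db_port']}/{creds['db_name']}"
    )
```

Percent-encoding the user and password keeps characters such as `@` or `/` from being misread as URL separators.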
Install the dependencies, then initialize and start Airflow:

    pip install -r requirements.txt
    export AIRFLOW_HOME=$(pwd)/airflow
    airflow db init
    airflow webserver --port 8080
    airflow scheduler  # run in a second terminal

Place the raw Spotify dataset (spotify_dataset.csv) in data/raw.
Ensure the Grammy nominations data is stored in your PostgreSQL database under the table grammy_raw_data.
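Before triggering the pipeline, it can help to confirm that both raw inputs are in place. The following is a hypothetical pre-flight check, not part of the project: `missing_inputs` and `grammy_row_count_query` are illustrative names, and only the file path and table name come from the two steps above.

```python
from pathlib import Path

RAW_SPOTIFY_PATH = Path("data/raw/spotify_dataset.csv")  # path from the step above
GRAMMY_TABLE = "grammy_raw_data"  # table name from the step above


def missing_inputs(project_root="."):
    """Return a list of human-readable problems with the local raw data."""
    problems = []
    csv_path = Path(project_root) / RAW_SPOTIFY_PATH
    if not csv_path.is_file():
        problems.append(f"missing file: {csv_path}")
    elif csv_path.stat().st_size == 0:
        problems.append(f"empty file: {csv_path}")
    return problems


def grammy_row_count_query(table=GRAMMY_TABLE):
    """SQL a database client could run to confirm the table holds rows."""
    return f"SELECT COUNT(*) FROM {table};"


if __name__ == "__main__":
    for problem in missing_inputs():
        print(problem)
    print("Run against PostgreSQL:", grammy_row_count_query())
```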
- Open your browser and navigate to http://localhost:8080
- Log in with the default credentials that airflow standalone prints at startup
- In the Airflow UI, locate the etl_workshop002 DAG
- Toggle the On/Off switch to enable it
- Click the "Trigger DAG" button to start execution
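The same trigger can also be issued without the UI. The sketch below assumes an Airflow 2.x instance whose stable REST API is reachable at http://localhost:8080 with the basic-auth API backend enabled; none of this is described in the project itself, and the helper names `build_trigger_request` and `trigger_dag` are hypothetical. The DAG still has to be enabled for the queued run to start.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url, dag_id, username, password, conf=None):
    """Build (but do not send) a POST request that creates a new DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf or {}}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token.decode('ascii')}",
        },
    )


def trigger_dag(base_url, dag_id, username, password):
    """Send the request and return the decoded JSON response from Airflow."""
    request = build_trigger_request(base_url, dag_id, username, password)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `trigger_dag("http://localhost:8080", "etl_workshop002", "admin", "your_password")` would then start a run, where the user name and password are placeholders for your own Airflow credentials.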
- Real-time tracking: View task status in the Grid View
- Detailed logs: Access execution logs under airflow/logs/etl__workshop002/
Note: Ensure PostgreSQL is running and accessible during execution.