Skip to content

RenoX23/github-engineering-analytics-pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

8 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

GitHub Engineering Analytics Pipeline πŸš€

Overview

The GitHub Engineering Analytics Pipeline is a production-style Data Engineering project designed to ingest, process, transform, and orchestrate GitHub repository event data using a modern cloud-native data stack.

This project demonstrates:

  • End-to-end ETL pipeline engineering
  • Incremental and idempotent ingestion
  • Dockerized infrastructure
  • PostgreSQL-based warehouse architecture
  • Airflow orchestration
  • Transformation pipelines for analytics-ready data
  • Production-oriented engineering practices

The pipeline fetches GitHub repository activity data using the GitHub REST API, stores raw JSON events inside PostgreSQL, transforms them into normalized analytical tables, and orchestrates execution using Apache Airflow.


Architecture πŸ—οΈ

                    +----------------------+
                    |    GitHub REST API   |
                    +----------+-----------+
                               |
                               v
                    +----------------------+
                    |   Extract Layer      |
                    |  (Python Requests)   |
                    +----------+-----------+
                               |
                               v
                    +----------------------+
                    |   Raw Data Layer     |
                    | PostgreSQL JSONB     |
                    +----------+-----------+
                               |
                               v
                    +----------------------+
                    | Transformation Layer |
                    |  Python ETL Models   |
                    +----------+-----------+
                               |
                               v
                    +----------------------+
                    | Processed Warehouse  |
                    | Normalized Tables    |
                    +----------+-----------+
                               |
                               v
                    +----------------------+
                    | Apache Airflow DAG   |
                    | Orchestration Layer  |
                    +----------------------+

Key Features ✨

Data Engineering Features

  • GitHub API ingestion
  • Incremental loading
  • Pagination handling
  • Retry mechanism for API failures
  • Idempotent event loading
  • Metadata-driven ingestion tracking
  • JSONB raw storage
  • Structured transformation pipeline
  • Warehouse-style normalized schema

Infrastructure Features

  • Dockerized PostgreSQL
  • Dockerized Airflow orchestration
  • Python virtual environment setup
  • Modular project architecture
  • Environment variable management
  • Logging integration
  • Containerized execution workflows

Orchestration Features

  • Apache Airflow DAG scheduling
  • Retry policies
  • Task execution monitoring
  • Workflow automation
  • DAG-based orchestration

Technology Stack πŸ› οΈ

Category Technology
Programming Language Python 3.12
Database PostgreSQL
Orchestration Apache Airflow
Containerization Docker
ORM SQLAlchemy
API Integration GitHub REST API
Environment Management python-dotenv
Retry Handling Tenacity
Logging Python logging
Package Management pip
Workflow Scheduling Airflow DAGs

Project Structure πŸ“‚

github-engineering-analytics-pipeline/
β”‚
β”œβ”€β”€ airflow/
β”‚   β”œβ”€β”€ dags/
β”‚   β”‚   └── github_pipeline_dag.py
β”‚   β”œβ”€β”€ logs/
β”‚   └── plugins/
β”‚
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ config/
β”‚   β”œβ”€β”€ extract/
β”‚   β”‚   β”œβ”€β”€ extractors.py
β”‚   β”‚   └── github_client.py
β”‚   β”‚
β”‚   β”œβ”€β”€ load/
β”‚   β”‚   β”œβ”€β”€ init_db.py
β”‚   β”‚   β”œβ”€β”€ metadata_loader.py
β”‚   β”‚   β”œβ”€β”€ processed_loader.py
β”‚   β”‚   β”œβ”€β”€ raw_loader.py
β”‚   β”‚   └── raw_reader.py
β”‚   β”‚
β”‚   β”œβ”€β”€ transform/
β”‚   β”‚   └── github_transformer.py
β”‚   β”‚
β”‚   β”œβ”€β”€ utils/
β”‚   β”‚   β”œβ”€β”€ logger.py
β”‚   β”‚   └── retry_handler.py
β”‚   β”‚
β”‚   └── main.py
β”‚
β”œβ”€β”€ docker/
β”‚   └── airflow.Dockerfile
β”‚
β”œβ”€β”€ logs/
β”œβ”€β”€ screenshots/
β”œβ”€β”€ sql/
β”‚   β”œβ”€β”€ analytical_queries.sql
β”‚   └── schema.sql
β”‚
β”œβ”€β”€ .env
β”œβ”€β”€ docker-compose.yml
β”œβ”€β”€ docker-compose.airflow.yml
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ requirements-airflow.txt
└── README.md

Data Pipeline Flow πŸ”„

Step 1 β€” Extract

The pipeline fetches GitHub repository event data using the GitHub REST API.

Data Extracted

  • Push events
  • Pull request events
  • Repository metadata
  • Contributor activity
  • Event timestamps
  • Commit messages

Step 2 β€” Raw Data Layer

Raw GitHub events are stored in PostgreSQL using the JSONB datatype.

Benefits of JSONB Storage

  • Flexible schema handling
  • Full raw event retention
  • Replayable transformations
  • Semi-structured analytics support

Step 3 β€” Incremental Ingestion

The pipeline tracks previously ingested timestamps using a metadata table.

Benefits

  • Prevents duplicate ingestion
  • Supports incremental processing
  • Reduces API usage
  • Enables stateful execution

Step 4 β€” Transformation Layer

Raw events are transformed into analytics-ready warehouse tables.

Transformation Models

  • repositories
  • contributors
  • commits
  • pull_requests

Step 5 β€” Processed Warehouse Layer

Normalized tables are loaded into PostgreSQL.

Analytics Use Cases

  • Contributor activity analysis
  • Commit trend analysis
  • Pull request analysis
  • Repository engagement analytics

Step 6 β€” Orchestration

Apache Airflow orchestrates scheduled pipeline execution.

Airflow Responsibilities

  • Scheduling
  • Task retries
  • DAG orchestration
  • Workflow automation
  • Execution monitoring

Database Schema πŸ—„οΈ

Raw Layer

raw_github_events

Column Type
id SERIAL
github_event_id VARCHAR
event_type VARCHAR
repo_name VARCHAR
payload JSONB
created_at TIMESTAMP
ingested_at TIMESTAMP

Metadata Layer

ingestion_metadata

Column Type
id SERIAL
pipeline_name VARCHAR
last_event_created_at TIMESTAMP

Processed Layer

repositories

Column Type
repo_id SERIAL
repo_name VARCHAR
url TEXT

contributors

Column Type
contributor_id SERIAL
username VARCHAR

commits

Column Type
commit_id VARCHAR
repo_name VARCHAR
contributor_username VARCHAR
commit_message TEXT
commit_timestamp TIMESTAMP

pull_requests

Column Type
pr_id BIGINT
repo_name VARCHAR
contributor_username VARCHAR
pr_title TEXT
state VARCHAR
created_at TIMESTAMP
closed_at TIMESTAMP

Setup Instructions βš™οΈ

Prerequisites

Install:

  • Python 3.12+
  • Docker
  • Docker Compose
  • Git
  • PostgreSQL client tools (optional)

Clone Repository

git clone <repository_url>
cd github-engineering-analytics-pipeline

Create Virtual Environment

python -m venv venv
source venv/bin/activate

Install Dependencies

pip install -r requirements.txt

Configure Environment Variables

Create:

.env

Add:

GITHUB_TOKEN=your_github_token

POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_DB=github_analytics
POSTGRES_HOST=localhost
POSTGRES_PORT=5432

Start PostgreSQL

docker compose up -d

Initialize Database Schema

python -m app.load.init_db

Run Pipeline

python -m app.main

Airflow Setup πŸŒͺ️

Build Airflow Environment

docker compose -f docker-compose.airflow.yml up --build -d

Initialize Airflow Database

docker compose -f docker-compose.airflow.yml run airflow-webserver airflow db migrate

Create Airflow Admin User

docker compose -f docker-compose.airflow.yml run airflow-webserver airflow users create \
  --username admin \
  --firstname admin \
  --lastname admin \
  --role Admin \
  --email admin@example.com \
  --password admin

Start Airflow Services

docker compose -f docker-compose.airflow.yml up -d

Access Airflow UI

http://localhost:8090

Credentials:

Username: admin
Password: admin

Sample Analytical Queries πŸ“Š

Most Active Contributors

SELECT
    contributor_username,
    COUNT(*) AS total_commits
FROM commits
GROUP BY contributor_username
ORDER BY total_commits DESC
LIMIT 10;

Most Active Repositories

SELECT
    repo_name,
    COUNT(*) AS total_events
FROM raw_github_events
GROUP BY repo_name
ORDER BY total_events DESC;

Pull Request State Distribution

SELECT
    state,
    COUNT(*) AS total_prs
FROM pull_requests
GROUP BY state;

Engineering Concepts Demonstrated 🧠

Data Engineering

  • ETL pipelines
  • Incremental loading
  • Idempotent ingestion
  • Data warehousing
  • Raw vs processed data layers
  • Transformation modeling
  • Metadata-driven pipelines

DevOps & Infrastructure

  • Docker containerization
  • Service orchestration
  • Infrastructure debugging
  • Dependency isolation
  • Container networking
  • Environment management

Workflow Orchestration

  • Airflow DAGs
  • Scheduling
  • Retry policies
  • Task execution
  • Workflow monitoring
  • Pipeline automation

Challenges Faced ⚠️

During development, several real-world infrastructure and orchestration challenges were encountered and resolved.

Key Engineering Challenges

  • Airflow dependency conflicts
  • SQLAlchemy compatibility issues
  • Docker networking behavior
  • Container permission management
  • Airflow orchestration debugging
  • Python module resolution
  • Containerized environment isolation

These issues provided hands-on experience with production-style infrastructure troubleshooting.


Known Infrastructure Limitation ⚠️

When executed inside Airflow containers, the pipeline may require container-to-host PostgreSQL networking adjustments depending on the local Docker or WSL environment configuration.

The ETL pipeline itself executes successfully in the local environment, and Airflow orchestration is fully configured.


Future Improvements πŸš€

Planned Enhancements

  • dbt integration
  • Star schema warehouse modeling
  • Kafka streaming ingestion
  • AWS/GCP deployment
  • Data quality testing
  • CI/CD integration
  • Grafana/Metabase dashboards
  • Kubernetes deployment
  • Prometheus monitoring
  • Great Expectations validation

Screenshots πŸ“Έ

SQL SCHEMA

CONTAINERS LOGS LOGIN AIRFLOW PIPELINE DAG

Recommended Screenshots

Add screenshots for:

  • Airflow DAG UI
  • PostgreSQL tables
  • Successful pipeline execution
  • Docker containers
  • DAG execution logs
  • Architecture diagram

Store screenshots inside:

screenshots/

Resume Value 🎯

This project demonstrates practical skills relevant for:

  • Data Engineering Internships
  • Junior Data Engineer roles
  • Analytics Engineering roles
  • DataOps roles
  • Platform Engineering internships

Skills Demonstrated πŸ“Œ

  • Python
  • SQL
  • PostgreSQL
  • Apache Airflow
  • Docker
  • SQLAlchemy
  • ETL Development
  • Data Warehousing
  • API Engineering
  • Workflow Automation
  • Infrastructure Debugging
  • Data Modeling

Learning Outcomes πŸ“š

This project provided hands-on experience with:

  • Building production-style ETL pipelines
  • Designing layered data architectures
  • Implementing orchestration workflows
  • Managing Dockerized infrastructure
  • Debugging Airflow runtime issues
  • Handling incremental ingestion logic
  • Building analytics-ready warehouse layers

Author πŸ‘¨β€πŸ’»

Renold

M.Tech Computer Science

Focused on:

  • Data Engineering
  • Cloud Infrastructure
  • DevOps
  • Analytics Engineering
  • Data Platforms

License πŸ“„

This project is intended for educational, learning, and portfolio purposes.

About

Production-style ETL pipeline for ingesting and analyzing GitHub engineering activity using Python, PostgreSQL, Docker, and Data Engineering workflows.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors