The GitHub Engineering Analytics Pipeline is a production-style data engineering project that ingests GitHub repository event data, transforms it into analytics-ready tables, and orchestrates the whole flow on a containerized stack built from Python, PostgreSQL, Apache Airflow, and Docker.
This project demonstrates:
- End-to-end ETL pipeline engineering
- Incremental and idempotent ingestion
- Dockerized infrastructure
- PostgreSQL-based warehouse architecture
- Airflow orchestration
- Transformation pipelines for analytics-ready data
- Production-oriented engineering practices
The pipeline fetches GitHub repository activity data using the GitHub REST API, stores raw JSON events inside PostgreSQL, transforms them into normalized analytical tables, and orchestrates execution using Apache Airflow.

    +----------------------+
    |   GitHub REST API    |
    +----------+-----------+
               |
               v
    +----------------------+
    |    Extract Layer     |
    |  (Python Requests)   |
    +----------+-----------+
               |
               v
    +----------------------+
    |    Raw Data Layer    |
    |   PostgreSQL JSONB   |
    +----------+-----------+
               |
               v
    +----------------------+
    | Transformation Layer |
    |  Python ETL Models   |
    +----------+-----------+
               |
               v
    +----------------------+
    | Processed Warehouse  |
    |  Normalized Tables   |
    +----------+-----------+
               |
               v
    +----------------------+
    |  Apache Airflow DAG  |
    | Orchestration Layer  |
    +----------------------+

Data engineering:
- GitHub API ingestion
- Incremental loading
- Pagination handling
- Retry mechanism for API failures
- Idempotent event loading
- Metadata-driven ingestion tracking
- JSONB raw storage
- Structured transformation pipeline
- Warehouse-style normalized schema

Infrastructure:
- Dockerized PostgreSQL
- Dockerized Airflow orchestration
- Python virtual environment setup
- Modular project architecture
- Environment variable management
- Logging integration
- Containerized execution workflows

Orchestration:
- Apache Airflow DAG scheduling
- Retry policies
- Task execution monitoring
- Workflow automation
- DAG-based orchestration
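
The retry mechanism for API failures listed above could look roughly like the sketch below. This project names Tenacity for retry handling; to stay dependency-free, the sketch swaps in a hand-rolled decorator instead. The names `retry_with_backoff` and `TransientAPIError`, and the delay values, are illustrative assumptions and not the contents of the project's `retry_handler.py`.

```python
import functools
import time


class TransientAPIError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, 5xx)."""


def retry_with_backoff(max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry the wrapped function on TransientAPIError with exponential backoff."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except TransientAPIError:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Wait base_delay, then 2x, then 4x, and so on.
                    sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator
```

Passing `sleep` as a parameter keeps the decorator easy to exercise in tests without real waiting. With Tenacity, roughly the same policy is usually expressed through its `retry` decorator combined with `stop_after_attempt` and `wait_exponential`.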
| Category | Technology |
|---|---|
| Programming Language | Python 3.12 |
| Database | PostgreSQL |
| Orchestration | Apache Airflow |
| Containerization | Docker |
| ORM | SQLAlchemy |
| API Integration | GitHub REST API |
| Environment Management | python-dotenv |
| Retry Handling | Tenacity |
| Logging | Python logging |
| Package Management | pip |
| Workflow Scheduling | Airflow DAGs |

    github-engineering-analytics-pipeline/
    │
    ├── airflow/
    │   ├── dags/
    │   │   └── github_pipeline_dag.py
    │   ├── logs/
    │   └── plugins/
    │
    ├── app/
    │   ├── config/
    │   ├── extract/
    │   │   ├── extractors.py
    │   │   └── github_client.py
    │   │
    │   ├── load/
    │   │   ├── init_db.py
    │   │   ├── metadata_loader.py
    │   │   ├── processed_loader.py
    │   │   ├── raw_loader.py
    │   │   └── raw_reader.py
    │   │
    │   ├── transform/
    │   │   └── github_transformer.py
    │   │
    │   ├── utils/
    │   │   ├── logger.py
    │   │   └── retry_handler.py
    │   │
    │   └── main.py
    │
    ├── docker/
    │   └── airflow.Dockerfile
    │
    ├── logs/
    ├── screenshots/
    ├── sql/
    │   ├── analytical_queries.sql
    │   └── schema.sql
    │
    ├── .env
    ├── docker-compose.yml
    ├── docker-compose.airflow.yml
    ├── requirements.txt
    ├── requirements-airflow.txt
    └── README.md

The pipeline fetches GitHub repository event data using the GitHub REST API. The collected data includes:
- Push events
- Pull request events
- Repository metadata
- Contributor activity
- Event timestamps
- Commit messages
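
Pagination against the GitHub REST API is driven by the `Link` response header, which carries a `rel="next"` URL while more pages remain. The sketch below shows one way to follow those links. It is a minimal illustration, not the code in `github_client.py`: `parse_next_link`, `fetch_all_events`, and the injected `fetch_page` callable are hypothetical names.

```python
import re

# Matches one entry of a Link header, e.g. <https://...>; rel="next"
_LINK_PATTERN = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def parse_next_link(link_header):
    """Return the URL tagged rel="next" in a Link header, or None."""
    if not link_header:
        return None
    for url, rel in _LINK_PATTERN.findall(link_header):
        if rel == "next":
            return url
    return None


def fetch_all_events(first_url, fetch_page):
    """Follow rel="next" links until the API stops returning one.

    fetch_page(url) is a caller-supplied function (hypothetical here) that
    returns (events, link_header) for a single page.
    """
    events, url = [], first_url
    while url:
        page, link_header = fetch_page(url)
        events.extend(page)
        url = parse_next_link(link_header)
    return events
```

With the Requests library, `fetch_page` could return `(response.json(), response.headers.get("Link"))`. Keeping the HTTP call outside the loop makes the paging logic testable without network access.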
Raw GitHub events are stored in PostgreSQL using the JSONB data type, which provides:
- Flexible schema handling
- Full raw event retention
- Replayable transformations
- Semi-structured analytics support
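
Idempotent loading into the raw layer can be sketched with an `INSERT ... ON CONFLICT DO NOTHING` statement keyed on the GitHub event id. The example below is an assumption-laden illustration rather than the project's `raw_loader.py`: it uses an in-memory SQLite database as a stand-in so it runs without a server, stores the payload as text where PostgreSQL would use JSONB, and assumes a unique constraint on `github_event_id`. `load_raw_events` is a hypothetical helper.

```python
import json
import sqlite3

INSERT_RAW_EVENT = """
INSERT INTO raw_github_events
    (github_event_id, event_type, repo_name, payload, created_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (github_event_id) DO NOTHING
"""


def load_raw_events(conn, events):
    """Insert events, skipping ids already stored; return rows actually added."""
    before = conn.total_changes
    for event in events:
        conn.execute(
            INSERT_RAW_EVENT,
            (
                event["id"],
                event["type"],
                event["repo"]["name"],
                json.dumps(event),  # assumption: the full event is kept for replay
                event["created_at"],
            ),
        )
    conn.commit()
    return conn.total_changes - before


# SQLite stand-in for the PostgreSQL table; payload would be JSONB there.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE raw_github_events (
        id INTEGER PRIMARY KEY,
        github_event_id TEXT UNIQUE,  -- assumed unique constraint
        event_type TEXT,
        repo_name TEXT,
        payload TEXT,
        created_at TEXT
    )"""
)
```

Loading the same batch twice leaves the table unchanged, which is what makes a re-run after a partial failure safe. Against PostgreSQL the `ON CONFLICT` clause keeps the same form; only the placeholder style changes with the driver.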
The pipeline records the timestamp of the last ingested event in a metadata table and reads it back at the start of the next run. This:
- Prevents duplicate ingestion
- Supports incremental processing
- Reduces API usage
- Enables stateful execution
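
The incremental step amounts to comparing each event's `created_at` against the stored `last_event_created_at` value and advancing that watermark afterwards. A minimal sketch follows, assuming events arrive as dictionaries with ISO 8601 timestamps; `parse_github_timestamp` and `select_new_events` are hypothetical names, not the project's `metadata_loader.py`.

```python
from datetime import datetime


def parse_github_timestamp(value):
    """Parse an ISO 8601 timestamp such as '2024-05-01T12:00:00Z' as aware UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_new_events(events, last_event_created_at):
    """Return events newer than the stored watermark, plus the advanced watermark.

    last_event_created_at is None on the very first run, so everything is kept.
    """
    fresh = [
        event
        for event in events
        if last_event_created_at is None
        or parse_github_timestamp(event["created_at"]) > last_event_created_at
    ]
    timestamps = [parse_github_timestamp(event["created_at"]) for event in fresh]
    # With nothing new, the watermark stays where it was.
    new_watermark = max(timestamps, default=last_event_created_at)
    return fresh, new_watermark
```

The strict `>` comparison drops events that share the watermark's exact timestamp. If that matters, a `>=` comparison combined with idempotent inserts would re-read the boundary events without creating duplicates.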
Raw events are transformed into the following analytics-ready warehouse tables:
- repositories
- contributors
- commits
- pull_requests
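
As an illustration of the transformation step, the sketch below flattens one raw push event into rows shaped like the `commits` table. It assumes the event follows the GitHub Events API layout (`type`, `repo.name`, `actor.login`, `created_at`) and that the payload carries a `commits` array with `sha` and `message` fields. `push_event_to_commit_rows` is a hypothetical name, not the function in `github_transformer.py`.

```python
def push_event_to_commit_rows(event):
    """Flatten one raw PushEvent into rows shaped like the commits table."""
    if event.get("type") != "PushEvent":
        return []
    rows = []
    for commit in event.get("payload", {}).get("commits", []):
        rows.append(
            {
                "commit_id": commit["sha"],
                "repo_name": event["repo"]["name"],
                "contributor_username": event["actor"]["login"],
                "commit_message": commit.get("message", ""),
                # Assumption: no per-commit time is available in the event,
                # so the event's created_at is used instead.
                "commit_timestamp": event["created_at"],
            }
        )
    return rows
```

Because the function reads only from the stored raw event, it can be re-run over the raw layer at any time, which is what makes the transformations replayable.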
Normalized tables are loaded into PostgreSQL, where they support:
- Contributor activity analysis
- Commit trend analysis
- Pull request analysis
- Repository engagement analytics
Apache Airflow orchestrates scheduled pipeline execution and provides:
- Scheduling
- Task retries
- DAG orchestration
- Workflow automation
- Execution monitoring
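
The ordering that a DAG enforces can be illustrated without Airflow itself. The sketch below is not Airflow code and not the contents of `github_pipeline_dag.py`; it uses the standard library's `graphlib` to run four hypothetical tasks in dependency order, which is the same guarantee the scheduler provides. The task names are assumptions based on the pipeline stages described above.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# Task names are hypothetical; the real DAG lives in github_pipeline_dag.py.
PIPELINE_TASKS = {
    "extract_events": set(),
    "load_raw": {"extract_events"},
    "transform": {"load_raw"},
    "load_processed": {"transform"},
}


def run_pipeline(task_functions, dependencies=PIPELINE_TASKS):
    """Run each task once, in an order that respects the dependencies."""
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        task_functions[name]()
        executed.append(name)
    return executed
```

In an Airflow DAG file the same chain is usually written by joining operators with `>>`, and the scheduler adds retries, scheduling, and monitoring on top of the ordering.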
Raw events table (`raw_github_events`):

| Column | Type |
|---|---|
| id | SERIAL |
| github_event_id | VARCHAR |
| event_type | VARCHAR |
| repo_name | VARCHAR |
| payload | JSONB |
| created_at | TIMESTAMP |
| ingested_at | TIMESTAMP |

Ingestion metadata table:

| Column | Type |
|---|---|
| id | SERIAL |
| pipeline_name | VARCHAR |
| last_event_created_at | TIMESTAMP |

`repositories`:

| Column | Type |
|---|---|
| repo_id | SERIAL |
| repo_name | VARCHAR |
| url | TEXT |

`contributors`:

| Column | Type |
|---|---|
| contributor_id | SERIAL |
| username | VARCHAR |

`commits`:

| Column | Type |
|---|---|
| commit_id | VARCHAR |
| repo_name | VARCHAR |
| contributor_username | VARCHAR |
| commit_message | TEXT |
| commit_timestamp | TIMESTAMP |

`pull_requests`:

| Column | Type |
|---|---|
| pr_id | BIGINT |
| repo_name | VARCHAR |
| contributor_username | VARCHAR |
| pr_title | TEXT |
| state | VARCHAR |
| created_at | TIMESTAMP |
| closed_at | TIMESTAMP |

Install:
- Python 3.12+
- Docker
- Docker Compose
- Git
- PostgreSQL client tools (optional)
Clone the repository:

    git clone <repository_url>
    cd github-engineering-analytics-pipeline

Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate

Install dependencies:

    pip install -r requirements.txt

Create:

    .env

Add:

    GITHUB_TOKEN=your_github_token
    POSTGRES_USER=postgres
    POSTGRES_PASSWORD=postgres
    POSTGRES_DB=github_analytics
    POSTGRES_HOST=localhost
    POSTGRES_PORT=5432

Start PostgreSQL:

    docker compose up -d

Initialize the database:

    python -m app.load.init_db

Run the pipeline:

    python -m app.main

Build and start the Airflow services:

    docker compose -f docker-compose.airflow.yml up --build -d

Run the Airflow database migrations:

    docker compose -f docker-compose.airflow.yml run airflow-webserver airflow db migrate

Create an Airflow admin user:

    docker compose -f docker-compose.airflow.yml run airflow-webserver airflow users create \
      --username admin \
      --firstname admin \
      --lastname admin \
      --role Admin \
      --email admin@example.com \
      --password admin

Start Airflow:

    docker compose -f docker-compose.airflow.yml up -d

Open the Airflow UI:

    http://localhost:8090

Credentials:
Username: admin
Password: admin
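
The `POSTGRES_*` values from the `.env` file can be combined into a single connection URL of the kind SQLAlchemy accepts. The sketch below is one possible way to do that, assuming the variables are already present in the environment (for example after python-dotenv has loaded them). `build_postgres_url` is a hypothetical helper, not a function from this repository.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=None):
    """Build a postgresql:// URL from the POSTGRES_* variables."""
    env = os.environ if env is None else env
    # Percent-encode credentials so characters like '@' or ':' stay valid.
    user = quote_plus(env["POSTGRES_USER"])
    password = quote_plus(env["POSTGRES_PASSWORD"])
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    database = env["POSTGRES_DB"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

Accepting the environment as a parameter keeps the helper testable and avoids reading global state at import time.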

    -- Top 10 contributors by number of commits
    SELECT
    contributor_username,
    COUNT(*) AS total_commits
    FROM commits
    GROUP BY contributor_username
    ORDER BY total_commits DESC
    LIMIT 10;

    -- Raw event volume per repository
    SELECT
    repo_name,
    COUNT(*) AS total_events
    FROM raw_github_events
    GROUP BY repo_name
    ORDER BY total_events DESC;

    -- Pull request counts by state
    SELECT
    state,
    COUNT(*) AS total_prs
    FROM pull_requests
    GROUP BY state;

- ETL pipelines
- Incremental loading
- Idempotent ingestion
- Data warehousing
- Raw vs processed data layers
- Transformation modeling
- Metadata-driven pipelines
- Docker containerization
- Service orchestration
- Infrastructure debugging
- Dependency isolation
- Container networking
- Environment management
- Airflow DAGs
- Scheduling
- Retry policies
- Task execution
- Workflow monitoring
- Pipeline automation
During development, several real-world infrastructure and orchestration challenges were encountered and resolved.
- Airflow dependency conflicts
- SQLAlchemy compatibility issues
- Docker networking behavior
- Container permission management
- Airflow orchestration debugging
- Python module resolution
- Containerized environment isolation
These issues provided hands-on experience with production-style infrastructure troubleshooting.
When executed inside Airflow containers, the pipeline may require container-to-host PostgreSQL networking adjustments depending on the local Docker or WSL environment configuration.
The ETL pipeline itself executes successfully in the local environment, and Airflow orchestration is fully configured.
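
One way to handle the container-to-host case is to choose the PostgreSQL host name at runtime. The sketch below is an assumption about how that could be done, not the project's configuration code: `resolve_postgres_host` is a hypothetical helper, the `/.dockerenv` check is only a heuristic, and `host.docker.internal` resolves out of the box on Docker Desktop but needs an explicit host-gateway mapping on plain Linux.

```python
import os


def resolve_postgres_host(env=None, in_container=None):
    """Pick the PostgreSQL host name for the current runtime.

    A POSTGRES_HOST other than localhost is always respected. Inside a
    container, localhost points at the container itself, so the sketch
    falls back to host.docker.internal to reach a database on the host.
    """
    env = os.environ if env is None else env
    if in_container is None:
        in_container = os.path.exists("/.dockerenv")  # heuristic, not guaranteed
    configured = env.get("POSTGRES_HOST", "localhost")
    if in_container and configured in ("localhost", "127.0.0.1"):
        return "host.docker.internal"
    return configured
```

When PostgreSQL and Airflow share a Docker network, setting `POSTGRES_HOST` to the database service name avoids the host hop entirely.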

Possible future improvements:
- dbt integration
- Star schema warehouse modeling
- Kafka streaming ingestion
- AWS/GCP deployment
- Data quality testing
- CI/CD integration
- Grafana/Metabase dashboards
- Kubernetes deployment
- Prometheus monitoring
- Great Expectations validation
Add screenshots for:
- Airflow DAG UI
- PostgreSQL tables
- Successful pipeline execution
- Docker containers
- DAG execution logs
- Architecture diagram
Store screenshots inside:
screenshots/
This project demonstrates practical skills relevant for:
- Data Engineering Internships
- Junior Data Engineer roles
- Analytics Engineering roles
- DataOps roles
- Platform Engineering internships

Skills demonstrated:
- Python
- SQL
- PostgreSQL
- Apache Airflow
- Docker
- SQLAlchemy
- ETL Development
- Data Warehousing
- API Engineering
- Workflow Automation
- Infrastructure Debugging
- Data Modeling
This project provided hands-on experience with:
- Building production-style ETL pipelines
- Designing layered data architectures
- Implementing orchestration workflows
- Managing Dockerized infrastructure
- Debugging Airflow runtime issues
- Handling incremental ingestion logic
- Building analytics-ready warehouse layers
Renold
M.Tech Computer Science
Focused on:
- Data Engineering
- Cloud Infrastructure
- DevOps
- Analytics Engineering
- Data Platforms
This project is intended for educational, learning, and portfolio purposes.





