Open
36 commits
e14ab36
feat: add script to check repository metrics via GitHub GraphQL API
Sep 22, 2025
461c5fc
chore: update permissions for metrics_check.sh script
Sep 22, 2025
c6ad63d
fix: make bash script more verbose with detailed GraphQL queries, res…
Sep 22, 2025
f9a09d7
refactor: Update script to gather comprehensive repository metrics in…
Sep 22, 2025
8717274
feat: Rewrite metrics check script in Python with GraphQL API integra…
Sep 22, 2025
60e46dd
feat: add metrics calls
Sep 22, 2025
2f1dc4f
refactor: extract graphql queries and modularize metrics check functions
Sep 22, 2025
632fd14
refactor: separate data retrieval and presentation logic in metrics_c…
Sep 22, 2025
50da3bb
fix: convert output to newline-delimited JSON format
Sep 22, 2025
c0eb9f0
refactor: convert metrics output to individual JSONL files per metric
Sep 22, 2025
a8f5448
refactor: Implement pagination and one-year date filtering for all Gi…
Sep 22, 2025
eb4bbe4
refactor: extract pagination logic into generic helper function and r…
Sep 22, 2025
7768cbd
feat: add comprehensive logging to monitor script execution and perfo…
Sep 22, 2025
8e99e2b
refactor: Separate concerns into distinct modules for GitHub client, …
Sep 22, 2025
a372d04
feat: add YAML configuration support for GitHub client and update met…
Sep 22, 2025
ed7f754
feat: add requirements.txt file
Sep 22, 2025
13c85a0
fix: add PyYAML to requirements and handle import error gracefully
Sep 22, 2025
eabf74e
fix: use smaller test repo
Sep 22, 2025
dddd9ec
refactor: extract configuration management into separate class
Sep 22, 2025
632fb7c
feat: make config file name configurable via parameter
Sep 22, 2025
348e9a7
feat: make output file names and directory paths configurable
Sep 22, 2025
074ccd7
refactor: Make GitHub API configuration values configurable through C…
Sep 22, 2025
c8f0187
fix: Pass required configuration object to GitHubClient constructor
Sep 22, 2025
3287b39
refactor: restructure config.yaml to use nested repo and output sections
Sep 22, 2025
99760c8
feat: Add support for repo owner/name and config file from environmen…
Sep 22, 2025
e13fc20
feat: put output in folder by default
Sep 22, 2025
d5819a1
feat: refactor folder structure
Sep 22, 2025
b7171c4
docs: add comprehensive README with installation, configuration, and …
Sep 22, 2025
c156f95
docs: update README to clarify markdown file collection and Python ve…
Sep 22, 2025
9823f10
feat: convert into single README file
Sep 22, 2025
565e5eb
fix: move GITHUB_TOKEN access from GitHubClient to Configuration class
Sep 22, 2025
a322fc6
feat: add support for GITHUB_TOKEN_FILE environment variable to read …
Sep 22, 2025
a963479
feat: ignore .token file
Sep 22, 2025
11095fc
feat: add Dockerfile for project containerization
Sep 22, 2025
5807a67
feat: security
Sep 22, 2025
625f86a
docs: add Docker container usage instructions to README.md
Sep 22, 2025
12 changes: 12 additions & 0 deletions .gitignore
@@ -0,0 +1,12 @@
# local helpers and temp files
.local
.workspace
.aider*
.env
.evidence
__pycache__
*.jsonl
node_modules
venv
output
.token
19 changes: 19 additions & 0 deletions Dockerfile
@@ -0,0 +1,19 @@
FROM python:3.13-slim

WORKDIR /app

COPY requirements.txt .

# Run as non-root user
RUN useradd --create-home --shell /bin/bash appuser \
&& mkdir -p output \
&& chown -R appuser:appuser output

USER appuser

RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/
COPY config/ ./config/

ENTRYPOINT ["python", "./src/metrics_check.py"]
218 changes: 218 additions & 0 deletions README.md
@@ -1,3 +1,221 @@
# GitHub Repository Metrics Collector

This project collects various metrics from GitHub repositories using the GitHub GraphQL API and outputs them as JSONL files. It's designed to help analyze repository activity, contributors, issues, and other key metrics.

## Features

- Collects information on markdown files at repository root (e.g. README.md, LICENSE.md)
- Retrieves repository license information
- Gathers list of releases, with timestamps
- Tracks contributors and their most recent contribution dates
- Collects commit history
- Records issue information including creators and status
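Each metric above comes from a GraphQL query sent as a JSON POST body. The following is a minimal sketch of such a request for the license metric, not the project's actual query. It uses the standard-library `urllib` instead of the `requests` dependency so that it runs without installing anything; `LICENSE_QUERY` and `build_request` are hypothetical names chosen for illustration.

```python
import json
import urllib.request

# Hypothetical query: asks only for the repository's license name.
LICENSE_QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    licenseInfo { name }
  }
}
"""


def build_request(api_url: str, token: str, owner: str, name: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GraphQL POST request."""
    payload = json.dumps(
        {"query": LICENSE_QUERY, "variables": {"owner": owner, "name": name}}
    )
    return urllib.request.Request(
        api_url,
        data=payload.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(request, timeout=30)` would send it; the sketch stops before that step so no network access or real token is needed to try it.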

## Prerequisites

- Python 3
- A GitHub personal access token with appropriate repository read permissions

## Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd <repository-directory>
```

2. Install the required dependencies:
```bash
pip install -r requirements.txt
```

## Configuration

The program can be configured using both a YAML configuration file and environment variables. Environment variables take precedence over the configuration file.
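The precedence rule can be illustrated with a short sketch. It mirrors the `os.environ.get(...) or ...` pattern used for `REPO_OWNER` and `REPO_NAME` in `src/config.py`; the function name `resolve_repo` and the explicit `env` parameter are assumptions added to make the example testable.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_repo(
    file_config: Mapping[str, dict],
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (owner, name), preferring environment variables over the YAML values."""
    repo = file_config.get("repo", {})
    # An unset or empty environment variable falls back to the config file value.
    owner = env.get("REPO_OWNER") or repo.get("owner")
    name = env.get("REPO_NAME") or repo.get("name")
    return owner, name
```

Because the fallback uses `or`, an environment variable set to an empty string behaves as if it were unset.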

### Configuration File (config/config.yaml)

The default configuration file is located at `config/config.yaml`:

```yaml
repo:
  owner: "duckdb" # Repository owner/organization
  name: "duckdb-wasm" # Repository name

output:
  directory: "output" # Output directory for JSONL files
  root_md_files: "root_md_files.jsonl"
  license: "license.jsonl"
  releases: "releases.jsonl"
  contributors: "contributors.jsonl"
  commits: "commits.jsonl"
  issues: "issues.jsonl"

github_api_url: "https://api.github.com/graphql"
pagination_limit: 100 # Number of items per API request
date_range_days: 365 # How far back to collect data (in days)
request_timeout: 30 # API request timeout in seconds
```

### Environment Variables

You can override configuration values using environment variables:

- `GITHUB_TOKEN`: Your GitHub personal access token (required unless `GITHUB_TOKEN_FILE` is set)
- `GITHUB_TOKEN_FILE`: (Alternative to `GITHUB_TOKEN`) Path to a file containing your GitHub personal access token
- `CONFIG_FILE`: Path to custom configuration file (optional)
- `REPO_OWNER`: Override repository owner from config
- `REPO_NAME`: Override repository name from config

Note: Provide either `GITHUB_TOKEN` or `GITHUB_TOKEN_FILE`, but not both. If both are set, the program logs an error and exits.

To create a GitHub personal access token:

1. Go to GitHub Settings > Developer settings > Personal access tokens
2. Generate a new token with repo scope
3. Copy the token for use with this application

To use a token file:
1. Create a file containing only your GitHub token (leading and trailing whitespace, including a final newline, is stripped when the file is read)
2. Set the `GITHUB_TOKEN_FILE` environment variable to the path of this file
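
The token rules above can be summarized in a small sketch. It follows the same order of checks as `_get_github_token` in `src/config.py`, with two deliberate differences that are assumptions of this example: it raises `ValueError` instead of logging and exiting, and it takes the environment as an explicit `env` mapping. The name `resolve_token` is hypothetical.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_token(env: Mapping[str, str]) -> Optional[str]:
    """Pick the GitHub token from GITHUB_TOKEN or GITHUB_TOKEN_FILE, never both."""
    token = env.get("GITHUB_TOKEN")
    token_file = env.get("GITHUB_TOKEN_FILE")

    if token and token_file:
        raise ValueError("Set only one of GITHUB_TOKEN and GITHUB_TOKEN_FILE")
    if token:
        return token
    if token_file:
        # Surrounding whitespace (for example a trailing newline) is ignored.
        content = Path(token_file).read_text().strip()
        if not content:
            raise ValueError(f"Token file {token_file} is empty")
        return content
    return None  # No token source configured
```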

## Running the Program

### Running with Python

1. Set your GitHub token as an environment variable:
```bash
export GITHUB_TOKEN=your_github_token_here
```

OR set the path to a token file:
```bash
export GITHUB_TOKEN_FILE=/path/to/your/token/file
```

2. Run the metrics collection script:
```bash
python src/metrics_check.py
```

3. To use a custom configuration file:
```bash
export CONFIG_FILE=path/to/your/config.yaml
python src/metrics_check.py
```

4. To override repository owner/name:
```bash
export REPO_OWNER=your_owner
export REPO_NAME=your_repo
python src/metrics_check.py
```

### Running with Docker

You can run the application using the pre-built Docker image:

1. Pull the image:
```bash
docker pull codeberg.org/0xf1e/project-health-analyzer:latest
```

2. Run the container with your GitHub token:
```bash
docker run --rm \
-e GITHUB_TOKEN=your_github_token_here \
-v $(pwd)/output:/app/output \
codeberg.org/0xf1e/project-health-analyzer:latest
```

3. To use a token file:
```bash
docker run --rm \
-e GITHUB_TOKEN_FILE=/app/token.txt \
-v /path/to/your/token/file:/app/token.txt \
-v $(pwd)/output:/app/output \
codeberg.org/0xf1e/project-health-analyzer:latest
```

4. To use a custom configuration file:
```bash
docker run --rm \
-e GITHUB_TOKEN=your_github_token_here \
-v /path/to/your/config.yaml:/app/config/config.yaml \
-v $(pwd)/output:/app/output \
codeberg.org/0xf1e/project-health-analyzer:latest
```

5. To override repository owner/name:
```bash
docker run --rm \
-e GITHUB_TOKEN=your_github_token_here \
-e REPO_OWNER=your_owner \
-e REPO_NAME=your_repo \
-v $(pwd)/output:/app/output \
codeberg.org/0xf1e/project-health-analyzer:latest
```

## Output Files

All output files are saved in JSONL format (JSON Lines), with one JSON object per line. By default, files are saved to the `output/` directory.
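
For readers unfamiliar with JSONL, the sketch below shows one way to write and read such files with the standard library. The helper names `write_jsonl` and `read_jsonl` are hypothetical and are not taken from the project's source.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def write_jsonl(path: Path, records: Iterable[dict]) -> None:
    """Write one JSON object per line, creating the output directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Since every line is an independent JSON document, a file can be processed line by line without loading it fully into memory.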

### root_md_files.jsonl

Contains names of all markdown files in the repository root:

```json
{"file": "README.md"}
{"file": "CONTRIBUTING.md"}
```

### license.jsonl

Contains the repository license information:

```json
{"license": "MIT License"}
```

### releases.jsonl

Contains release information with timestamps:

```json
{"name": "v1.0.0", "publishedAt": "2023-01-15T10:30:00Z"}
```

### contributors.jsonl

Contains contributors with their most recent contribution date:

```json
{"login": "username", "last_contribution": "2023-05-20T14:22:30Z"}
```
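
How the script derives these records is not shown here. One possible approach, assuming each input record carries a `login` and an ISO 8601 `date` (a hypothetical input shape), is to keep the newest timestamp seen per login:

```python
from datetime import datetime
from typing import Dict, Iterable, List


def parse_timestamp(value: str) -> datetime:
    """Parse timestamps such as 2023-05-20T14:22:30Z into aware datetimes."""
    # Replacing the trailing Z keeps this working on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def last_contributions(commits: Iterable[dict]) -> List[dict]:
    """Reduce commit records to one record per login with the newest date."""
    latest: Dict[str, str] = {}
    for commit in commits:
        login, date = commit["login"], commit["date"]
        if login not in latest or parse_timestamp(date) > parse_timestamp(latest[login]):
            latest[login] = date
    return [
        {"login": login, "last_contribution": date}
        for login, date in sorted(latest.items())
    ]
```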

### commits.jsonl

Contains commit information:

```json
{"message": "Fix bug in parser", "date": "2023-05-19T09:15:00Z", "author": "Developer Name"}
```

### issues.jsonl

Contains issue information:

```json
{"title": "Bug in authentication", "state": "CLOSED", "author": "user123", "createdAt": "2023-04-10T16:45:00Z"}
```
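
As an illustration of consuming this file, the sketch below tallies issues by their `state` field. It is a hypothetical downstream helper, not part of the collector, and it accepts any iterable of JSONL lines (for example an open file).

```python
import json
from collections import Counter
from typing import Iterable


def count_issue_states(lines: Iterable[str]) -> Counter:
    """Count issues per state from JSONL lines, skipping blank lines."""
    return Counter(json.loads(line)["state"] for line in lines if line.strip())
```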

## Customization

You can modify the date range for data collection by changing the `date_range_days` value in the configuration. The default is 365 days (1 year).
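
A date range in days translates into a cutoff timestamp that can be passed to the API or compared against record dates. The sketch below is one way to compute it; `cutoff_timestamp` and its optional `now` parameter are assumptions for illustration, and the script's own calculation may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cutoff_timestamp(date_range_days: int, now: Optional[datetime] = None) -> str:
    """Return the UTC timestamp date_range_days before now, in ISO 8601 form."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=date_range_days)
    return cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")
```

With the default of 365 days and a reference time of 2025-09-22T00:00:00Z, the cutoff is 2024-09-22T00:00:00Z.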

The pagination limit can also be adjusted with `pagination_limit` to control how many items are fetched per API request.
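
GitHub's GraphQL API accepts at most 100 items per connection request, so larger result sets are fetched page by page using a cursor. The following is a generic sketch of that loop, not the project's helper: `paginate` and the `fetch_page` callable (which returns the page's items and the next cursor, or `None` when there are no more pages) are hypothetical.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page is (items, next_cursor); next_cursor is None on the last page.
Page = Tuple[List[dict], Optional[str]]


def paginate(
    fetch_page: Callable[[int, Optional[str]], Page],
    page_size: int,
) -> Iterator[dict]:
    """Yield items from every page, requesting page_size items per call."""
    cursor: Optional[str] = None
    while True:
        items, cursor = fetch_page(page_size, cursor)
        yield from items
        if cursor is None:
            break
```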
---

# 🚀 Health Analyzer PoC
> Reducing Risk in Open Source Adoption

17 changes: 17 additions & 0 deletions config/config.yaml
@@ -0,0 +1,17 @@
repo:
  owner: "duckdb"
  name: "duckdb-wasm"

output:
  directory: "output"
  root_md_files: "root_md_files.jsonl"
  license: "license.jsonl"
  releases: "releases.jsonl"
  contributors: "contributors.jsonl"
  commits: "commits.jsonl"
  issues: "issues.jsonl"

github_api_url: "https://api.github.com/graphql"
pagination_limit: 100
date_range_days: 365
request_timeout: 30
2 changes: 2 additions & 0 deletions requirements.txt
@@ -0,0 +1,2 @@
requests==2.32.5
PyYAML==6.0.2
109 changes: 109 additions & 0 deletions src/config.py
@@ -0,0 +1,109 @@
#!/usr/bin/env python3

import yaml
import sys
import logging
import os
from typing import Optional

logger = logging.getLogger(__name__)

class Configuration:
    def __init__(self, config_file: Optional[str] = None):
        # Get config file path from environment variable or use default
        self.config_file: str = config_file or os.environ.get('CONFIG_FILE', 'config/config.yaml')

        # Initialize default values
        self.owner: Optional[str] = None
        self.repo_name: Optional[str] = None
        self.output_dir: str = '.'
        self.root_md_files_output: str = 'root_md_files.jsonl'
        self.license_output: str = 'license.jsonl'
        self.releases_output: str = 'releases.jsonl'
        self.contributors_output: str = 'contributors.jsonl'
        self.commits_output: str = 'commits.jsonl'
        self.issues_output: str = 'issues.jsonl'

        # GitHub API configuration
        self.github_api_url: str = 'https://api.github.com/graphql'
        self.github_token: Optional[str] = None
        self.pagination_limit: int = 100
        self.date_range_days: int = 365
        self.request_timeout: int = 30

        self._load_config(self.config_file)

        # Override with environment variables if set
        self.owner = os.environ.get('REPO_OWNER') or self.owner
        self.repo_name = os.environ.get('REPO_NAME') or self.repo_name
        self.github_token = self._get_github_token()

    def _get_github_token(self) -> Optional[str]:
        """Get GitHub token from environment variables or config file"""
        github_token = self.github_token  # From config file

        # Check for GITHUB_TOKEN environment variable
        env_token = os.environ.get('GITHUB_TOKEN')

        # Check for GITHUB_TOKEN_FILE environment variable
        token_file_path = os.environ.get('GITHUB_TOKEN_FILE')

        # Validate that only one token source is used
        if env_token and token_file_path:
            logger.error("Both GITHUB_TOKEN and GITHUB_TOKEN_FILE environment variables are set. Please use only one.")
            sys.exit(1)

        # Priority: GITHUB_TOKEN env var > GITHUB_TOKEN_FILE env var > config file
        if env_token:
            return env_token
        elif token_file_path:
            try:
                with open(token_file_path, 'r') as f:
                    token = f.read().strip()
                if not token:
                    logger.error(f"Token file {token_file_path} is empty.")
                    sys.exit(1)
                return token
            except FileNotFoundError:
                logger.error(f"Token file {token_file_path} not found.")
                sys.exit(1)
            except Exception as e:
                logger.error(f"Error reading token file {token_file_path}: {e}")
                sys.exit(1)

        return github_token

    def _load_config(self, config_file: str) -> None:
        """Load configuration from YAML file"""
        try:
            with open(config_file, 'r') as f:
                config = yaml.safe_load(f)

            # Load repository configuration
            repo_config = config.get('repo', {})
            self.owner = repo_config.get('owner')
            self.repo_name = repo_config.get('name')

            # Load output configuration
            output_config = config.get('output', {})
            self.output_dir = output_config.get('directory', '.')
            self.root_md_files_output = output_config.get('root_md_files', 'root_md_files.jsonl')
            self.license_output = output_config.get('license', 'license.jsonl')
            self.releases_output = output_config.get('releases', 'releases.jsonl')
            self.contributors_output = output_config.get('contributors', 'contributors.jsonl')
            self.commits_output = output_config.get('commits', 'commits.jsonl')
            self.issues_output = output_config.get('issues', 'issues.jsonl')

            # Load GitHub API configuration
            self.github_api_url = config.get('github_api_url', 'https://api.github.com/graphql')
            self.github_token = config.get('github_token')  # Allow token in config file
            self.pagination_limit = config.get('pagination_limit', 100)
            self.date_range_days = config.get('date_range_days', 365)
            self.request_timeout = config.get('request_timeout', 30)

        except FileNotFoundError:
            logger.error(f"{config_file} file not found.")
            sys.exit(1)
        except Exception as e:
            logger.error(f"Error parsing {config_file}: {e}")
            sys.exit(1)