This repository contains a backend-only scanner that:
- Starts from a hackathon URL (gallery or event page).
- Discovers project detail pages (gallery-first, with heuristics for sites like Devfolio).
- Extracts GitHub repository links from each project page.
- Fetches repository metadata: README, commit timeline and a file/folder tree.
- Calls an LLM to evaluate the README and separately to evaluate repository folder organization.
- Emits a clean JSON results file per run under results/.
The project is intended as a server-side tool (no frontend) for programmatic audit/triage of hackathon submissions.
- Playwright-based rendering for JS-heavy gallery pages (app/renderer.py).
- Gallery-first discovery and site heuristics for Devpost/Devfolio (scripts/run_eval_from_gallery.py).
- GitHub helpers to fetch README, list commits, and retrieve repo tree (app/github.py).
- LLM adapters supporting Groq (OpenAI-compatible) and Hugging Face (app/llm.py).
- Structure evaluation (folder-organization scoring) via LLM (evaluate_structure).
- Output: per-repo JSON records with README and structure evaluations.
- Python 3.10 or newer
- Install runtime deps:
python -m pip install -r requirements.txt
Create a .env (copy .env.example) and set the provider credentials you plan to use.
- EVAL_PROVIDER (optional): groq or hf (helps selection). Defaults to automatic detection.
- EVAL_API_URL (for Groq/OpenAI-compatible endpoints): e.g. https://api.groq.com/openai/v1/chat/completions.
- EVAL_API_KEY or GROQ_API_KEY: API key for the eval provider.
- EVAL_MODEL: model id, e.g. llama-3.1-8b-instant (Groq) or hf:google/flan-t5-small (HF prefix).
- HF_API_TOKEN: Hugging Face inference token (if using HF).
- GITHUB_TOKEN (optional): GitHub personal access token to increase API rate limits and access private repos you own.
- RENDER_JS: set to 1 to enable Playwright rendering for pages that require JS.
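As an illustration of how these variables might interact, here is a minimal sketch of provider selection. The function name resolve_provider and the fallback order are assumptions made for this example; the actual "automatic detection" in app/llm.py may work differently.

```python
import os
from typing import Mapping, Optional


def resolve_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick 'groq' or 'hf' from the documented environment variables.

    Hypothetical helper: the fallback order below is an assumption,
    not the project's confirmed detection logic.
    """
    env = os.environ if env is None else env

    # An explicit EVAL_PROVIDER wins when it names a known provider.
    explicit = env.get("EVAL_PROVIDER", "").strip().lower()
    if explicit in ("groq", "hf"):
        return explicit

    # Assumed heuristic: an "hf:" model prefix implies Hugging Face.
    if env.get("EVAL_MODEL", "").startswith("hf:"):
        return "hf"

    # Otherwise fall back to whichever credentials are present.
    if env.get("EVAL_API_KEY") or env.get("GROQ_API_KEY"):
        return "groq"
    if env.get("HF_API_TOKEN"):
        return "hf"

    raise RuntimeError("No eval provider credentials found in the environment")
```

Passing a plain dict instead of relying on os.environ keeps the sketch easy to exercise without touching the real environment.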
Important notes about rate limits:
- Provider rate limits (tokens-per-minute / requests) can cause structure_evaluation or evaluate_readme to return an error object in the JSON. If you see rate_limit_exceeded, reduce parallelism, add retries/backoff, or upgrade your provider plan.
- GitHub unauthenticated requests are also rate-limited; set GITHUB_TOKEN to increase limits.
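The retries/backoff suggestion could look roughly like the sketch below. It assumes an evaluation call returns a dict and signals throttling as {"error": {"code": "rate_limit_exceeded", ...}}; that shape is inferred from the description above, and call_with_backoff and is_rate_limited are hypothetical helpers, not functions from this repository.

```python
import random
import time
from typing import Any, Callable, Dict


def is_rate_limited(result: Dict[str, Any]) -> bool:
    """True if the result carries an error object with code rate_limit_exceeded.

    The error shape is an assumption based on the README's description.
    """
    error = result.get("error")
    return isinstance(error, dict) and error.get("code") == "rate_limit_exceeded"


def call_with_backoff(
    call: Callable[[], Dict[str, Any]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Invoke `call`, retrying with exponential backoff while it is rate-limited.

    Makes at most `max_attempts` calls and returns the last result,
    which may still be a rate-limit error if every attempt was throttled.
    """
    result = call()
    for attempt in range(1, max_attempts):
        if not is_rate_limited(result):
            break
        # Exponential delay (1x, 2x, 4x, ...) plus up to 25% random jitter.
        delay = base_delay * (2 ** (attempt - 1))
        sleep(delay + random.uniform(0, delay * 0.25))
        result = call()
    return result
```

The sleep parameter is injectable so the retry logic can be tested without real waiting; in normal use the default time.sleep applies.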
- Quick one-repo structure test (debug):
$env:PYTHONPATH='.'
$env:EVAL_API_URL='https://api.groq.com/openai/v1/chat/completions'
$env:EVAL_API_KEY='your_key_here'
python .\scripts\debug_structure_eval.py Ananya-R2004 E-Gram-Panchayat
- Run a gallery-first scan (Devfolio/Devpost style) and write clean JSON results:
$env:PYTHONPATH='.'
$env:RENDER_JS='1' # enable Playwright rendering
$env:EVAL_API_URL='https://api.groq.com/openai/v1/chat/completions'
python .\scripts\run_eval_from_gallery.py "https://bruteforce.devfolio.co/overview" --mode full --max 50 --out results\my_run.json
- Test README-only evaluation for a single repo:
$env:PYTHONPATH='.'
$env:EVAL_API_KEY='your_key_here'
python .\scripts\run_one_eval.py
- Enrich saved results with commit-based metadata (human summary, last commit dates):
python .\scripts\enrich_results.py --input results\my_run.json
Each run writes a JSON array where each element is an object with fields similar to:
- repo: canonical GitHub URL (https://github.com/{owner}/{repo})
- raw_repo_url: original link discovered on the project page
- readme_length: character count of the fetched README (0 if not found)
- commits: object with exists, commit_count, pre_cutoff_commits (list), repo_created_at, error
- evaluation: LLM JSON evaluating the README (see app/llm.evaluate_readme schema)
- structure_evaluation: LLM JSON evaluating folder structure (see app/llm.evaluate_structure) or an error object if provider failed
- repo_tree_meta: { count: number_of_paths, error: optional }
- source_project: the project page URL where the repo was discovered
Example entry (abridged):
{
"repo": "https://github.com/example/repo",
"raw_repo_url": "https://github.com/example/repo/tree/main",
"readme_length": 1200,
"commits": { ... },
"evaluation": { ... },
"structure_evaluation": { ... },
"repo_tree_meta": { "count": 120 }
}
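A results file in this shape can be triaged with a few lines of Python. The sketch below is not part of the repository; summarize and summarize_file are hypothetical helpers, and they assume a failed structure evaluation appears as an object containing an "error" key, as described in the field list.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def summarize(records: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count entries, missing READMEs, and failed structure evaluations."""
    summary = {"total": 0, "missing_readme": 0, "structure_errors": 0}
    for rec in records:
        summary["total"] += 1
        # readme_length is documented as 0 when no README was found.
        if rec.get("readme_length", 0) == 0:
            summary["missing_readme"] += 1
        # Assumed shape: a failed evaluation is a dict with an "error" key.
        structure = rec.get("structure_evaluation")
        if isinstance(structure, dict) and "error" in structure:
            summary["structure_errors"] += 1
    return summary


def summarize_file(path: str) -> Dict[str, int]:
    """Load a results JSON array from disk and summarize it."""
    return summarize(json.loads(Path(path).read_text(encoding="utf-8")))
```

For example, summarize_file("results/my_run.json") would report how many entries in that run need a retry because the provider failed.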
- Rate limits from the eval provider (Groq/OpenAI/HF):
  - Symptoms: structure_evaluation contains an error object with code: "rate_limit_exceeded" and a message indicating TPM limits.
  - Fixes: throttle LLM calls, add retries/exponential backoff, reduce prompt size (fewer paths), or upgrade the provider plan.
- GitHub 404s or missing READMEs:
  - Ensure canonicalize_github_url normalized URLs are used (the orchestrator already canonicalizes links).
  - Set GITHUB_TOKEN to improve rate limits and access private repos you own.
- Playwright hangs/timeouts:
  - Set RENDER_JS=1 only when needed. Increase timeouts in the renderer if pages are slow.
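To make the canonicalization step concrete, here is a minimal sketch of the kind of normalization that turns a discovered link such as https://github.com/example/repo/tree/main into https://github.com/{owner}/{repo}. It is an illustration only: canonicalize_github_url_sketch is a hypothetical stand-in, and the real canonicalize_github_url in app/github.py may handle more cases.

```python
from typing import Optional
from urllib.parse import urlparse


def canonicalize_github_url_sketch(url: str) -> Optional[str]:
    """Reduce a GitHub link to https://github.com/{owner}/{repo}.

    Returns None for non-GitHub hosts or links without an owner and repo.
    Hypothetical stand-in for the project's own canonicalizer.
    """
    parsed = urlparse(url.strip())
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None

    # Keep only the first two path segments: owner and repository name.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return None
    owner, repo = parts[0], parts[1]

    # Clone URLs often end in ".git"; strip it for the canonical form.
    if repo.endswith(".git"):
        repo = repo[:-4]
    if not repo:
        return None
    return f"https://github.com/{owner}/{repo}"
```

Dropping everything after the repository segment is what removes suffixes like /tree/main or /blob/..., which would otherwise produce 404s against repository-level API endpoints.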
- app/renderer.py: Playwright helpers used to render pages and extract anchors.
- app/scraper.py: gallery-first discovery and heuristics.
- app/github.py: README fetch, commit listing, repo tree fetch, canonicalization utilities.
- app/llm.py: evaluate_readme, evaluate_structure, extract_repos_from_text. Supports Groq (OpenAI-compatible endpoint) and Hugging Face inference paths.
- scripts/run_eval_from_gallery.py: CLI orchestrator that discovers projects and evaluates them. Use --out to save a clean JSON file.
- Add robust retry/backoff and rate-limit-aware throttling for LLM calls (already implemented in later branches).
- Add CSV or summary export for quick human review.
- Add unit tests for canonicalize_github_url and the GitHub helpers.
- Add a lightweight local heuristic fallback for structure scoring to reduce LLM calls.