mega-scraping script to retrieve 600k+ teachers and ranking them based on hacker compatibility. goes from a school's website -> all teachers.
to get started:
git clone <this repo> schoolyank
cd schoolyank
bun install
python3 -m pip install -r requirements.txt
browser-use install
cp .env.example .env
bun index.ts --preflight
bun index.tsfirst run launches an interactive setup wizard: you paste an openrouter key, pick a model, optionally let it spin up an exa account in a disposable browser, then it drops the keys into .env and kicks off the scrape.
what it does:
- checks for apptegy/finalsite shortcuts first (hasura + thrillshare) before spending tokens on the agent
- spins up the local browser-use runner with a structured prompt to crawl staff directories, tabs, filters, pdf links, whatever it takes
- normalizes the agent output (name parsing, email cleanup, dedupe, role scrub, email pattern inference) without dropping non-stem teachers anymore
- resolves the school in the urban institute nces api for verified addresses + phone numbers and logs why if the match fails
- pings dns mx + smtp rcpt to for every email; nukes addresses the server rejects, keeps inconclusive ones untouched
- writes clean csvs to
output/and, in batch mode, a mergedoutput/all.csv - pushes rows into postgres when
DATABASE_URLis set so downstream jobs can keep going even if the cli dies - streams the play-by-play into slack if
SLACK_BOT_TOKEN+SLACK_CHANNEL_IDshow up so you can watch the chaos from your phone
interactive mode (bun index.ts) is the guided flow: asks for one url and prints output.
--url https://example.comscrape one site with no prompts (repeatable)- bare positional urls work too:
bun index.ts https://a.edu https://b.org --urls-file urls.txtload one url per line (lines starting with#ignored)- shorthand
@schools_with_staff_urls.csvis the same as--schools-csv --output path.csvchoose the filename when you pass exactly one url--merged-output batch.csvoverride the merged csv path (defaultoutput/all.csv)--concurrency N/-j Nadjust parallel browsers (default 6);--maxdoubles cores up toSCHOOLYANK_MAX_CONCURRENCY--forcere-scrape even if the csv already exists or another worker claimed it--interactiveforce the cute cli even when you provided urls--debugfirehose every agent/nces/email log line to stderr--preflightrun environment + dependency checks and exit (same asbun index.ts --preflight)- every batch run writes progress to
output/status.csv(setSTATUS_CSV_PATHto move it)
minimal requirement: an OpenRouter key.
OPENROUTER_API_KEY(required) + optionallyAI_MODEL,AI_BASE_URLEXA_API_KEY(optional) powers the ranking/enrichment scripts; the main scrape currently runs without it and falls back to DDG when those scripts need web searchSLACK_BOT_TOKEN,SLACK_CHANNEL_ID,SLACK_ALERT_USER(optional) enable slack threading + error pingsDATABASE_URL(optional) turns on the postgres sink; setREQUIRE_DATABASE=trueif csv-only mode should hard fail- concurrency and startup knobs live behind envs:
SCHOOLYANK_MAX_CONCURRENCY,SCHOOLYANK_MAX_SCRAPE_ATTEMPTS,SCHOOLYANK_MAX_RATE_LIMIT_ATTEMPTS,SCHOOLYANK_BROWSER_START_* - logging: set
LOG_FILEto capture a rolling log,STATUS_CSV_PATHto relocate the status tracker - docker compose uses the same envs;
docker-compose upbrings up postgres + the scraper worker if you like running this in a box
- default per-school csv:
output/<slugified-domain>.csv - schools csv mode writes to
schools/<STATE>/<city>/<Hs ID>.csvso nothing collides - merged batch csv lands at
output/all.csvunless you override it - status tracker lives at
output/status.csv(per run) - when postgres is configured the upsert target is
public.extracted_teachers - log files go wherever
LOG_FILEsays; otherwise stdout/stderr is all you get
- no staff directory → no teachers; portals that hide everything behind a parent login still beat the agent
- smtp validation needs port 25 — most home ISPs block it, so expect “inconclusive” spam in that case
- captchas, sso walls, and “enter your student id” forms still end the ride
- directories that obfuscate emails with weirdo javascript (other than finalsite) require manual cleanup
- output counts can wobble by a few teachers run-to-run because the agent explores in a slightly different order every time
- long-running batches should stay under whatever your computer can cool;
--maxwill happily light all cores
bun index.ts --helpprints the full flag bible with prettier formattingbun run preflightstandalone env check (same as the--preflightflag)bun run statsquick aggregate counts over everything inschools/bun run clean:teachersre-validatesdist/teachers.csvin chunks with OpenAI web search (setOPENAI_API_KEYfirst)bun run rankscores teachers out of 100, pulls web presence (Exa/DDG), and emits ranked csvs per schoolbun run typecheckkeeps the bun + typescript types honest- python helpers live in
scripts/for bulk csv surgery if you enjoy command-line archaeology
thats it. go spam some teachers!! (legally) and send them something nice.
this is a vastly improved version of https://github.com/Hex-4/schoolyank by @Hex-4 thank you for the direction!
