This package supports two transparent workflows:
- Reproduce all main and appendix exhibits from assembled data (no API calls).
- Run Stage 1-4 extraction on new papers using your own API key.
No keys are stored in this package. Set credentials only through environment variables.
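Scripts that call the API read the key from the environment at runtime. A minimal sketch of that pattern (the `get_api_key` helper name is illustrative, not part of the package):

```python
import os

def get_api_key() -> str:
    """Read the API key from the environment; never hardcode it in tracked files."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Export it in your shell before running "
            "the extraction scripts; this package never stores keys on disk."
        )
    return key
```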
- `configs/`: reproducible configuration files.
- `analysis_data/`: downloadable analysis-ready datasets and table outputs.
- `prompts/`: Stage 1/2 prompt text used in retrieval.
- `schemas/`: JSON response schemas for structured model outputs.
- `scripts/`: public runners (full reproduction, extraction pipeline, safety audit).
- `demo_input/`: place user paper text/PDF inputs here.
- `demo_output/`: created at runtime.
Use this when you want all manuscript figures/tables without rerunning extraction.
- Python 3.10+
- R 4.3+ (only needed for R-based helper scripts)
- The assembled project data in `int_data/`
- A citation file in `int_data/` with columns `paper_id` and `coalesced_cites` (for citation-based exhibits)
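Assuming the citation file is CSV-formatted (its on-disk format is not specified above), a quick header check for the two required columns can catch a mismatched file early; `check_citation_header` is an illustrative helper, not part of the package:

```python
import csv
import io

REQUIRED_COLUMNS = {"paper_id", "coalesced_cites"}

def check_citation_header(fileobj) -> set:
    """Return the set of required columns missing from a citation file's header."""
    reader = csv.DictReader(fileobj)
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])

# Example with an in-memory file standing in for the real citation file:
sample = io.StringIO("paper_id,coalesced_cites\np1,12\n")
```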
Install dependencies:

```bash
pip install -r requirements_full.txt
```

From the `repro_kit/` root:

```bash
python scripts/run_full_reproduction.py
```

Optional flags:

- `--skip-validation` skips the Brodeur/Plausibly validation rebuild.
- `--allow-missing` skips missing helper scripts and continues.
- `--run-audit` runs the public-package safety audit at the end.
- `--rscript-exe "<path-to-Rscript>"` sets a custom Rscript binary.
scripts/run_full_reproduction.py executes helper scripts that are present in scripts/ and reports any missing modules explicitly.
Expected helper script names for assembled-data builds:
- `build_method_figures.R`
- `build_edge_overlap_figures.R`
- `build_core_figures.py`
- `build_publication_predictor_figures.py`
- `validate_brodeur.R` (unless `--skip-validation`)
- `validate_exogenous_benchmark.py` (unless `--skip-validation`)
- `build_validation_tables.R` (unless `--skip-validation`)
If you only want analysis datasets (without running extraction), use:
- `analysis_data/tables/` for ready-to-use output tables.
- `analysis_data/core/graph_aggregated/` for split aggregated graph shards.
- `analysis_data/core/graph_runs/` for nine run-level graph files.
- `analysis_data/core/paper_level/` for paper-level analysis inputs.
- `analysis_data/core/benchmarks/` for benchmark datasets.
All packaged files are below 25 MiB.
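The size limit can be verified locally with a short stdlib check; this `oversized_files` helper is an illustrative sketch, not part of the package:

```python
from pathlib import Path

LIMIT_BYTES = 25 * 1024 * 1024  # 25 MiB packaging limit

def oversized_files(root: str) -> list:
    """List files under `root` whose size meets or exceeds the 25 MiB limit."""
    return [
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_size >= LIMIT_BYTES
    ]
```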
Quick start:
```bash
python scripts/materialize_analysis_data.py --root .
```

If you only want rebuilt graph parquet files (without full materialization), run:

```bash
python scripts/join_graph_data_parts.py --analysis-data-dir analysis_data --output-dir int_data
```

This rebuilds:

- `int_data/claim_graph_all_nine_iter_union_aggregated_meta.parquet` from the split aggregated shards.
- `int_data/claim_graph_runs_all.parquet` from the nine run-level files.
The materialization step stages `analysis_data/` into the expected runtime locations (`int_data/`, `results/tables/`, benchmark paths), so all build scripts can run directly.
To generate checksums and sizes:

```bash
python scripts/build_analysis_data_manifest.py
```

Use this when you want to run the retrieval pipeline on your own paper(s).
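For intuition, the per-file record such a manifest builder emits can be sketched with the stdlib; the `manifest_entry` helper and its output shape are assumptions for illustration, not the script's actual format:

```python
import hashlib
from pathlib import Path

def manifest_entry(path: str) -> dict:
    """Return a SHA-256 checksum and byte size for one packaged file."""
    data = Path(path).read_bytes()
    return {
        "path": path,
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }
```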
Install dependencies:

```bash
pip install -r requirements.txt
```

Option A (TXT already available):

- Place one or more `.txt` files in `demo_input/`.

Option B (PDF input):

```bash
python scripts/convert_pdf_to_text.py \
  --input demo_input \
  --output demo_input \
  --parser auto \
  --first-pages 30
```

Then:

- Edit `configs/config.yaml`.
- For a fresh config template, copy from `configs/config.example.yaml`.
PowerShell:

```powershell
$env:OPENAI_API_KEY="YOUR_KEY_HERE"
```

Bash:

```bash
export OPENAI_API_KEY="YOUR_KEY_HERE"
```

Dry run (build JSONL requests only):

```bash
python scripts/run_extraction_pipeline.py \
  --config configs/config.yaml
```

Live API execution:

```bash
python scripts/run_extraction_pipeline.py \
  --config configs/config.yaml \
  --execute \
  --with-snippet \
  --with-stage3
```

Outputs are written to `demo_output/` (or your configured output path), including:

- `stage1_outputs.jsonl`
- `stage2_edges_raw.jsonl`
- `stage2_snippet_edges_raw.jsonl` (if enabled)
- `stage3_edges_with_jel.jsonl` (if enabled)
- `stage4_edge_overlap_counts.csv`
- `stage4_edges_eo_ge*.csv`
- `stage4_paper_level_summary.csv`
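The `.jsonl` outputs above are JSON Lines files (one JSON object per line). They can be inspected without the pipeline using only the stdlib; the `read_jsonl` helper is an illustrative sketch, not part of the package:

```python
import json

def read_jsonl(path: str) -> list:
    """Load a JSON Lines file (one JSON object per line) into a list of dicts."""
    records = []
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```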
Run this before release:
```bash
python scripts/audit_public_package.py
```

It fails if it detects:
- hardcoded API key literals,
- explicit key assignments in files,
- internal/private-note markers,
- legacy internal version labels in public-facing content.
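A minimal sketch of the kind of pattern scan such an audit performs; the patterns below are illustrative assumptions, not the script's actual rules:

```python
import re

# Illustrative patterns only; the real audit script's rules are not reproduced here.
SUSPECT_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),               # OpenAI-style key literal
    re.compile(r"API_KEY\s*=\s*['\"][^'\"]+"),        # explicit key assignment
    re.compile(r"INTERNAL[-_ ]ONLY", re.IGNORECASE),  # private-note marker
]

def find_violations(text: str) -> list:
    """Return every suspect substring found in `text`."""
    hits = []
    for pattern in SUSPECT_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits
```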
- This package is designed so users can skip extraction and reproduce exhibits directly from assembled data.
- API usage costs and latency depend on model choice, token usage, and iteration counts in `configs/config.yaml`.
- Keep credentials out of all tracked files; use environment variables only.