This package supports two transparent workflows:
- Reproduce all main and appendix exhibits from assembled data (no API calls).
- Run Stage 1-4 extraction on new papers using your own API key.
No keys are stored in this package. Set credentials only through environment variables.
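Scripts that call the API read the key from the environment at runtime. A minimal sketch of that pattern (the `get_api_key` helper name is illustrative, not part of the package):

```python
import os

def get_api_key() -> str:
    """Read the API key from the environment; never hardcode it in tracked files."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Export it in your shell before running "
            "the extraction scripts; this package never stores keys on disk."
        )
    return key
```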
- `configs/`: reproducible configuration files.
- `analysis_data/`: downloadable analysis-ready datasets and table outputs.
- `prompts/`: Stage 1/2 prompt text used in retrieval.
- `schemas/`: JSON response schemas for structured model outputs.
- `scripts/`: public runners (full reproduction, extraction pipeline, safety audit).
- `demo_input/`: place user paper text/PDF inputs here.
- `demo_output/`: created at runtime.
Use this when you want all manuscript figures/tables without rerunning extraction.
- Python 3.10+
- R 4.3+ (only needed for R-based helper scripts)
- The assembled project data in `int_data/`
- A citation file in `int_data/` with columns `paper_id` and `coalesced_cites` (for citation-based exhibits)
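Assuming the citation file is CSV-formatted (its on-disk format is not specified above), a quick header check for the two required columns can catch a mismatched file early; `check_citation_header` is an illustrative helper, not part of the package:

```python
import csv
import io

REQUIRED_COLUMNS = {"paper_id", "coalesced_cites"}

def check_citation_header(fileobj) -> set:
    """Return the set of required columns missing from a citation file's header."""
    reader = csv.DictReader(fileobj)
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])

# Example with an in-memory file standing in for the real citation file:
sample = io.StringIO("paper_id,coalesced_cites\np1,12\n")
```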
Install dependencies:

```bash
pip install -r requirements_full.txt
```

From the `repro_kit/` root:

```bash
python scripts/run_full_reproduction.py
```

Optional flags:

- `--skip-validation` skips the Brodeur/Plausibly validation rebuild.
- `--allow-missing` skips missing helper scripts and continues.
- `--run-audit` runs the public-package safety audit at the end.
- `--rscript-exe "<path-to-Rscript>"` sets a custom Rscript binary.
scripts/run_full_reproduction.py executes helper scripts that are present in scripts/ and reports any missing modules explicitly.
Expected helper script names for assembled-data builds:
- `build_method_figures.R`
- `build_edge_overlap_figures.R`
- `build_core_figures.py`
- `build_publication_predictor_figures.py`
- `validate_brodeur.R` (unless `--skip-validation`)
- `validate_exogenous_benchmark.py` (unless `--skip-validation`)
- `build_validation_tables.R` (unless `--skip-validation`)
If you only want analysis datasets (without running extraction), use:
- `analysis_data/tables/` for ready-to-use output tables.
- `analysis_data/core/graph_aggregated/` for split aggregated graph shards.
- `analysis_data/core/graph_runs/` for nine run-level graph files.
- `analysis_data/core/paper_level/` for paper-level analysis inputs.
- `analysis_data/core/benchmarks/` for benchmark datasets.
All packaged files are below 25 MiB.
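The size limit can be verified locally with a short stdlib check; this `oversized_files` helper is an illustrative sketch, not part of the package:

```python
from pathlib import Path

LIMIT_BYTES = 25 * 1024 * 1024  # 25 MiB packaging limit

def oversized_files(root: str) -> list:
    """List files under `root` whose size meets or exceeds the 25 MiB limit."""
    return [
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_size >= LIMIT_BYTES
    ]
```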
Quick start:
```bash
python scripts/materialize_analysis_data.py --root .
```

If you only want rebuilt graph parquet files (without full materialization), run:

```bash
python scripts/join_graph_data_parts.py --analysis-data-dir analysis_data --output-dir int_data
```

This rebuilds:

- `int_data/claim_graph_all_nine_iter_union_aggregated_meta.parquet` from the split aggregated shards.
- `int_data/claim_graph_runs_all.parquet` from the nine run-level files.
The materialization step stages `analysis_data/` into the expected runtime locations (`int_data/`, `results/tables/`, benchmark paths), so all build scripts can run directly.
To generate checksums and sizes:

```bash
python scripts/build_analysis_data_manifest.py
```

Use this when you want to run the retrieval pipeline on your own paper(s).
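For intuition, the per-file record such a manifest builder emits can be sketched with the stdlib; the `manifest_entry` helper and its output shape are assumptions for illustration, not the script's actual format:

```python
import hashlib
from pathlib import Path

def manifest_entry(path: str) -> dict:
    """Return a SHA-256 checksum and byte size for one packaged file."""
    data = Path(path).read_bytes()
    return {
        "path": path,
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }
```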
Install dependencies:

```bash
pip install -r requirements.txt
```

Option A (TXT already available):

- Place one or more `.txt` files in `demo_input/`.

Option B (PDF input):

```bash
python scripts/convert_pdf_to_text.py \
  --input demo_input \
  --output demo_input \
  --parser auto \
  --first-pages 30
```

Then:

- Edit `configs/config.yaml`.
- For a fresh config template, copy from `configs/config.example.yaml`.
PowerShell:

```powershell
$env:OPENAI_API_KEY="YOUR_KEY_HERE"
```

Bash:

```bash
export OPENAI_API_KEY="YOUR_KEY_HERE"
```

Dry run (build JSONL requests only):

```bash
python scripts/run_extraction_pipeline.py \
  --config configs/config.yaml
```

Live API execution:

```bash
python scripts/run_extraction_pipeline.py \
  --config configs/config.yaml \
  --execute \
  --with-snippet \
  --with-stage3
```

Outputs are written to `demo_output/` (or your configured output path), including:

- `stage1_outputs.jsonl`
- `stage2_edges_raw.jsonl`
- `stage2_snippet_edges_raw.jsonl` (if enabled)
- `stage3_edges_with_jel.jsonl` (if enabled)
- `stage4_edge_overlap_counts.csv`
- `stage4_edges_eo_ge*.csv`
- `stage4_paper_level_summary.csv`
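The `.jsonl` outputs above are JSON Lines files (one JSON object per line). They can be inspected without the pipeline using only the stdlib; the `read_jsonl` helper is an illustrative sketch, not part of the package:

```python
import json

def read_jsonl(path: str) -> list:
    """Load a JSON Lines file (one JSON object per line) into a list of dicts."""
    records = []
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```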
Run this before release:
```bash
python scripts/audit_public_package.py
```

It fails if it detects:
- hardcoded API key literals,
- explicit key assignments in files,
- internal/private-note markers,
- legacy internal version labels in public-facing content.
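A minimal sketch of the kind of pattern scan such an audit performs; the patterns below are illustrative assumptions, not the script's actual rules:

```python
import re

# Illustrative patterns only; the real audit script's rules are not reproduced here.
SUSPECT_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),               # OpenAI-style key literal
    re.compile(r"API_KEY\s*=\s*['\"][^'\"]+"),        # explicit key assignment
    re.compile(r"INTERNAL[-_ ]ONLY", re.IGNORECASE),  # private-note marker
]

def find_violations(text: str) -> list:
    """Return every suspect substring found in `text`."""
    hits = []
    for pattern in SUSPECT_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits
```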
- This package is designed so users can skip extraction and reproduce exhibits directly from assembled data.
- API usage costs and latency depend on model choice, token usage, and iteration counts in `configs/config.yaml`.
- Keep credentials out of all tracked files; use environment variables only.