Safe, hybrid AI pipeline for systematic reviews and meta-analyses. Automation with human-in-the-loop.
Doing a systematic review means spending days on deduplication and title/abstract screening before you even touch the science. This toolkit automates those steps using Claude while keeping you in control of the decisions that matter.
Built from real SR/MA work in pediatric surgery. Tested on 500+ records across multiple projects. This is not "end-to-end automation." This is acceleration with human-in-the-loop, explicitly designed to prevent AI hallucinations from corrupting your data extraction and meta-analysis.
```
Database exports (PubMed / Scopus / Embase)
        │
        ▼
1. merge_csvs()      → combine multiple exports into one DataFrame
2. deduplicate()     → DOI-exact + title-fuzzy match (SequenceMatcher ≥ 0.90)
3. screen_records()  → batch LLM screening against your PICO criteria
4. generate_prisma() → PRISMA 2020-compliant flow report
        │
        ▼
artifacts/
  screening_results.csv → all records with decision / confidence / reason
  prisma_report.md      → flow numbers ready to paste into your paper
  dedup.csv             → post-deduplication records
```
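The DOI-exact plus title-fuzzy logic of step 2 can be sketched with just the standard library. The function below is a hypothetical illustration of the idea, not the package's actual implementation:

```python
from difflib import SequenceMatcher

def is_duplicate(rec_a, rec_b, title_threshold=0.90):
    """Two-stage duplicate check: exact DOI match, then fuzzy title match."""
    doi_a = rec_a.get("DOI", "").strip().lower()
    doi_b = rec_b.get("DOI", "").strip().lower()
    if doi_a and doi_a == doi_b:
        return True  # stage 1: identical DOIs are always duplicates
    title_a = rec_a.get("Title", "").strip().lower()
    title_b = rec_b.get("Title", "").strip().lower()
    # stage 2: SequenceMatcher ratio over normalized titles
    return SequenceMatcher(None, title_a, title_b).ratio() >= title_threshold
```

Checking DOIs first matters: two databases often export slightly different titles for the same paper, but a shared DOI settles the question immediately.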
The screener uses Claude Haiku by default (fast, cheap — ~$0.03 per 100 records). Uncertain records can be re-run with Sonnet for a second opinion.
```bash
pip install sr-pipeline
export ANTHROPIC_API_KEY=sk-ant-...
```

```python
from srma.screening import run_pipeline

results = run_pipeline(
    project_dir="./my_review",  # must contain raw/ folder with exported CSVs
    inclusion=[
        "Original clinical study (RCT, cohort, or case series)",
        "Pediatric patients aged 0–18 years",
        "Diagnosis of anorectal malformation confirmed",
        "Reports at least one functional outcome",
    ],
    exclusion=[
        "Animal or in vitro studies",
        "Case reports (n < 5)",
        "Review articles, editorials, or conference abstracts",
        "Non-English publications",
    ],
)
# results = {"included": 42, "excluded": 187, "uncertain": 8, "output_dir": "..."}
```

Or via CLI:
```bash
srma --project-dir ./my_review \
     --inclusion inclusion_criteria.txt \
     --exclusion exclusion_criteria.txt
```

```
my_review/
  raw/
    pubmed_export.csv      ← PubMed CSV export
    scopus_export.csv      ← Scopus CSV export
    embase_zotero.csv      ← Embase via Zotero CSV
  artifacts/               ← auto-created by sr-pipeline
    merged.csv
    dedup.csv
    screening_results.csv
    prisma_report.md
```

Export format: Zotero CSV export is recommended (works for PubMed, Scopus, and Embase). Direct PubMed CSV also works.
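Conceptually, the merge step just reads every CSV in `raw/` and concatenates the rows. A minimal sketch with pandas, where `merge_raw_exports` is an illustrative stand-in for the package's `merge_csvs()`:

```python
from pathlib import Path

import pandas as pd

def merge_raw_exports(project_dir):
    """Concatenate every CSV in <project_dir>/raw into one DataFrame."""
    frames = []
    for csv_path in sorted(Path(project_dir, "raw").glob("*.csv")):
        df = pd.read_csv(csv_path)
        df["source_file"] = csv_path.name  # keep provenance per record
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

Tracking the source file per record is what makes the per-database counts in the PRISMA flow possible later.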
The screener marks records as uncertain when the abstract is too short to judge or the criteria fit is ambiguous. Review these manually or retry with a stronger model:
```python
# Retry uncertain records with Sonnet
run_pipeline(
    project_dir="./my_review",
    inclusion=INCLUSION_CRITERIA,
    exclusion=EXCLUSION_CRITERIA,
    model="extraction",   # → Claude Sonnet
    retry_uncertain=True,
)
```

Remove duplicates from a DataFrame of citations.

```python
from srma.screening import deduplicate

clean_df, n_before, n_after = deduplicate(df)
```

| Parameter | Default | Description |
|---|---|---|
| `df` | — | DataFrame with `Title` and `DOI` columns |
| `title_threshold` | `0.90` | Fuzzy match threshold for title deduplication |

Returns `(cleaned_df, n_before, n_after)`.
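To get a feel for the 0.90 default, compare two near-identical titles with the same `SequenceMatcher` the deduplicator is described as using:

```python
from difflib import SequenceMatcher

a = "laparoscopic repair of anorectal malformation"
b = "laparoscopic repair of anorectal malformations"  # plural: likely the same paper
ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
print(round(ratio, 3))  # 0.989, comfortably above the 0.90 default
```

Raise the threshold if your field has many legitimately similar titles (e.g. serial cohort updates); lower it if exports mangle punctuation.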
Screen a DataFrame against eligibility criteria via LLM.
```python
from srma.screening import screen_records

df = screen_records(df, inclusion=["..."], exclusion=["..."])
# df now has: decision, confidence, reason columns
```

Decision values: `"include"` | `"exclude"` | `"uncertain"`
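The decision / confidence / reason triple maps naturally onto a JSON reply from the model. How sr-pipeline actually parses replies is internal; the sketch below shows one defensive way to validate such a payload, collapsing anything malformed to `uncertain` (all names here are illustrative):

```python
import json

VALID_DECISIONS = {"include", "exclude", "uncertain"}

def parse_screening_reply(reply_text):
    """Validate a JSON screening reply; anything malformed becomes 'uncertain'."""
    try:
        payload = json.loads(reply_text)
        decision = str(payload.get("decision", "")).lower()
        if decision not in VALID_DECISIONS:
            raise ValueError(f"unexpected decision: {decision!r}")
        return {
            "decision": decision,
            "confidence": float(payload.get("confidence", 0.0)),
            "reason": str(payload.get("reason", "")),
        }
    except (ValueError, TypeError, AttributeError):
        # Never let a bad reply become a silent include/exclude
        return {"decision": "uncertain", "confidence": 0.0, "reason": "unparseable reply"}
```

Failing closed to `uncertain` is the safety property that matters here: a hallucinated or truncated reply lands in the human-review pile instead of silently excluding a study.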
Generate a PRISMA 2020 flow report string.
```python
from srma.screening import generate_prisma_report

report, n_inc, n_exc, n_unc = generate_prisma_report("MY_PROJECT", 500, 420, df)
```

Text normalization helpers used internally — useful for custom deduplication logic.
```python
from srma.utils import normalize_doi, normalize_title

normalize_doi("https://doi.org/10.1234/abc")    # → "10.1234/abc"
normalize_title("Effect of Surgery: A Review")  # → "effect of surgery a review"
```

| Role key | Default model | Best for |
|---|---|---|
| `"screening"` | Claude Haiku | High-volume title/abstract screening |
| `"extraction"` | Claude Sonnet | Data extraction, uncertain records |
| `"drafting"` | Claude Sonnet | Results section drafting |
| `"polishing"` | Claude Sonnet | Manuscript polish |

Override: `screen_records(df, ..., model="extraction")`
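For readers building custom deduplication logic, the documented behavior of `normalize_doi` and `normalize_title` can be approximated in a few lines of stdlib code. This is an illustrative reimplementation, not the package's source:

```python
import re

def normalize_doi_sketch(doi):
    """Drop the resolver prefix and lowercase, so DOIs compare exactly."""
    doi = doi.strip().lower()
    return re.sub(r"^https?://(dx\.)?doi\.org/", "", doi)

def normalize_title_sketch(title):
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", title.lower())
    return re.sub(r"\s+", " ", cleaned).strip()
```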
```bash
git clone https://github.com/tuyentran-md/sr-pipeline
cd sr-pipeline
pip install -e ".[dev]"
pytest
```

68 tests, no API calls required. Tests use mocked LLM responses.
- Full-text PDF highlighting (AI locates data, human extracts) (`srma.extraction`)
- R analysis script generator (`srma.r_analysis`)
- Reference verification via CrossRef API (`srma.references`)
- PROSPERO protocol outline generator (`srma.outline`)
- Network meta-analysis support (`srma.nma`)
This repo grew out of a real systematic review on outcomes after anorectal malformation repair (E1_ARM project). The deduplication and screening logic has been validated against manual screening on ~500 records. Our core belief: AI should map and screen, but humans must extract and interpret.
Read the full methodology: How to Use AI for Systematic Reviews Without Compromising Rigor.
MIT — see LICENSE.
Built by Tuyen Tran — pediatric surgeon & clinical researcher.