Refactor scenarios module for better separation of concerns #2417

@johnjosephhorton

Problem

The edsl/scenarios/ module has grown organically and several files have become god objects with mixed responsibilities. Key issues:

Giant files with too many concerns

  • scenario.py (1519 lines) — core dict class mixed with factory methods (from_pdf, from_html, from_docx, from_image, from_url), chunking, QR codes, display logic, serialization
  • scenario_list_transformer.py (1578 lines) — all list transform logic in a single file
  • file_store.py (1493 lines) — includes ~150 lines of commented-out subclass code

Naming violations

  • DocxScenario.py and PdfExtractor.py use PascalCase filenames, breaking the lowercase snake_case module naming that PEP 8 recommends and that the rest of the package follows

Dead code

  • file_store.py contains ~150 lines of commented-out CSVFileStore, PDFFileStore, etc. subclasses
  • Planning docs (scenario_list_remove.md, scenario_list_source_refactor.md) checked into the package directory

Design note: ScenarioList delegation pattern is intentional

scenario_list.py (1962 lines) delegates most methods as one-line wrappers to ScenarioListTransformer. This is by design: scenario_list.py serves as a table of contents that lets LLMs (and humans) quickly see every available method in one place, while keeping the implementation details in a separate file. This pattern should be preserved. The transformer implementation file can still be broken up into smaller files.
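
A minimal sketch of the delegation pattern described above, for readers who have not opened the files. The class names match the issue, but the method signatures, the `data` attribute, and the static-method style are assumptions for illustration, not the actual edsl implementation.

```python
class ScenarioListTransformer:
    """Stand-in for the implementation file: the real logic lives here."""

    @staticmethod
    def select(scenario_list, *keys):
        # Keep only the requested keys in every scenario.
        return ScenarioList([{k: row[k] for k in keys} for row in scenario_list.data])

    @staticmethod
    def drop(scenario_list, *keys):
        # Remove the given keys from every scenario.
        return ScenarioList(
            [{k: v for k, v in row.items() if k not in keys} for row in scenario_list.data]
        )


class ScenarioList:
    """Stand-in for the TOC file: one short, documented wrapper per method."""

    def __init__(self, data):
        self.data = list(data)  # hypothetical storage: a list of plain dicts

    def select(self, *keys):
        """Return a new ScenarioList with only the given keys."""
        return ScenarioListTransformer.select(self, *keys)

    def drop(self, *keys):
        """Return a new ScenarioList without the given keys."""
        return ScenarioListTransformer.drop(self, *keys)
```

Under this shape, splitting the transformer into several files only changes where each wrapper's one-line body points; the list of wrappers in scenario_list.py stays readable in one place.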

Design note: Keep from_* classmethods, offload implementations to factory

The from_csv, from_pdf, from_url, etc. classmethods on Scenario and ScenarioList are idiomatic Python (consistent with datetime.fromtimestamp, dict.fromkeys, DataFrame.from_records). The public API should not change — no namespace objects like ScenarioList.build_from.csv(). Instead, the from_* classmethods stay as thin TOC entries that delegate to ScenarioFactory / the sources/ subpackage for their implementations.

# scenario_list.py — public API unchanged
@classmethod
def from_csv(cls, source, **kwargs):
    """Create a ScenarioList from a CSV file or URL."""
    return ScenarioFactory.from_csv(source, **kwargs)
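
The snippet above shows only the TOC side. A rough sketch of what the factory side could look like follows; the body of `ScenarioFactory.from_csv`, the `data` attribute, and the restriction to local paths and open text files are all assumptions made for illustration (URL handling is left out).

```python
import csv
import io


class ScenarioFactory:
    """Hypothetical home for from_* implementations (factory.py in the proposed layout)."""

    @staticmethod
    def from_csv(source, **kwargs):
        # Assumption: `source` is a filesystem path or an open text file.
        # Extra keyword arguments are forwarded to csv.DictReader (e.g. delimiter).
        if hasattr(source, "read"):
            rows = list(csv.DictReader(source, **kwargs))
        else:
            with open(source, newline="", encoding="utf-8") as handle:
                rows = list(csv.DictReader(handle, **kwargs))
        return ScenarioList(rows)


class ScenarioList:
    def __init__(self, data):
        self.data = list(data)  # hypothetical storage: a list of plain dicts

    @classmethod
    def from_csv(cls, source, **kwargs):
        """Create a ScenarioList from a CSV file."""
        return ScenarioFactory.from_csv(source, **kwargs)


# Usage: the caller only ever touches the classmethod.
scenarios = ScenarioList.from_csv(io.StringIO("city,population\nOslo,700000\n"))
```

In a real split across files, factory.py and scenario_list.py would reference each other, so one of the two imports would probably need to happen inside the function body to avoid a circular import.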

Proposed reorganization

scenarios/
├── __init__.py              # public API (Scenario, ScenarioList, FileStore) — stays stable
├── scenario.py              # core dict-like Scenario only (~300 lines)
├── scenario_list.py         # ScenarioList TOC — thin delegation wrappers (keep as-is)
├── file_store.py            # FileStore base (slim, no commented-out code)
├── exceptions.py
│
├── handlers/                # ✓ already good — per-format FileStore handlers
├── sources/                 # ✓ already good — per-format ScenarioList sources
│
├── factory.py               # from_* implementations consolidated here, called by TOC methods
├── serialization/           # group serializer files
│   ├── scenario.py
│   └── scenario_list.py
│
├── transforms/              # break up the 1578-line transformer into focused files
│   ├── filter.py
│   ├── mutate.py
│   ├── reshape.py           # pivot, unpivot, group_by, expand
│   ├── combine.py           # concatenate, zip, string_cat
│   └── select.py            # select, drop, keep, rename
│
└── contrib/                 # peripheral features that aren't core
    ├── agent_blueprint.py
    ├── conjoint.py
    ├── gcs.py
    ├── qr_code.py
    └── ranking.py

Key principles

  1. Single responsibility: Scenario should be a dict with serialization. Move from_pdf, from_html, from_docx, from_image, chunk, qr_codes, etc. implementations out.
  2. Preserve the ScenarioList TOC pattern: Keep scenario_list.py as a thin delegation layer. Break up scenario_list_transformer.py into smaller focused files under transforms/, but the delegation wrappers in scenario_list.py stay.
  3. Keep from_* classmethods as public API: No breaking changes. The from_* methods remain on the classes as thin wrappers that delegate to ScenarioFactory / sources/ for implementation.
  4. Delete dead code: Remove commented-out classes in file_store.py and .md planning docs from the package directory.
  5. Rename PascalCase files: DocxScenario.py → docx_scenario.py, PdfExtractor.py → pdf_extractor.py.
  6. Keep __init__.py stable: All public imports stay the same — purely internal reorganization. Users still do from edsl.scenarios import Scenario, ScenarioList.

Suggested order of operations (lowest risk first)

  1. Delete dead/commented-out code and planning docs
  2. Rename PascalCase files
  3. Extract Scenario.from_* factory method implementations into factory.py / sources/ (keep thin classmethods on Scenario and ScenarioList for backward compat)
  4. Break up scenario_list_transformer.py into transforms/ subpackage
  5. Move peripheral features into contrib/
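
To track progress through the steps above, a small audit script along these lines could flag the remaining offenders. The function name `audit_package`, the 1000-line threshold, and the snake_case regex are illustrative choices, not part of edsl.

```python
import re
from pathlib import Path

# Lowercase snake_case module names; leading underscores allow __init__ and _private.
SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9_]*$")


def audit_package(package_dir, max_lines=1000):
    """Return (non_snake_case_files, oversized_files) for .py files under package_dir."""
    bad_names, oversized = [], []
    for path in sorted(Path(package_dir).rglob("*.py")):
        if not SNAKE_CASE.match(path.stem):
            bad_names.append(path.name)
        with path.open(encoding="utf-8", errors="replace") as handle:
            line_count = sum(1 for _ in handle)
        if line_count > max_lines:
            oversized.append((path.name, line_count))
    return bad_names, oversized
```

Run against edsl/scenarios/ before and after each step, it should show the PascalCase list emptying after step 2 and the oversized list shrinking after steps 3 and 4.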

🤖 Generated with Claude Code
