-
Notifications
You must be signed in to change notification settings - Fork 67
Refactor scenarios module for better separation of concerns #2417
Description
Problem
The edsl/scenarios/ module has grown organically and several files have become god objects with mixed responsibilities. Key issues:
Giant files with too many concerns
scenario.py(1519 lines) — core dict class mixed with factory methods (from_pdf,from_html,from_docx,from_image,from_url), chunking, QR codes, display logic, serializationscenario_list_transformer.py(1578 lines) — all list transform logic in a single filefile_store.py(1493 lines) — includes ~150 lines of commented-out subclass code
Naming violations
DocxScenario.pyandPdfExtractor.pyuse PascalCase filenames, breaking PEP 8 convention used everywhere else
Dead code
file_store.pycontains ~150 lines of commented-outCSVFileStore,PDFFileStore, etc. subclasses- Planning docs (
scenario_list_remove.md,scenario_list_source_refactor.md) checked into the package directory
Design note: ScenarioList delegation pattern is intentional
scenario_list.py (1962 lines) delegates most methods as one-line wrappers to ScenarioListTransformer. This is by design — scenario_list.py serves as a table of contents that lets LLMs (and humans) quickly see every available method in one place, while keeping the implementation details in a separate file. This pattern should be preserved. The transformer implementation file can still be broken up into smaller files.
Design note: Keep from_* classmethods, offload implementations to factory
The from_csv, from_pdf, from_url, etc. classmethods on Scenario and ScenarioList are idiomatic Python (consistent with datetime.fromtimestamp, dict.fromkeys, DataFrame.from_records). The public API should not change — no namespace objects like ScenarioList.build_from.csv(). Instead, the from_* classmethods stay as thin TOC entries that delegate to ScenarioFactory / the sources/ subpackage for their implementations.
# scenario_list.py — public API unchanged
@classmethod
def from_csv(cls, source, **kwargs):
"""Create a ScenarioList from a CSV file or URL."""
return ScenarioFactory.from_csv(source, **kwargs)Proposed reorganization
scenarios/
├── __init__.py # public API (Scenario, ScenarioList, FileStore) — stays stable
├── scenario.py # core dict-like Scenario only (~300 lines)
├── scenario_list.py # ScenarioList TOC — thin delegation wrappers (keep as-is)
├── file_store.py # FileStore base (slim, no commented-out code)
├── exceptions.py
│
├── handlers/ # ✓ already good — per-format FileStore handlers
├── sources/ # ✓ already good — per-format ScenarioList sources
│
├── factory.py # from_* implementations consolidated here, called by TOC methods
├── serialization/ # group serializer files
│ ├── scenario.py
│ └── scenario_list.py
│
├── transforms/ # break up the 1578-line transformer into focused files
│ ├── filter.py
│ ├── mutate.py
│ ├── reshape.py # pivot, unpivot, group_by, expand
│ ├── combine.py # concatenate, zip, string_cat
│ └── select.py # select, drop, keep, rename
│
└── contrib/ # peripheral features that aren't core
├── agent_blueprint.py
├── conjoint.py
├── gcs.py
├── qr_code.py
└── ranking.py
Key principles
- Single responsibility:
Scenarioshould be a dict with serialization. Movefrom_pdf,from_html,from_docx,from_image,chunk,qr_codes, etc. implementations out. - Preserve the ScenarioList TOC pattern: Keep
scenario_list.pyas a thin delegation layer. Break upscenario_list_transformer.pyinto smaller focused files undertransforms/, but the delegation wrappers inscenario_list.pystay. - Keep
from_*classmethods as public API: No breaking changes. Thefrom_*methods remain on the classes as thin wrappers that delegate toScenarioFactory/sources/for implementation. - Delete dead code: Remove commented-out classes in
file_store.pyand.mdplanning docs from the package directory. - Rename PascalCase files:
DocxScenario.py→docx_scenario.py,PdfExtractor.py→pdf_extractor.py. - Keep
__init__.pystable: All public imports stay the same — purely internal reorganization. Users still dofrom edsl.scenarios import Scenario, ScenarioList.
Suggested order of operations (lowest risk first)
- Delete dead/commented-out code and planning docs
- Rename PascalCase files
- Extract
Scenario.from_*factory method implementations intofactory.py/sources/(keep thin classmethods onScenarioandScenarioListfor backward compat) - Break up
scenario_list_transformer.pyintotransforms/subpackage - Move peripheral features into
contrib/
🤖 Generated with Claude Code