fix(ci): harden weekly etl integrity and remove nltk#17
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens the weekly ETL pipeline by standardizing artifact restoration, adding a bridge integrity gate before publish, and simplifying Reddit sentiment analysis by removing NLTK in favor of vaderSentiment.
Changes:
- Introduces
scripts/materialize_etl_artifacts.pyto restore ETL artifacts (legacy/latest/history/metadata + frontend assets) in a consistent way and wires it into the weekly workflow. - Adds
scripts/check_bridge_integrity.pyplus a new test suite, and enforces it as a publish gate in the weekly workflow. - Replaces NLTK sentiment usage/download steps with
vaderSentiment, updating requirements, CI, and tests accordingly.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
.github/workflows/etl_semanal.yml |
Replaces ad-hoc shell restore logic with artifact materialization script; adds bridge integrity gate; removes NLTK download steps. |
.github/workflows/ci.yml |
Removes NLTK data download step now that NLTK is no longer used. |
.github/workflows/dependency_security.yml |
Removes the temporary NLTK CVE ignore and runs pip-audit normally. |
scripts/materialize_etl_artifacts.py |
New utility to copy/overlay artifacts into datos/ + frontend/assets/data across supported artifact layouts. |
scripts/check_bridge_integrity.py |
New integrity validator for bridge JSON outputs used as a workflow gate before publish. |
backend/reddit_etl.py |
Switches sentiment analyzer import from NLTK to vaderSentiment and removes runtime NLTK resource download step. |
backend/requirements.txt |
Removes nltk dependency and adds vaderSentiment. |
tests/test_workflow_etl_contract.py |
Updates workflow contract assertions to reflect new script-based workflow steps; adds regression for removed NLTK download step. |
tests/test_reddit_etl.py |
Updates step-count/name assertions after removing the NLTK preparation step. |
tests/test_materialize_etl_artifacts.py |
New tests covering both nested datos/ and flattened artifact layouts, plus overlay order semantics. |
tests/test_check_bridge_integrity.py |
New tests for bridge integrity checks across healthy, collapsed-history, and bootstrap scenarios. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| home_highlights = _load_json(assets_root / "home_highlights.json") | ||
| highlights = home_highlights.get("highlights", []) | ||
| minimum_highlights = 3 if expect_previous_history else 2 | ||
| if len(highlights) < minimum_highlights: | ||
| errors.append( | ||
| f"home_highlights highlights={len(highlights)} < {minimum_highlights}" | ||
| ) |
There was a problem hiding this comment.
check_bridge_integrity hard-fails when home_highlights.highlights has fewer than 3 entries whenever expect_previous_history=True. However export_history_json.build_home_highlights_payload() can legitimately emit <3 highlights when some upstream bridge summaries are null/empty (it only adds candidates when a specific payload is present). This makes the integrity gate brittle and can block publishes even when the bridge files are structurally valid and the UI can run in degraded mode. Consider validating the JSON shape (e.g., highlights is a list, candidate_count >= len(highlights)) and using a lower/conditional minimum (or tie the minimum to candidate_count) instead of a fixed 3.
This pull request introduces several improvements to the ETL workflow, artifact handling, and data integrity validation, while also simplifying the sentiment analysis dependency in the backend. The most significant changes are the introduction of two new scripts for artifact materialization and bridge integrity checks, updates to the workflow YAMLs to use these scripts, and the replacement of NLTK with vaderSentiment for sentiment analysis. Additionally, a new test suite ensures the correctness of the bridge integrity checks.
Workflow and Artifact Handling Improvements:
scripts/materialize_etl_artifacts.py, to consistently restore and organize ETL artifacts and frontend assets in the project workspace, replacing scattered shell logic in workflows. (.github/workflows/etl_semanal.yml,scripts/materialize_etl_artifacts.py) [1] [2] [3].github/workflows/etl_semanal.yml) [1] [2]datos/anddatos/latest/, increasing robustness. (.github/workflows/etl_semanal.yml) [1] [2]Data Integrity and Validation Enhancements:
scripts/check_bridge_integrity.py, a script that validates the integrity and continuity of frontend bridge datasets before publishing. This is now enforced as a workflow step, with logic to require previous history when appropriate. (.github/workflows/etl_semanal.yml,scripts/check_bridge_integrity.py) [1] [2]tests/test_check_bridge_integrity.py)Backend and Dependency Simplification:
vaderSentimentfor sentiment analysis in the Reddit ETL, removing NLTK resource downloads and simplifying requirements. (backend/reddit_etl.py,backend/requirements.txt,.github/workflows/ci.yml,.github/workflows/etl_semanal.yml) [1] [2] [3] [4] [5] [6] [7].github/workflows/dependency_security.yml)