Skip to content

fix(ci): harden weekly etl integrity and remove nltk#17

Merged
Sam-24-dev merged 1 commit into
mainfrom
codex/fix-weekly-etl-integrity
Mar 24, 2026
Merged

fix(ci): harden weekly etl integrity and remove nltk#17
Sam-24-dev merged 1 commit into
mainfrom
codex/fix-weekly-etl-integrity

Conversation

@Sam-24-dev

Copy link
Copy Markdown
Owner

This pull request introduces several improvements to the ETL workflow, artifact handling, and data integrity validation, while also simplifying the sentiment analysis dependency in the backend. The most significant changes are the introduction of two new scripts for artifact materialization and bridge integrity checks, updates to the workflow YAMLs to use these scripts, and the replacement of NLTK with vaderSentiment for sentiment analysis. Additionally, a new test suite ensures the correctness of the bridge integrity checks.

Workflow and Artifact Handling Improvements:

  • Added a new script, scripts/materialize_etl_artifacts.py, to consistently restore and organize ETL artifacts and frontend assets in the project workspace, replacing scattered shell logic in workflows. (.github/workflows/etl_semanal.yml, scripts/materialize_etl_artifacts.py) [1] [2] [3]
  • Updated the ETL workflow to use the new artifact materialization script for both previous and current aggregate outputs, ensuring reliable artifact restoration. (.github/workflows/etl_semanal.yml) [1] [2]
  • Improved artifact verification by checking for required files in both datos/ and datos/latest/, increasing robustness. (.github/workflows/etl_semanal.yml) [1] [2]

Data Integrity and Validation Enhancements:

  • Introduced scripts/check_bridge_integrity.py, a script that validates the integrity and continuity of frontend bridge datasets before publishing. This is now enforced as a workflow step, with logic to require previous history when appropriate. (.github/workflows/etl_semanal.yml, scripts/check_bridge_integrity.py) [1] [2]
  • Added a comprehensive test suite for bridge integrity validation, covering both healthy and degraded scenarios. (tests/test_check_bridge_integrity.py)

Backend and Dependency Simplification:

  • Replaced NLTK with vaderSentiment for sentiment analysis in the Reddit ETL, removing NLTK resource downloads and simplifying requirements. (backend/reddit_etl.py, backend/requirements.txt, .github/workflows/ci.yml, .github/workflows/etl_semanal.yml) [1] [2] [3] [4] [5] [6] [7]
  • Removed the temporary CVE ignore for NLTK in the dependency security workflow, as NLTK is no longer used. (.github/workflows/dependency_security.yml)

Copilot AI review requested due to automatic review settings March 23, 2026 23:47

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the weekly ETL pipeline by standardizing artifact restoration, adding a bridge integrity gate before publish, and simplifying Reddit sentiment analysis by removing NLTK in favor of vaderSentiment.

Changes:

  • Introduces scripts/materialize_etl_artifacts.py to restore ETL artifacts (legacy/latest/history/metadata + frontend assets) in a consistent way and wires it into the weekly workflow.
  • Adds scripts/check_bridge_integrity.py plus a new test suite, and enforces it as a publish gate in the weekly workflow.
  • Replaces NLTK sentiment usage/download steps with vaderSentiment, updating requirements, CI, and tests accordingly.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
.github/workflows/etl_semanal.yml Replaces ad-hoc shell restore logic with artifact materialization script; adds bridge integrity gate; removes NLTK download steps.
.github/workflows/ci.yml Removes NLTK data download step now that NLTK is no longer used.
.github/workflows/dependency_security.yml Removes the temporary NLTK CVE ignore and runs pip-audit normally.
scripts/materialize_etl_artifacts.py New utility to copy/overlay artifacts into datos/ + frontend/assets/data across supported artifact layouts.
scripts/check_bridge_integrity.py New integrity validator for bridge JSON outputs used as a workflow gate before publish.
backend/reddit_etl.py Switches sentiment analyzer import from NLTK to vaderSentiment and removes runtime NLTK resource download step.
backend/requirements.txt Removes nltk dependency and adds vaderSentiment.
tests/test_workflow_etl_contract.py Updates workflow contract assertions to reflect new script-based workflow steps; adds regression for removed NLTK download step.
tests/test_reddit_etl.py Updates step-count/name assertions after removing the NLTK preparation step.
tests/test_materialize_etl_artifacts.py New tests covering both nested datos/ and flattened artifact layouts, plus overlay order semantics.
tests/test_check_bridge_integrity.py New tests for bridge integrity checks across healthy, collapsed-history, and bootstrap scenarios.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +78 to +84
home_highlights = _load_json(assets_root / "home_highlights.json")
highlights = home_highlights.get("highlights", [])
minimum_highlights = 3 if expect_previous_history else 2
if len(highlights) < minimum_highlights:
errors.append(
f"home_highlights highlights={len(highlights)} < {minimum_highlights}"
)

Copilot AI Mar 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_bridge_integrity hard-fails when home_highlights.highlights has fewer than 3 entries whenever expect_previous_history=True. However export_history_json.build_home_highlights_payload() can legitimately emit <3 highlights when some upstream bridge summaries are null/empty (it only adds candidates when a specific payload is present). This makes the integrity gate brittle and can block publishes even when the bridge files are structurally valid and the UI can run in degraded mode. Consider validating the JSON shape (e.g., highlights is a list, candidate_count >= len(highlights)) and using a lower/conditional minimum (or tie the minimum to candidate_count) instead of a fixed 3.

Copilot uses AI. Check for mistakes.
@Sam-24-dev Sam-24-dev merged commit 34a53c4 into main Mar 24, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants