fix(ci): harden weekly etl integrity and remove nltk by Sam-24-dev · Pull Request #17 · Sam-24-dev/Technology-trend-analysis-platform

Sam-24-dev · 2026-03-23T23:47:04Z

This pull request introduces several improvements to the ETL workflow, artifact handling, and data integrity validation, while also simplifying the sentiment analysis dependency in the backend. The most significant changes are the introduction of two new scripts for artifact materialization and bridge integrity checks, updates to the workflow YAMLs to use these scripts, and the replacement of NLTK with vaderSentiment for sentiment analysis. Additionally, a new test suite ensures the correctness of the bridge integrity checks.

Workflow and Artifact Handling Improvements:

Added a new script, scripts/materialize_etl_artifacts.py, to consistently restore and organize ETL artifacts and frontend assets in the project workspace, replacing scattered shell logic in workflows. (.github/workflows/etl_semanal.yml, scripts/materialize_etl_artifacts.py) [1] [2] [3]
Updated the ETL workflow to use the new artifact materialization script for both previous and current aggregate outputs, ensuring reliable artifact restoration. (.github/workflows/etl_semanal.yml) [1] [2]
Improved artifact verification by checking for required files in both datos/ and datos/latest/, increasing robustness. (.github/workflows/etl_semanal.yml) [1] [2]

Data Integrity and Validation Enhancements:

Introduced scripts/check_bridge_integrity.py, a script that validates the integrity and continuity of frontend bridge datasets before publishing. This is now enforced as a workflow step, with logic to require previous history when appropriate. (.github/workflows/etl_semanal.yml, scripts/check_bridge_integrity.py) [1] [2]
Added a comprehensive test suite for bridge integrity validation, covering both healthy and degraded scenarios. (tests/test_check_bridge_integrity.py)

Backend and Dependency Simplification:

Replaced NLTK with vaderSentiment for sentiment analysis in the Reddit ETL, removing NLTK resource downloads and simplifying requirements. (backend/reddit_etl.py, backend/requirements.txt, .github/workflows/ci.yml, .github/workflows/etl_semanal.yml) [1] [2] [3] [4] [5] [6] [7]
Removed the temporary CVE ignore for NLTK in the dependency security workflow, as NLTK is no longer used. (.github/workflows/dependency_security.yml)

Copilot

Pull request overview

This PR hardens the weekly ETL pipeline by standardizing artifact restoration, adding a bridge integrity gate before publish, and simplifying Reddit sentiment analysis by removing NLTK in favor of vaderSentiment.

Changes:

Introduces scripts/materialize_etl_artifacts.py to restore ETL artifacts (legacy/latest/history/metadata + frontend assets) in a consistent way and wires it into the weekly workflow.
Adds scripts/check_bridge_integrity.py plus a new test suite, and enforces it as a publish gate in the weekly workflow.
Replaces NLTK sentiment usage/download steps with vaderSentiment, updating requirements, CI, and tests accordingly.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`.github/workflows/etl_semanal.yml`	Replaces ad-hoc shell restore logic with artifact materialization script; adds bridge integrity gate; removes NLTK download steps.
`.github/workflows/ci.yml`	Removes NLTK data download step now that NLTK is no longer used.
`.github/workflows/dependency_security.yml`	Removes the temporary NLTK CVE ignore and runs `pip-audit` normally.
`scripts/materialize_etl_artifacts.py`	New utility to copy/overlay artifacts into `datos/` + `frontend/assets/data` across supported artifact layouts.
`scripts/check_bridge_integrity.py`	New integrity validator for bridge JSON outputs used as a workflow gate before publish.
`backend/reddit_etl.py`	Switches sentiment analyzer import from NLTK to `vaderSentiment` and removes runtime NLTK resource download step.
`backend/requirements.txt`	Removes `nltk` dependency and adds `vaderSentiment`.
`tests/test_workflow_etl_contract.py`	Updates workflow contract assertions to reflect new script-based workflow steps; adds regression for removed NLTK download step.
`tests/test_reddit_etl.py`	Updates step-count/name assertions after removing the NLTK preparation step.
`tests/test_materialize_etl_artifacts.py`	New tests covering both nested `datos/` and flattened artifact layouts, plus overlay order semantics.
`tests/test_check_bridge_integrity.py`	New tests for bridge integrity checks across healthy, collapsed-history, and bootstrap scenarios.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-23T23:54:34Z

+    home_highlights = _load_json(assets_root / "home_highlights.json")
+    highlights = home_highlights.get("highlights", [])
+    minimum_highlights = 3 if expect_previous_history else 2
+    if len(highlights) < minimum_highlights:
+        errors.append(
+            f"home_highlights highlights={len(highlights)} < {minimum_highlights}"
+        )


check_bridge_integrity hard-fails when home_highlights.highlights has fewer than 3 entries whenever expect_previous_history=True. However export_history_json.build_home_highlights_payload() can legitimately emit <3 highlights when some upstream bridge summaries are null/empty (it only adds candidates when a specific payload is present). This makes the integrity gate brittle and can block publishes even when the bridge files are structurally valid and the UI can run in degraded mode. Consider validating the JSON shape (e.g., highlights is a list, candidate_count >= len(highlights)) and using a lower/conditional minimum (or tie the minimum to candidate_count) instead of a fixed 3.

fix(ci): harden weekly etl integrity and remove nltk

a72bbad

Copilot AI review requested due to automatic review settings March 23, 2026 23:47

Copilot started reviewing on behalf of Sam-24-dev March 23, 2026 23:47 View session

Copilot AI reviewed Mar 23, 2026

View reviewed changes

Sam-24-dev merged commit 34a53c4 into main Mar 24, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ci): harden weekly etl integrity and remove nltk#17

fix(ci): harden weekly etl integrity and remove nltk#17
Sam-24-dev merged 1 commit into
mainfrom
codex/fix-weekly-etl-integrity

Sam-24-dev commented Mar 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Sam-24-dev commented Mar 23, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants