
enhancement/715/parallelize-etl-gha#728

Open
santoshgdev wants to merge 3 commits into develop from enhancement/715/parallelize-etl-gha
Conversation

@santoshgdev
Collaborator

Description

To parallelize the ETL steps, I've moved the work that runs before and after the ETL into their own jobs and made the actual ETL step a matrix strategy.
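The split described above might look roughly like this (a minimal sketch; the job names, dataset list, and step bodies are assumptions based on this thread, not the actual workflow):

```yaml
# Hypothetical sketch of the pre-etl / run-etl / post-etl split; names are illustrative.
jobs:
  pre-etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "setup that used to run before the ETL step"

  run-etl:
    needs: pre-etl
    runs-on: ubuntu-latest
    strategy:
      matrix:
        dataset: [tsunami, soft_story, liquefaction]  # one parallel job per dataset
    steps:
      - run: echo "run ETL for ${{ matrix.dataset }}"

  post-etl:
    needs: run-etl
    runs-on: ubuntu-latest
    steps:
      - run: echo "steps that used to run after the ETL step"
```

With `needs`, the matrix jobs fan out in parallel between the pre- and post-ETL jobs.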

Type of changes

  • ( ) Bugfix
  • ( ) Chore
  • (x) New Feature

Testing

  • ( ) I added automated tests
  • ( ) I think tests are unnecessary

How to test

testing description here: i.e. run app, go to x page, see that it does y

Clean commits

  • (x) I plan to Squash and Merge
  • ( ) My commit history is clean¹
    ¹ described here

@vercel

vercel bot commented Nov 15, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: datasci-earthquake | Deployment: Ready | Preview | Comment | Updated (UTC): Nov 20, 2025 3:42am

Collaborator

@qredwoods left a comment


Thanks Santosh! I'll defer to Anna on approval, but condensing those lines into a structure we iterate over, as on line 69, LGTM and is good style/prep for expansion. I expect this pattern will recur many times across the project, since I'm guessing there was a history of adding one dataset at a time.

Tangent (and sorry if my terms are blurry here): maybe we can generalize that into a style guideline, specifically around handling that expected-to-grow trio. There was also discussion, in the case of a prior issue, about abstracting at a higher level the places where liquefaction, soft_story, and tsunami are swapped in. In the future, @agennadi, we could look at whether there's a design that minimizes the need to add an nth thing in every place in the code when an nth data layer is added, perhaps by making it a single reference that automatically goes through all the layers/sets. Maybe the principle is: when something touches all of the data layers (i.e. tsunami, soft-story AND liquefaction), take advantage of opportunities to limit repetition of code, as Santosh has done here.

@qredwoods
Collaborator

I saw Anna's thumbs up on my comment above -- if that means you support PR approval, @agennadi, I can go ahead and approve this, or you can approve directly.

@qredwoods
Collaborator

(Also, to follow up on my point above about standardizing that we always try to combine the layers: in tonight's meeting I was successfully convinced it doesn't make sense to do that. Anna pointed out that in other parts of the code the treatment of each dataset is necessarily very distinct due to differences between the datasets -- for example, liquefaction and tsunami zones are mapped polygon-based, whereas soft-story is address-based data. So when there's code duplication like here, or more so in #695, sure, take opportunities to consolidate, but let's not design away the flexibility in pursuit of arbitrary unity, especially as each new layer will bring additional unique traits that require special-case handling.)

Collaborator

@agennadi left a comment


I ran the workflow from a test branch (715/test) and the jobs are failing (check out the logs here: https://github.com/sfbrigade/datasci-earthquake/actions/runs/19606231119).

This happens because each GitHub job runs in an isolated environment. The pre-etl job installs Python and the dependencies, but run-etl knows nothing about them, so the job fails with a uv: command not found error.

post-etl fails with fatal: not a git repository (or any of the parent directories): .git for the same reason: there is no repository checkout in that job.

Thus, each job would require its own setup and/or checkout. Alternatively, we could give up the matrices and go with parallel execution in bash using & -- what do you think?
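Both fixes could be sketched roughly like this (hypothetical; the action versions, dataset names, and commands are assumptions, and `run_etl` is a placeholder for the real ETL invocation). Option 1 repeats checkout and setup in every matrix job; option 2 keeps one job and parallelizes inside bash:

```yaml
# Option 1: give each matrix job its own checkout and tool setup (hypothetical sketch).
run-etl:
  needs: pre-etl
  runs-on: ubuntu-latest
  strategy:
    matrix:
      dataset: [tsunami, soft_story, liquefaction]
  steps:
    - uses: actions/checkout@v4     # each isolated job needs its own checkout
    - uses: astral-sh/setup-uv@v5   # and its own uv installation (assumed action)
    - run: uv sync
    - run: echo "run ETL for ${{ matrix.dataset }}"

# Option 2: a single job that parallelizes with bash `&` instead of a matrix.
run-etl-bash:
  needs: pre-etl
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: |
        pids=()
        for dataset in tsunami soft_story liquefaction; do
          run_etl "$dataset" &   # run_etl is a placeholder for the real command
          pids+=($!)
        done
        for pid in "${pids[@]}"; do
          wait "$pid"            # propagate any non-zero exit from a background job
        done
```

A trade-off to note: the matrix gives separate logs and per-dataset retries at the cost of repeated setup, while the bash version shares one environment but interleaves logs in a single job.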



3 participants