
enhancement/715/parallelize-etl-gha#728

Open
santoshgdev wants to merge 3 commits into develop from enhancement/715/parallelize-etl-gha
Conversation

@santoshgdev
Collaborator

Description

To parallelize the ETL steps, I've moved the work that runs before and after the ETL into their own jobs and made the actual ETL step a matrix strategy.
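The split described above might look roughly like this (a minimal sketch; the job names, dataset list, and step bodies are assumptions based on this thread, not the actual workflow):

```yaml
# Hypothetical sketch of the pre-etl / run-etl / post-etl split; names are illustrative.
jobs:
  pre-etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "setup that used to run before the ETL step"

  run-etl:
    needs: pre-etl
    runs-on: ubuntu-latest
    strategy:
      matrix:
        dataset: [tsunami, soft_story, liquefaction]  # one parallel job per dataset
    steps:
      - run: echo "run ETL for ${{ matrix.dataset }}"

  post-etl:
    needs: run-etl
    runs-on: ubuntu-latest
    steps:
      - run: echo "steps that used to run after the ETL step"
```

With `needs`, the matrix jobs fan out in parallel between the pre- and post-ETL jobs.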

Type of changes

  • ( ) Bugfix
  • ( ) Chore
  • (x) New Feature

Testing

  • ( ) I added automated tests
  • ( ) I think tests are unnecessary

How to test

testing description here: i.e. run app, go to x page, see that it does y

Clean commits

  • (x) I plan to Squash and Merge
  • ( ) My commit history is clean¹
    ¹ described here

@vercel

vercel bot commented Nov 15, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: datasci-earthquake | Deployment: Ready | Preview | Comment | Updated (UTC): Nov 20, 2025 3:42am

Collaborator

@qredwoods left a comment


Thanks Santosh! I'll defer to Anna on approval, but condensing those lines into a structure we iterate over, as on line 69, LGTM and is good style/prep for expansion. I expect this pattern will recur many times across the project, since I'm guessing there was a history of adding one dataset at a time.

Tangent (and sorry if my terms are blurry here): maybe we can generalize that into a style guideline, specifically around handling that expected-to-grow trio. There was also discussion, in the case of a prior issue, about abstracting at a higher level the places where liquefaction, soft_story, and tsunami are swapped in. In the future, @agennadi, we could look at whether there's a design that minimizes the need to add an nth thing in every place in the code when an nth data layer is added, perhaps by making it a single reference that automatically goes through all the layers/sets. Maybe the principle is: when something touches all of the data layers (i.e. tsunami, soft-story AND liquefaction), take advantage of opportunities to limit repetition of code, as Santosh has done here.

@qredwoods
Collaborator

I saw Anna's thumbs up on my comment above -- if that means you support PR approval, @agennadi, I can go ahead and approve this, or you can approve directly.

@qredwoods
Collaborator

(Also, to follow up on my point above about standardizing that we always try to combine the layers: in tonight's meeting I was successfully convinced it doesn't make sense to do that. Anna pointed out that in other parts of the code the treatment of each dataset is necessarily very distinct due to differences between the datasets -- for example, liquefaction and tsunami zones are mapped polygon-based, whereas soft-story is address-based data. So when there's code duplication like here, or more so in #695, sure, take opportunities to consolidate, but let's not design away the flexibility in pursuit of arbitrary unity, especially as each new layer will bring additional unique traits that require special-case handling.)

Collaborator

@agennadi left a comment


I ran the workflow from a test branch (715/test) and the jobs are failing (check out the logs here: https://github.com/sfbrigade/datasci-earthquake/actions/runs/19606231119).

This happens because each GitHub job runs in an isolated environment. The pre-etl job installs Python and the dependencies, but run-etl knows nothing about them, so the job fails with a uv: command not found error.

post-etl fails with fatal: not a git repository (or any of the parent directories): .git for the same reason: there is no repository checkout in that job.

Thus, each job would require its own setup and/or checkout. Alternatively, we could give up the matrices and go with parallel execution in bash using & -- what do you think?
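Both fixes could be sketched roughly like this (hypothetical; the action versions, dataset names, and commands are assumptions, and `run_etl` is a placeholder for the real ETL invocation). Option 1 repeats checkout and setup in every matrix job; option 2 keeps one job and parallelizes inside bash:

```yaml
# Option 1: give each matrix job its own checkout and tool setup (hypothetical sketch).
run-etl:
  needs: pre-etl
  runs-on: ubuntu-latest
  strategy:
    matrix:
      dataset: [tsunami, soft_story, liquefaction]
  steps:
    - uses: actions/checkout@v4     # each isolated job needs its own checkout
    - uses: astral-sh/setup-uv@v5   # and its own uv installation (assumed action)
    - run: uv sync
    - run: echo "run ETL for ${{ matrix.dataset }}"

# Option 2: a single job that parallelizes with bash `&` instead of a matrix.
run-etl-bash:
  needs: pre-etl
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: |
        pids=()
        for dataset in tsunami soft_story liquefaction; do
          run_etl "$dataset" &   # run_etl is a placeholder for the real command
          pids+=($!)
        done
        for pid in "${pids[@]}"; do
          wait "$pid"            # propagate any non-zero exit from a background job
        done
```

A trade-off to note: the matrix gives separate logs and per-dataset retries at the cost of repeated setup, while the bash version shares one environment but interleaves logs in a single job.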



3 participants