Conversation

@jriv01 (Contributor) commented Jul 31, 2025

Currently, the data we scrape and process regarding LLVM commits isn't persistent and cannot be referenced outside of each CronJob invocation. This change uploads scraped and parsed LLVM commit data to a new BigQuery dataset, so that we may access and reuse this data without having to requery and reparse the same commits to llvm-project.
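
For illustration, a minimal sketch of what such an upload step could look like, assuming the google-cloud-bigquery client library and hypothetical project/dataset/table names (the actual schema and table setup live in this PR's diff):

```python
import dataclasses
from google.cloud import bigquery


def upload_commits(commits, table_id="my-project.llvm_commits.parsed_commits"):
    """Insert parsed commit records into a BigQuery table (hypothetical name)."""
    client = bigquery.Client()

    def jsonable(value):
        # The JSON streaming-insert API does not accept Python sets, so any
        # set-valued fields (e.g. reviewers) are sent as sorted lists.
        return sorted(value) if isinstance(value, set) else value

    rows = [
        {key: jsonable(val) for key, val in dataclasses.asdict(commit).items()}
        for commit in commits
    ]
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```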

@boomanaiden154 (Contributor) left a comment

Is there a reason you've chosen BigQuery here over any of the other GCP services?

I don't think this dataset is ever going to end up being particularly "big". Not sure if there is easier integration with internal tooling or something though.

Mostly looks reasonable enough to me, just a couple questions.

is_reviewed: bool = False
is_approved: bool = False
reviewers: set[str] = dataclasses.field(default_factory=set)
@boomanaiden154 (Contributor):

Why the inconsistency here? You default initialize this, but don't default initialize commit_sha, commit_timestamp_seconds, or files_modified. You also use a set here when both files_modified and reviewers could be represented as sets.

@jriv01 (Contributor, Author):

commit_sha, commit_timestamp_seconds, and files_modified will always be available from the initial scrape of the repo, so they don't require a default value. There are a handful of fields that cannot be set if a particular commit does not have an associated PR (we can't determine the list of reviewers without a PR, for example), so the default values are there to ensure that those fields are populated when uploading to Grafana/BigQuery.

As far as typing goes, I agree about using sets for both files_modified and reviewers. I've updated the class to reflect that.
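
For reference, a sketch of how the resulting dataclass might be laid out under the scheme described above (the class name is hypothetical; the field names are the ones discussed in this thread):

```python
import dataclasses


@dataclasses.dataclass
class CommitInfo:  # hypothetical name
    # Always available from the initial scrape of llvm-project, so no defaults.
    commit_sha: str
    commit_timestamp_seconds: int
    files_modified: set[str]

    # Only determinable when the commit has an associated PR, so these carry
    # defaults to guarantee the fields are populated for Grafana/BigQuery.
    is_reviewed: bool = False
    is_approved: bool = False
    reviewers: set[str] = dataclasses.field(default_factory=set)
```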

@@ -142,27 +163,33 @@ def query_for_reviews(
    )
    if response.status_code < 200 or response.status_code >= 300:
        logging.error("Failed to query GitHub GraphQL API: %s", response.text)
        exit(1)
@boomanaiden154 (Contributor):

Can you add a comment on why we want to fail hard instead of gracefully continuing?

I'm presuming it's because missing an entire batch is a pretty large chunk of data, so failing gracefully doesn't make much sense?

@jriv01 (Contributor, Author):

That's correct; I've added a comment per your suggestion.
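
Roughly, the annotated check might read like the sketch below (illustrative wording; the exact comment is in the diff):

```python
if response.status_code < 200 or response.status_code >= 300:
    # Fail hard: a failed GraphQL query loses an entire batch of commits,
    # which is too large a gap to silently skip over.
    logging.error("Failed to query GitHub GraphQL API: %s", response.text)
    exit(1)
```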

@jriv01 (Contributor, Author) commented Aug 1, 2025

> Is there a reason you've chosen BigQuery here over any of the other GCP services?
>
> I don't think this dataset is ever going to end up being particularly "big". Not sure if there is easier integration with internal tooling or something though.

Internal tooling is the primary reason; it's fairly straightforward for us to access BigQuery data internally without much overhead.

@boomanaiden154 merged commit ff8bf02 into llvm:main on Aug 1, 2025 (5 checks passed).
@jriv01 deleted the bigquery-dataset branch on August 4, 2025.