Conversation

@jriv01 (Contributor) commented Jul 31, 2025

Currently, the data we scrape and process regarding LLVM commits isn't persistent and cannot be referenced outside of each CronJob invocation. This change uploads scraped and parsed LLVM commit data to a new BigQuery dataset, so that we may access and reuse this data without having to requery and reparse the same commits to llvm-project.
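
For illustration, a minimal sketch of what such an upload step could look like, assuming the google-cloud-bigquery client library and hypothetical project/dataset/table names (the actual schema and table setup live in this PR's diff):

```python
import dataclasses
from google.cloud import bigquery


def upload_commits(commits, table_id="my-project.llvm_commits.parsed_commits"):
    """Insert parsed commit records into a BigQuery table (hypothetical name)."""
    client = bigquery.Client()

    def jsonable(value):
        # The JSON streaming-insert API does not accept Python sets, so any
        # set-valued fields (e.g. reviewers) are sent as sorted lists.
        return sorted(value) if isinstance(value, set) else value

    rows = [
        {key: jsonable(val) for key, val in dataclasses.asdict(commit).items()}
        for commit in commits
    ]
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```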

@boomanaiden154 (Contributor) left a comment

Is there a reason you've chosen BigQuery here over any of the other GCP services?

I don't think this dataset is ever going to end up being particularly "big". Not sure if there is easier integration with internal tooling or something though.

Mostly looks reasonable enough to me, just a couple questions.

is_reviewed: bool = False
is_approved: bool = False
reviewers: set[str] = dataclasses.field(default_factory=set)
@boomanaiden154 (Contributor):

Why the inconsistency here? You default initialize this, but don't default initialize commit_sha, commit_timestamp_seconds, or files_modified. You also use a set here when both files_modified and reviewers could be represented as sets.

@jriv01 (Contributor, Author):

commit_sha, commit_timestamp_seconds, and files_modified will always be available from the initial scrape of the repo, so they don't require a default value. There are a handful of fields that cannot be set if a particular commit does not have an associated PR (we can't determine the list of reviewers without a PR, for example), so the default values are there to ensure that those fields are populated when uploading to Grafana/BigQuery.

As far as typing goes, I agree about using sets for both files_modified and reviewers. I've updated the class to reflect that.
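
For reference, a sketch of how the resulting dataclass might be laid out under the scheme described above (the class name is hypothetical; the field names are the ones discussed in this thread):

```python
import dataclasses


@dataclasses.dataclass
class CommitInfo:  # hypothetical name
    # Always available from the initial scrape of llvm-project, so no defaults.
    commit_sha: str
    commit_timestamp_seconds: int
    files_modified: set[str]

    # Only determinable when the commit has an associated PR, so these carry
    # defaults to guarantee the fields are populated for Grafana/BigQuery.
    is_reviewed: bool = False
    is_approved: bool = False
    reviewers: set[str] = dataclasses.field(default_factory=set)
```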

@@ -142,27 +163,33 @@ def query_for_reviews(
    )
    if response.status_code < 200 or response.status_code >= 300:
        logging.error("Failed to query GitHub GraphQL API: %s", response.text)
        exit(1)
@boomanaiden154 (Contributor):

Can you add a comment on why we want to fail hard instead of gracefully continuing?

I'm presuming it's because missing an entire batch is a pretty large chunk of data, so failing gracefully doesn't make much sense?

@jriv01 (Contributor, Author):

That's correct; I've added a comment per your suggestion.
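
Roughly, the annotated check might read like the sketch below (illustrative wording; the exact comment is in the diff):

```python
if response.status_code < 200 or response.status_code >= 300:
    # Fail hard: a failed GraphQL query loses an entire batch of commits,
    # which is too large a gap to silently skip over.
    logging.error("Failed to query GitHub GraphQL API: %s", response.text)
    exit(1)
```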

@jriv01 (Contributor, Author) commented Aug 1, 2025

> Is there a reason you've chosen BigQuery here over any of the other GCP services?
>
> I don't think this dataset is ever going to end up being particularly "big". Not sure if there is easier integration with internal tooling or something though.

Internal tooling is the primary reason; it's fairly straightforward for us to access BigQuery data internally without much overhead.

@boomanaiden154 merged commit ff8bf02 into llvm:main on Aug 1, 2025 (5 checks passed).
@jriv01 deleted the bigquery-dataset branch on August 4, 2025.