[CI] Export scraped commit data to a BigQuery dataset #532
Conversation
Is there a reason you've chosen BigQuery here over any of the other GCP services?
I don't think this dataset is ever going to end up being particularly "big". Not sure if there is easier integration with internal tooling or something though.
Mostly looks reasonable enough to me, just a couple questions.
```python
is_reviewed: bool = False
is_approved: bool = False
reviewers: set[str] = dataclasses.field(default_factory=set)
```
Why the inconsistency here? You default initialize this, but don't default initialize `commit_sha`, `commit_timestamp_seconds`, or `files_modified`. You also use a set here when both `files_modified` and `reviewers` could be represented as sets.
`commit_sha`, `commit_timestamp_seconds`, and `files_modified` will always be available from the initial scrape of the repo, so they don't require a default value. There are a handful of fields that cannot be set if a particular commit does not have an associated PR (we can't determine the list of reviewers without a PR, for example), so the default values are there to ensure that those fields are still populated when uploading to Grafana/BigQuery.

As far as typing goes, I agree about using sets for both `files_modified` and `reviewers`. I've updated the class to reflect that.
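For readers skimming the thread, here is a minimal sketch of what the resolved dataclass could look like, limited to the fields discussed above; the class name and the exact layout are assumptions, not the actual code in this PR:

```python
import dataclasses


@dataclasses.dataclass
class ScrapedCommit:  # hypothetical name for illustration
    # Always known from the initial scrape of llvm-project, so no defaults.
    commit_sha: str
    commit_timestamp_seconds: int
    files_modified: set[str]
    # Only derivable when the commit has an associated PR, so these fields
    # are default-initialized to keep every row fully populated on upload.
    is_reviewed: bool = False
    is_approved: bool = False
    reviewers: set[str] = dataclasses.field(default_factory=set)
```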
```diff
@@ -142,27 +163,33 @@ def query_for_reviews(
     )
     if response.status_code < 200 or response.status_code >= 300:
         logging.error("Failed to query GitHub GraphQL API: %s", response.text)
         exit(1)
```
Can you add a comment on why we want to fail hard instead of gracefully continuing?
I'm presuming because if we miss an entire batch that's a pretty large chunk of data so failing gracefully doesn't make much sense?
That's correct, I've added a comment per your suggestion
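For context, the rationale is roughly what the comment in the sketch below spells out; the helper name, endpoint constant, and request shape are assumptions rather than the PR's actual code:

```python
import logging
import sys

import requests

GITHUB_GRAPHQL_ENDPOINT = "https://api.github.com/graphql"


def query_for_reviews_sketch(query: str, token: str) -> dict:
    """Illustrative stand-in for the batched GraphQL review query."""
    response = requests.post(
        GITHUB_GRAPHQL_ENDPOINT,
        json={"query": query},
        headers={"Authorization": f"Bearer {token}"},
    )
    if response.status_code < 200 or response.status_code >= 300:
        # Fail hard rather than continuing: a failed request drops an entire
        # batch of review data, and a partial upload would silently skew the
        # dataset, so it is better to abort and retry the whole run.
        logging.error("Failed to query GitHub GraphQL API: %s", response.text)
        sys.exit(1)
    return response.json()
```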
Internal tooling is the primary reason; it's fairly straightforward for us to access BigQuery data internally without much overhead.
Currently, the data we scrape and process regarding LLVM commits isn't persistent and cannot be referenced outside of each CronJob invocation. This change uploads scraped and parsed LLVM commit data to a new BigQuery dataset, so that we can access and reuse this data without having to re-query and re-parse the same commits to llvm-project.
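As a rough illustration of the upload step described here, the sketch below uses the google-cloud-bigquery client's streaming inserts; the project, dataset, and table names are placeholders, and the row shape assumes the dataclass sketched earlier in the thread:

```python
import dataclasses

from google.cloud import bigquery  # pip install google-cloud-bigquery


def upload_commits(commits: list) -> None:
    """Upload parsed commit records to BigQuery as JSON rows."""
    client = bigquery.Client()
    # Placeholder identifiers; the real project/dataset/table names may differ.
    table_id = "example-project.llvm_commits.scraped_commits"
    rows = []
    for commit in commits:
        row = dataclasses.asdict(commit)
        # Sets are not JSON-serializable, so convert them to sorted lists.
        row["files_modified"] = sorted(row["files_modified"])
        row["reviewers"] = sorted(row["reviewers"])
        rows.append(row)
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```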