[CI] Add design document for post submit testing #512
Changes from 2 commits
41c3fe9
3e365fe
0c14060
2073c1e
384363f
1eee6c7
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,169 @@ | ||||||
| # Post Submit Testing | ||||||
|
|
||||||
| ## Introduction | ||||||
|
|
||||||
| While this infrastructure is focused on premerge testing, it is also important | ||||||
| to make sure that the specific configuration we are testing is tested | ||||||
| postcommit as well. This document outlines the motivation for the need to test this | ||||||
|
||||||
| configuration postcommit, why we are utilizing this design over others, and | ||||||
| how we plan on implementing this to ensure we get fast feedback scalably. | ||||||
|
||||||
|
|
||||||
| ## Background/Motivation | ||||||
|
|
||||||
| LLVM has two types of testing upstream: premerge and postcommit. The premerge | ||||||
|
The default assumption for this document is going to be that "LLVM" refers to upstream llvm, so you can just say "testing" I think.
It's a bit redundant, but I'd prefer to keep it in there. There will probably be internal (to Google) audiences reading this who might not be as familiar with everything upstream and we have a lot of downstream CI, so keeping it explicit might make it more clear for someone and only be redundant for someone else. |
||||||
| testing is performed using GitHub Actions every time a PR is updated before it | ||||||
|
||||||
| is merged. Premerge testing is performed using this infrastructure (specifically | ||||||
| the `./premerge` folder in llvm-zorg). Landing a PR consists of squashing the | ||||||
| changes into a single commit and adding that commit to the `main` branch in the | ||||||
| monorepo. We care specifically about the state of the `main` branch because it | ||||||
|
||||||
| is what the community considers the canonical tree. Currently, commits can also | ||||||
|
||||||
| be added to the `main` branch by pushing to it directly. After a | ||||||
|
||||||
| new commit lands in the `main` branch, postcommit testing is performed. Most | ||||||
| postcommit testing is performed through the Buildbot infrastructure. The main | ||||||
| Buildbot instance for LLVM has a web interface hosted at | ||||||
| [lab.llvm.org](https://lab.llvm.org). When a new commit lands in `main`, the | ||||||
|
||||||
| Buildbot instance (sometimes referred to as the Buildbot master) will trigger | ||||||
|
||||||
| many different build configurations. These configurations are defined in the | ||||||
| llvm-zorg repository under the `buildbot/` folder. These configurations are run | ||||||
|
||||||
| on Buildbot workers that are hosted by the community. | ||||||
|
|
||||||
| It is important that we test the premerge configuration postcommit as well. We | ||||||
|
||||||
| need to be able to determine the state of `main` (in terms of whether the build | ||||||
| passed/failed and what tests failed, if any) for certain types of automation. | ||||||
| Postcommit testing enables easily checking the state of `main` at any given point | ||||||
| in time. This data is crucial for figuring out which commit to revert/fix | ||||||
| forward to get everything back to green. Having information on the state of | ||||||
| `main` is also important for certain kinds of automation, like the planned | ||||||
| premerge testing advisor that will let contributors know if tests failing in | ||||||
| their PR are already failing at `main` and that it should be safe to merge | ||||||
| despite the given failures. | ||||||
|
||||||
|
|
||||||
| ## Design | ||||||
|
||||||
|
|
||||||
| The LLVM Premerge system has two clusters, namely the central cluster in the GCP | ||||||
|
||||||
| zone `us-central1-a` and the west cluster in the GCP region `us-west1`. We run | ||||||
| two clusters in different locations for redundancy so that if one fails, we can | ||||||
| still run jobs. For postcommit testing, we plan on setting up builders attached | ||||||
|
||||||
| to the Buildbot master described above. We will run one builder on the central | ||||||
| cluster and one in the west cluster. This ensures the configuration is highly | ||||||
| available (able to tolerate an entire cluster going down), similar to the | ||||||
| premerge testing. The builders will be configured to use a script that will | ||||||
| launch builds on each commit to `main` in a similar configuration to the one run | ||||||
| premerge. The test configuration is intended to be close to the premerge | ||||||
|
||||||
| configuration but will be different in some key ways. The differences and | ||||||
| motivation for them are described more thoroughly in the | ||||||
| [testing configuration](#testing-configuration) section. These builds will be | ||||||
| run inside containers that are distributed onto the cluster inside Kubernetes | ||||||
| pods (the fundamental schedulable unit inside Kubernetes). This allows | ||||||
| Kubernetes to handle details like what machine a build should run on. Allowing | ||||||
| Kubernetes to handle these details also enables GKE to autoscale the node pools | ||||||
|
||||||
| so we are not paying for unneeded capacity. Launching builds inside pods | ||||||
|
||||||
| also allows each builder to handle multiple builds at the same time. | ||||||
|
|
||||||
| In terms of the full flow, any commit (which can be from direct pushes or | ||||||
| merging pull requests) pushed to the LLVM monorepo will get detected by the | ||||||
| Buildbot master. The Buildbot master will invoke Buildbot workers running on our | ||||||
| clusters. These Buildbot workers will use custom builders to launch a build | ||||||
| wrapped in a Kubernetes pod and report the results back to the Buildbot master. | ||||||
| When the job is finished, the pod will complete and capacity will be available | ||||||
| for another build, or, if there is nothing left to test, GKE will see that there | ||||||
| is nothing running on one of the nodes and downscale the node pool. | ||||||
|
|
||||||
| ### Annotated Builder | ||||||
|
|
||||||
| llvm-zorg has multiple types of builders. We plan on using an AnnotatedBuilder. | ||||||
| AnnotatedBuilders allow for the build to be driven using a custom Python script | ||||||
| rather than directly dictating the shell commands that should be run to perform | ||||||
| the build. We need the flexibility of the AnnotatedBuilder to deploy jobs on the | ||||||
| cluster. AnnotatedBuilder-based builders also enable deploying changes without | ||||||
| needing to restart the Buildbot master. Without this, we would have to wait for an | ||||||
| administrator of the LLVM Buildbot master to restart it before our changes get | ||||||
| deployed. This could significantly delay updates or responses to incidents, | ||||||
| especially before the system is fully stable and we need to modify it relatively | ||||||
| frequently. | ||||||
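To illustrate what "driven using a custom Python script" can look like, here is a minimal sketch of a script that brackets its work in build steps. It assumes the `@@@BUILD_STEP ...@@@` and `@@@STEP_FAILURE@@@` annotation strings used by existing annotated scripts in llvm-zorg; the `step` helper, `run_build`, and the step names are hypothetical and not part of this proposal.

```python
import contextlib
import sys


@contextlib.contextmanager
def step(name, out=sys.stdout):
    """Emit annotations marking the start (and any failure) of a build step."""
    # Assumption: Buildbot's annotated runner parses these marker lines.
    out.write(f"@@@BUILD_STEP {name}@@@\n")
    try:
        yield
    except Exception:
        # Mark the step as failed, then let the error propagate.
        out.write("@@@STEP_FAILURE@@@\n")
        raise
    finally:
        out.flush()


def run_build(out=sys.stdout):
    # Hypothetical step names; a real script would launch the build here.
    with step("launch-pod", out):
        out.write("launching build pod\n")
    with step("wait-for-results", out):
        out.write("build finished\n")


if __name__ == "__main__":
    run_build()
```

Because the logic lives in the script rather than in the master's configuration, changing what a step does only requires updating the script.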
|
||||||
|
|
||||||
| ### Build Distribution | ||||||
|
|
||||||
| We want to be able to take advantage of the autoscaling functionality of the | ||||||
| new cluster to efficiently utilize resources. To do this, we plan on having the | ||||||
| AnnotatedBuilder script launch builds as Kubernetes pods. Kubernetes can then | ||||||
| assign the builds to nodes and autoscale through the same mechanism that the | ||||||
| GitHub premerge jobs use. This lets us quickly | ||||||
| process builds at peak times and not pay for extra capacity when commit | ||||||
| traffic is quiet. This is essential for ensuring our resource use is | ||||||
| efficient while still providing fast feedback. | ||||||
|
||||||
|
|
||||||
| Using the Kubernetes API inside of a Python script to launch builds does add | ||||||
|
||||||
| some complexity. However, we do not believe we need too much added | ||||||
| complexity to achieve our goal here and this allows for vastly more efficient | ||||||
| resource usage. | ||||||
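As a rough sense of how much complexity launching a pod adds, the sketch below builds a Pod manifest for a single build. Everything specific in it is an assumption made for illustration: the `build_pod_manifest` helper, the pod naming scheme, the `postcommit` namespace, the image name, the resource requests, and the way the script is invoked.

```python
import json


def build_pod_manifest(commit_sha, image, namespace="postcommit"):
    """Return a Kubernetes Pod manifest (as a dict) for one build."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            # Hypothetical naming scheme: one pod per commit.
            "name": f"postcommit-{commit_sha[:12]}",
            "namespace": namespace,
            "labels": {"app": "postcommit-build"},
        },
        "spec": {
            # The pod should run to completion once and not be restarted.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "build",
                    "image": image,
                    # Placeholder invocation; real path and arguments differ.
                    "command": ["bash", "monolithic-linux.sh"],
                    # Placeholder requests; the scheduler uses requests to
                    # decide whether a pod fits on an existing node.
                    "resources": {"requests": {"cpu": "32", "memory": "64Gi"}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_pod_manifest("0123456789abcdef", "example.com/ci:latest")
    print(json.dumps(manifest, indent=2))
```

With the official Kubernetes Python client, a manifest like this could be submitted with `kubernetes.client.CoreV1Api().create_namespaced_pod(namespace=..., body=manifest)` after loading cluster credentials; the script would then wait for the pod to finish and report its status.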
|
||||||
|
|
||||||
| ### Testing Configuration | ||||||
|
|
||||||
| The testing configuration will be as close to the premerge configuration as | ||||||
| possible. We will be running all tests inside the same container with the | ||||||
|
||||||
| same scripts (the `monolithic-linux.sh` and `monolithic-windows.sh` scripts). | ||||||
| However, there will be one main difference between the premerge and postcommit | ||||||
| configuration. In the postcommit configuration we propose testing all projects | ||||||
| on every commit rather than only testing the projects that themselves changed | ||||||
|
||||||
| or had dependencies that changed. We propose this for two main reasons. | ||||||
| Firstly, Buildbot does not have support for heterogeneous build configurations. | ||||||
| This means that testing a different set of projects in a single build | ||||||
| configuration depending upon what files changed could easily produce many | ||||||
| more notifications if certain configurations were failing and some were | ||||||
| passing, which defeats the point of using Buildbot in the first place. For | ||||||
|
||||||
| example, if there is an MLIR change that fails, an unrelated clang-tidy change | ||||||
| that passes all tests that lands afterwards, and then another MLIR change, a | ||||||
| notification will also be sent out on the second MLIR change because the | ||||||
| clang-tidy change turned the build back to green. Secondly, we explicitly do not | ||||||
| test certain projects even though their dependencies change, and while we do | ||||||
|
||||||
| this because we suspect interactions resulting in test failures would be quite | ||||||
| rare, it is possible, and having a postcommit configuration catch these rare | ||||||
| failures would be useful. | ||||||
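The MLIR/clang-tidy example above can be sketched as a small simulation. It uses a simplified model, which is an assumption rather than a description of Buildbot's actual notifier: a notification is sent whenever a failing build follows a passing one. The function names and the commit list are hypothetical.

```python
def count_failure_notifications(build_results):
    """Count builds that fail right after a passing build (or as the first build)."""
    notifications = 0
    previous_passed = True
    for passed in build_results:
        if previous_passed and not passed:
            notifications += 1
        previous_passed = passed
    return notifications


def heterogeneous_results(commits):
    """Each build only tests the project touched by its commit."""
    return [passed for _project, passed in commits]


def homogeneous_results(commits):
    """Each build tests every project, so a broken project keeps the build red."""
    state = {}
    results = []
    for project, passed in commits:
        state[project] = passed
        results.append(all(state.values()))
    return results


# (project touched, whether that project's tests pass at that commit)
commits = [("mlir", False), ("clang-tidy", True), ("mlir", False)]

if __name__ == "__main__":
    print(count_failure_notifications(heterogeneous_results(commits)))
    print(count_failure_notifications(homogeneous_results(commits)))
```

Under this model the heterogeneous configuration notifies twice for the same underlying MLIR breakage, while testing all projects on every commit notifies once.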
|
|
||||||
| ### Data Storage | ||||||
|
|
||||||
| The hosted Buildbot master instance at [lab.llvm.org](https://lab.llvm.org) | ||||||
| contains results for all recent postcommit runs. We plan on querying the results | ||||||
|
||||||
| from the Buildbot master because they are already available and that is where | ||||||
| they will natively be reported after the infrastructure is set up. Buildbot | ||||||
| supports a [REST API](https://docs.buildbot.net/latest/developer/rest.html) that | ||||||
| would allow for easily querying the state of a commit in `main`. | ||||||
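A query against the REST API might look roughly like the sketch below. The `/api/v2/builders/.../builds` endpoint, the `complete`, `order`, and `limit` parameters, and the numeric result codes come from the Buildbot documentation; the base URL path under lab.llvm.org and the builder name are assumptions, and mapping a build back to a specific commit would need additional fields that this sketch does not request.

```python
import json
from urllib.parse import urlencode

# Assumption: where the REST API is served for the LLVM Buildbot instance.
BASE_URL = "https://lab.llvm.org/buildbot/api/v2"

# Buildbot's numeric result codes.
RESULT_NAMES = {
    0: "success",
    1: "warnings",
    2: "failure",
    3: "skipped",
    4: "exception",
    5: "retry",
    6: "cancelled",
}


def recent_builds_url(builder, limit=10):
    """URL listing the most recent completed builds of one builder."""
    query = urlencode({"complete": "true", "order": "-number", "limit": limit})
    return f"{BASE_URL}/builders/{builder}/builds?{query}"


def summarize_builds(payload):
    """Turn a decoded JSON response into (build number, result name) pairs."""
    return [
        (build["number"], RESULT_NAMES.get(build["results"], "unknown"))
        for build in payload["builds"]
    ]


if __name__ == "__main__":
    # Hypothetical builder name; no network request is made here.
    print(recent_builds_url("example-postcommit-builder", limit=5))
    sample = json.loads('{"builds": [{"number": 7, "results": 0}]}')
    print(summarize_builds(sample))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the decoded JSON to `summarize_builds` would give the recent state of that builder.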
|
|
||||||
| For the proposed premerge advisor that tells the user what tests/build failures | ||||||
|
||||||
| For the proposed premerge advisor that tells the user what tests/build failures | |
| In the future, we may implement a "premerge advisor" that tells the user what tests/build failures |