Automatically rollback after persistent crashlooping #4328
Open
nemacysts wants to merge 1 commit into
If we're crashing consistently, something has probably gone wrong and we don't want to keep trying to deploy in the interest of not wasting compute on broken code :) (and also, so that we don't leave a timebomb after the deployment times out if no one is watching: should this happen, the next deployment would skip the broken deploy group and impact prod)
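A minimal sketch of the decision described above, not the code in this PR: roll back only after several consecutive probes see the new version crashlooping, and only when the feature is enabled. `CrashloopTracker`, `should_rollback`, and `threshold` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CrashloopTracker:
    """Counts consecutive probes in which the new version looked crashlooped."""

    threshold: int  # hypothetical: crashlooping probes in a row before giving up
    consecutive_failures: int = 0

    def record_probe(self, is_crashlooping: bool) -> bool:
        """Record one probe result; return True once a rollback is warranted."""
        if is_crashlooping:
            self.consecutive_failures += 1
        else:
            # A healthy probe resets the streak, so one transient crash
            # does not count as "persistent" crashlooping.
            self.consecutive_failures = 0
        return self.consecutive_failures >= self.threshold


def should_rollback(probes: Iterable[bool], threshold: int, enabled: bool) -> bool:
    """Return True if the probe history justifies an automatic rollback."""
    if not enabled:
        # With the feature off we only diagnose and never roll back.
        return False
    tracker = CrashloopTracker(threshold=threshold)
    return any(tracker.record_probe(probe) for probe in probes)
```

Resetting the streak on a healthy probe is one possible reading of "persistent"; a real implementation might instead count failures within a time window.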
nemacysts
commented
May 28, 2026
Comment on lines +1651 to +1652
# XXX: might be worth extracting the status call so that we don't have to return a value from here
# to prevent making multiple potentially expensive status calls
for context: on y-m, the per-instance status endpoint seems to take 2-10+ seconds per-instance :(
(but i also didn't want to refactor too much at once)
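For reference, the extraction floated in the XXX comment could look roughly like this: fetch the status once and hand the same object to both the diagnosis and the crashloop check. This is only a sketch. `check_instance`, `fetch_status`, `diagnose`, `is_crashlooping`, and the `restart_count` field are hypothetical stand-ins, not existing PaaSTA functions or fields.

```python
from typing import Any, Callable, Dict, List, Tuple

Status = Dict[str, Any]


def check_instance(
    fetch_status: Callable[[], Status],
    diagnose: Callable[[Status], str],
    is_crashlooping: Callable[[Status], bool],
) -> Tuple[str, bool]:
    """Run diagnosis and crashloop detection off a single status fetch."""
    status = fetch_status()  # the only (potentially slow) status call
    return diagnose(status), is_crashlooping(status)


# Tiny demonstration with a fake fetcher that records how often it is called.
fetch_calls: List[int] = []


def fake_fetch() -> Status:
    fetch_calls.append(1)
    return {"restart_count": 7}  # hypothetical status payload


message, crashlooping = check_instance(
    fake_fetch,
    diagnose=lambda s: f"restarts: {s['restart_count']}",
    is_crashlooping=lambda s: s["restart_count"] >= 5,
)
```

Passing the consumers in as callables keeps the sketch independent of how the status is actually retrieved.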
Comment on lines +726 to +727
# our large monoliths, will the PaaSTA API always return a response fast enough for us
# to actually have probed for failures N times?
answer: atm, this is gonna be pretty painful for y-m given the # of instances and how long it takes to get a single instance's status
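To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes instance statuses are fetched one after another and that each probe is followed by a fixed wait, neither of which is confirmed here. `probes_before_timeout` and the instance count and timeout in the example are hypothetical.

```python
def probes_before_timeout(
    sweep_s: float,
    interval_s: float,
    first_diagnosis_s: float,
    timeout_s: float,
) -> int:
    """Count how many full status sweeps finish before the deploy times out.

    sweep_s: time to fetch the status of every instance once
    interval_s: wait between the end of one sweep and the start of the next
    first_diagnosis_s: wait before the first sweep starts
    timeout_s: total time budget for the deployment
    """
    if sweep_s + interval_s <= 0:
        raise ValueError("sweep_s + interval_s must be positive")
    elapsed = first_diagnosis_s
    completed = 0
    while elapsed + sweep_s <= timeout_s:
        elapsed += sweep_s  # one full pass over all instances
        completed += 1
        elapsed += interval_s  # wait before probing again
    return completed


# Hypothetical monolith: 50 instances at 5 s per status call is a 250 s sweep.
slow = probes_before_timeout(
    sweep_s=50 * 5, interval_s=10, first_diagnosis_s=10, timeout_s=600
)
# A single-instance service at 2 s per status call, same budget.
fast = probes_before_timeout(
    sweep_s=1 * 2, interval_s=10, first_diagnosis_s=10, timeout_s=600
)
```

Under these made-up numbers the monolith completes only 2 sweeps in 600 seconds while the small service completes 50, so any "N consecutive failures" threshold above 2 could never fire for the monolith.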
Comment on lines +713 to +715
self.crashloop_auto_rollback_enabled = (
    load_system_paasta_config().get_enable_crashloop_auto_rollback()
)
assuming nothing blows up, this should be deleted pretty quickly since this really only makes sense for jenkins
NOTE: The tests are all claude-authored (which is great 'cause mocking some of the paasta api bits would have been kinda annoying by hand atm :p)

that said, I've also done some manual testing in the paasta playground to ensure that the tests aren't providing me a false sense of security:

(note: `cb9e0385` is a "good" SHA and an actual commit in the compute-infra-test-service repo, and `c1413ea9` only exists on my devbox and is an image that always runs `exit 1`)

with crashloop rollback enabled and `--time-before-first-diagnosis 10 --diagnosis-interval 10` set when marking `c1413ea9`:

and then same thing with crashloop rollback disabled (and the m-f-d wrapped with a `timeout 90`):

none of the rollback messages showed up and the command was killed after 90 seconds, but i still saw the diagnosis output show up