Automatically rollback after persistent crashlooping #4328

Open
nemacysts wants to merge 1 commit into master from auto-rollback-on-crashloop
@nemacysts

Member

If we're crashing consistently, something has probably gone wrong and we don't want to keep trying to deploy in the interest of not wasting compute on broken code :)

(and also, so that we don't leave a timebomb after the deployment times out if no one is watching: should this happen, the next deployment would skip the broken deploy group and impact prod)

NOTE: The tests are all Claude-authored (which is great 'cause mocking some of the paasta api bits would have been kinda annoying by hand atm :p)

that said, I've also done some manual testing in the paasta playground to ensure that the tests aren't providing me a false sense of security:

(note: cb9e0385 is a "good" SHA and an actual commit in the compute-infra-test-service repo, and c1413ea9 only exists on my devbox and is an image that always runs exit 1)

with crashloop rollback enabled and --time-before-first-diagnosis 10 --diagnosis-interval 10 set when marking c1413ea9:

Would update the slack thread with: Marked `c1413ea9` for prod.main.
Would update the slack thread with: Will automatically roll back in 30 seconds, (at 12:07:21)! Click "Disable auto rollbacks :close_eyes_monkey:" to cancel this!
Would update the slack thread with: Time's up, will now automatically roll back.
Would update the slack thread with: Marked `cb9e0385` for prod.main.

and then the same thing with crashloop rollback disabled (and the m-f-d wrapped with a timeout 90):
none of the rollback messages showed up and the command was killed after 90 seconds, but I still saw the diagnosis output show up
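
For readers skimming the thread, here is a minimal sketch of the kind of probe loop that flags like `--time-before-first-diagnosis` and `--diagnosis-interval` suggest. It is not the code in this PR: `watch_for_crashloop`, `is_crashlooping`, `threshold`, and `max_probes` are all hypothetical names, and the "N consecutive crashing probes" rule is an assumption made for illustration.

```python
import time
from typing import Callable


def watch_for_crashloop(
    is_crashlooping: Callable[[], bool],
    time_before_first_diagnosis: float,
    diagnosis_interval: float,
    threshold: int,
    max_probes: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `threshold` consecutive probes report a crashloop.

    Hypothetical helper: `is_crashlooping` stands in for whatever status
    check the deployment tool really performs.
    """
    # Give the new version a chance to start before the first probe.
    sleep(time_before_first_diagnosis)

    consecutive = 0
    for probe in range(max_probes):
        if is_crashlooping():
            consecutive += 1
            if consecutive >= threshold:
                return True  # persistent crashloop: caller should roll back
        else:
            # A healthy probe resets the streak, so a single blip
            # does not trigger a rollback.
            consecutive = 0

        # Wait between probes, but not after the final one.
        if probe < max_probes - 1:
            sleep(diagnosis_interval)

    return False
```

Injecting `sleep` keeps the loop testable without real waiting; a caller would only start the rollback countdown when this returns `True` and the feature is enabled.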

@nemacysts requested a review from a team as a code owner May 28, 2026 21:53
Comment on lines +1651 to +1652
# XXX: might be worth extracting the status call so that we don't have to return a value from here
# to prevent making multiple potentially expensive status calls

Member Author

for context: on y-m, the status endpoint seems to take 2-10+ seconds per instance :(

Member Author

(but i also didn't want to refactor too much at once)
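
One possible shape for the extraction mentioned in the XXX comment, as a rough sketch rather than a proposal for this diff: fetch the status once per poll and derive both answers from that single result. `PollResult`, `poll_once`, `fetch_status`, and the `"running"` / `"crashlooping"` state strings are all hypothetical and do not mirror the real PaaSTA API response.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PollResult:
    done: bool               # every instance reports the (assumed) healthy state
    crashlooping: List[str]  # names of instances that look stuck in a crashloop


def poll_once(fetch_status: Callable[[], Dict[str, str]]) -> PollResult:
    """Make exactly one status call and answer both questions from it.

    `fetch_status` is a hypothetical callable returning
    {instance_name: state}.
    """
    status = fetch_status()  # the only potentially slow call in this poll
    return PollResult(
        # Guard against an empty response counting as "done".
        done=bool(status) and all(s == "running" for s in status.values()),
        crashlooping=sorted(
            name for name, state in status.items() if state == "crashlooping"
        ),
    )
```

With something like this, the deploy-progress check and the crashloop check would share one response instead of one of them having to return a value to the other.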

Comment on lines +726 to +727
# our large monoliths, will the PaaSTA API always return a response fast enough for us
# to actually have probed for failures N times?

Member Author

answer: atm, this is gonna be pretty painful for y-m given the # of instances and how long it takes to get a single instance's status
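
To put rough numbers on that concern, here is a back-of-the-envelope sketch of how many full probes fit inside a deploy timeout when each instance's status is fetched separately. Everything here is an assumption for illustration: `probes_within` is a hypothetical helper, the model assumes status calls run in batches of `concurrency`, and the inputs in the usage note are made up rather than measured.

```python
import math


def probes_within(
    timeout_s: float,
    instances: int,
    latency_s: float,
    first_diagnosis_s: float,
    interval_s: float,
    concurrency: int = 1,
) -> int:
    """Estimate how many complete probes fit before the deploy times out.

    One probe = fetching status for every instance, `concurrency` at a time.
    """
    # Time for a single full probe across all instances.
    probe_cost = math.ceil(instances / concurrency) * latency_s
    remaining = timeout_s - first_diagnosis_s
    if remaining < probe_cost:
        return 0
    # After the first probe, each further probe costs one interval
    # of waiting plus another full round of status calls.
    return 1 + int((remaining - probe_cost) // (interval_s + probe_cost))
```

For example, with an illustrative 600 second timeout, 100 instances, 5 seconds per status call, and 10 second waits, `probes_within(600, 100, 5, 10, 10)` gives a single probe when calls are sequential, while `concurrency=10` raises that to 10. Under this model, a "crashed N times" rule only works if N is smaller than that count.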

Comment on lines +713 to +715
self.crashloop_auto_rollback_enabled = (
    load_system_paasta_config().get_enable_crashloop_auto_rollback()
)

Member Author

assuming nothing blows up, this should be deleted pretty quickly since this really only makes sense for jenkins
