Automatically rollback after persistent crashlooping by nemacysts · Pull Request #4328 · Yelp/paasta

nemacysts · 2026-05-28T21:53:50Z

If we're crashing consistently, something has probably gone wrong, and we don't want to keep trying to deploy: that just wastes compute on broken code :)

(and also, so that we don't leave a timebomb after the deployment times out if no one is watching: should this happen, the next deployment would skip the broken deploy group and impact prod)
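
For illustration only, here is a minimal sketch of what "crashing consistently" could mean in code. It is not taken from the diff: the `CrashloopTracker` name, the `threshold` parameter, and the "reset on any healthy probe" policy are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class CrashloopTracker:
    """Counts consecutive diagnoses that reported a crashloop (hypothetical helper)."""

    threshold: int  # assumption: how many consecutive bad probes count as "persistent"
    consecutive_failures: int = 0

    def record(self, is_crashlooping: bool) -> bool:
        """Record one diagnosis result; return True once a rollback should trigger."""
        if is_crashlooping:
            self.consecutive_failures += 1
        else:
            # a single healthy probe resets the streak, so only persistent crashes count
            self.consecutive_failures = 0
        return self.consecutive_failures >= self.threshold
```

With `threshold=3`, three bad probes in a row would trigger a rollback, while bad, bad, good, bad would not.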

NOTE: The tests are all claude-authored (which is great 'cause mocking some of the paasta api bits would have been kinda annoying by hand atm :p)

that said, I've also done some manual testing in the paasta playground to ensure that the tests aren't providing me a false sense of security:

(note: cb9e0385 is a "good" SHA and an actual commit in the compute-infra-test-service repo, and c1413ea9 only exists on my devbox and is an image that always runs exit 1)

with crashloop rollback enabled and --time-before-first-diagnosis 10 --diagnosis-interval 10 set when marking c1413ea9:

Would update the slack thread with: Marked `c1413ea9` for prod.main.
Would update the slack thread with: Will automatically roll back in 30 seconds, (at 12:07:21)! Click "Disable auto rollbacks :close_eyes_monkey:" to cancel this!
Would update the slack thread with: Time's up, will now automatically roll back.
Would update the slack thread with: Marked `cb9e0385` for prod.main.

and then the same thing with crashloop rollback disabled (and the m-f-d wrapped with a timeout 90):
none of the rollback messages showed up and the command was killed after 90 seconds, but i still saw the diagnosis output show up


nemacysts · 2026-05-28T21:56:15Z

+            # XXX: might be worth extracting the status call so that we don't have to return a value from here
+            # to prevent making multiple potentially expensive status calls


for context: on y-m, the status endpoint seems to take 2-10+ seconds per instance :(

(but i also didn't want to refactor too much at once)
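
A rough sketch of the refactor the XXX comment hints at, assuming the status is fetched once per poll cycle and both the "is it deployed?" and "is it crashlooping?" answers are derived from that one result. The `poll_once` name, the `Status` shape, and the state strings are hypothetical and not the PaaSTA API's actual response format.

```python
from typing import Callable, Dict, Tuple

Status = Dict[str, str]  # hypothetical shape: instance name -> state string


def poll_once(fetch_status: Callable[[], Status]) -> Tuple[bool, bool]:
    """Fetch status a single time and derive both answers from that one result."""
    status = fetch_status()  # the only (potentially slow) call in this poll cycle
    deployed = all(state == "running" for state in status.values())
    crashlooping = any(state == "crashlooping" for state in status.values())
    return deployed, crashlooping
```

Passing the fetcher in as a callable keeps the expensive call in one place and makes it easy to swap in a fake in tests.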

nemacysts · 2026-05-28T21:59:00Z

+        # our large monoliths, will the PaaSTA API always return a response fast enough for us
+        # to actually have probed for failures N times?


answer: atm, this is gonna be pretty painful for y-m given the # of instances and how long it takes to get a single instance's status
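
To put rough numbers on that, a back-of-the-envelope sketch. It assumes status calls happen serially (one per instance) and that each probe is followed by a fixed sleep; neither is confirmed by the diff, and the function and parameter names are made up for illustration.

```python
import math


def probes_possible(
    deploy_timeout_s: float,
    instance_count: int,
    seconds_per_status_call: float,
    diagnosis_interval_s: float,
) -> int:
    """Estimate how many full crashloop probes fit inside a deploy timeout.

    Assumes status is fetched serially, one call per instance, and that each
    probe is followed by a fixed sleep of diagnosis_interval_s.
    """
    cycle_s = instance_count * seconds_per_status_call + diagnosis_interval_s
    return math.floor(deploy_timeout_s / cycle_s)
```

Under these assumptions, a hypothetical 50-instance service at 10 seconds per call only fits a single probe into a 600-second timeout, so "N consecutive failures" may never be reached for large N.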

nemacysts · 2026-05-28T21:59:37Z

+        self.crashloop_auto_rollback_enabled = (
+            load_system_paasta_config().get_enable_crashloop_auto_rollback()
+        )


assuming nothing blows up, this should be deleted pretty quickly since this really only makes sense for jenkins
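
A minimal sketch of how such a flag might gate the rollback while leaving diagnosis untouched, which would match the manual test above (diagnosis output still shows up with the flag off, but no rollback happens). `FakeSystemConfig` and `should_roll_back` are hypothetical stand-ins; only the `get_enable_crashloop_auto_rollback()` getter name comes from the diff.

```python
class FakeSystemConfig:
    """Stand-in for the system config object (hypothetical, for illustration)."""

    def __init__(self, enabled: bool) -> None:
        self._enabled = enabled

    def get_enable_crashloop_auto_rollback(self) -> bool:
        return self._enabled


def should_roll_back(crashloop_detected: bool, config: FakeSystemConfig) -> bool:
    """Only roll back when the flag is on AND a crashloop was actually diagnosed."""
    return config.get_enable_crashloop_auto_rollback() and crashloop_detected
```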

nemacysts requested a review from a team as a code owner May 28, 2026 21:53


nemacysts requested review from EvanKrall, cuza, ilkinmammadzada, jfongatyelp, mbrankin-art and sidtuladhar May 28, 2026 21:59

nemacysts wants to merge 1 commit into master from auto-rollback-on-crashloop
