Fail mark-for-deployment when re-deploying same version without --wait-for-deployment#4307

Draft
cuza wants to merge 8 commits into master from u/cuza/PAASTA-18862

Conversation

cuza commented May 7, 2026

Member

Prevent deployment when attempting to redeploy the same version without the --wait-for-deployment flag.

)
print(deployment_version)
print("Continuing anyway.")
if not args.block:

Member

i think we'd want to flip this, no?
args.block is True with --wait-for-deployment, and we want to wait until the deploy group is healthy in that case rather than forging ahead?
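
A minimal sketch of the flipped condition, assuming (as the comment above says) that `args.block` is True when `--wait-for-deployment` is passed. `handle_same_version` and `wait_until_healthy` are hypothetical names used only for illustration; they are not taken from the PR.

```python
from typing import Callable


def handle_same_version(block: bool, wait_until_healthy: Callable[[], int]) -> int:
    """Decide what to do when the requested version is already marked for deployment.

    `block` stands in for args.block (True when --wait-for-deployment is passed).
    `wait_until_healthy` is a hypothetical callback that returns an exit code.
    """
    if block:
        # --wait-for-deployment was requested: wait on the deploy group's health
        # instead of forging ahead.
        return wait_until_healthy()
    # No --wait-for-deployment: nothing extra to check, continue as before.
    print("Continuing anyway.")
    return 0
```

The point is only which branch does the waiting: the health check lives under `if block:`, not under `if not block:`.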

f"what is set to be deployed in deploy group {deploy_group}:"
)
print(f" {deployment_version}")
print("Checking if all instances are healthy before proceeding...")

Member

i think we might also want to do something slightly different here - i think we probably want to then essentially pretend that we're doing a normal --wait-for-deployment bounce and poll until the deploy group is empty (and then timeout after whatever we have the usual timeout set to)

Member

i.e., we want to treat an unhealthy deploy group as if it was previously on another version and wait until the "new" (really the same version, we're just re-polling again) version is healthy before continuing

(and if --wait-for-deployment is not set, then don't do anything different: just yolo as usual)
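
The poll-then-time-out behaviour described above could look roughly like this. It is a sketch under assumptions: `poll_until_healthy`, `is_healthy`, `timeout_s` and `interval_s` are hypothetical names, and the real timeout and polling interval would come from whatever mark-for-deployment already uses.

```python
import time
from typing import Callable


def poll_until_healthy(
    is_healthy: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-check health until it passes or the timeout elapses.

    Returns 0 once `is_healthy()` is True, or 1 if the deadline passes first.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_healthy():
            return 0
        if clock() >= deadline:
            return 1
        sleep(interval_s)
```

Because the health check runs before the deadline check, a deploy group that is already healthy returns immediately, even with a timeout of zero.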

Comment on lines +504 to +528
instance_health = [
    check_if_instance_is_done(
        service=service,
        instance=instance_config.get_instance(),
        cluster=cluster,
        version=deployment_version,
        instance_config=instance_config,
    )
    for cluster, instance_configs in instance_configs_per_cluster.items()
    for instance_config in instance_configs
]
all_healthy = all(instance_health)
if all_healthy:
    print(
        "All instances are healthy at this version. "
        "Safe to proceed to the next deploy group."
    )
    return 0
else:
    print(
        "Error: Not all instances are healthy for this version. "
        "A previous deploy may have failed or timed out. "
        "Not safe to proceed to the next deploy group."
    )
    return 1

Member

might be worth essentially doing what a normal m-f-d does and running this logic in a loop until all the instances are healthy (or i guess some percentage are healthy - i think we only require bounce_margin_factor % of instances to be healthy to proceed?) rather than doing the check once and exiting

i think doing the logic in a loop would probably also allow us to remove the special-casing here, since if everything is healthy we'd exit that loop immediately, and if not, we'd keep rechecking
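
One way the "some percentage are healthy" check might be expressed, as a sketch only. `enough_instances_healthy` and `required_fraction` are hypothetical names; whether `bounce_margin_factor` is actually the threshold a normal mark-for-deployment uses is the open question raised above, not something this sketch confirms.

```python
from typing import Sequence


def enough_instances_healthy(
    instance_health: Sequence[bool],
    required_fraction: float = 1.0,
) -> bool:
    """Return True when at least `required_fraction` of instances are healthy.

    `instance_health` mirrors the list built in the diff above (one bool per
    instance). `required_fraction` is a hypothetical knob in the range 0.0-1.0;
    the default of 1.0 matches the current all(instance_health) behaviour.
    """
    if not instance_health:
        # Assumption: with no instances there is nothing to wait on.
        return True
    healthy = sum(1 for ok in instance_health if ok)
    return healthy >= required_fraction * len(instance_health)
```

Calling this inside the polling loop on each pass would replace the single `all(instance_health)` check: a fully healthy deploy group passes on the first iteration, so no separate special case is needed.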
