Bugfix 8206 #8914

internet-diglett · 2025-08-26T19:06:39Z

This PR addresses two issues that impact users in circumstances where a switch zone is unavailable:

When resolving which switch slot is managed by which switch zone, we would continuously retry whenever there was a communication error. This causes RPWs to stop running / Sagas to get stuck whenever a switch zone becomes unavailable but still has entries in DNS.
In the dpd_ensure node of the instance_start saga, we were bailing out if we encountered any errors while notifying dpd of NAT changes. This means that if one of our switch zones is unavailable, we no longer allow users to start instances, which is probably a bit too strict.

This PR makes the following adjustments to the switch slot resolution behavior:

Log an error instead of a warning whenever a DNS entry exists for a switch zone service but communication fails
Return the mappings of the Slot -> Address for the switch zones that we were successfully able to resolve
Let the caller decide whether or not they should retry or proceed. After checking all of the current call sites, it appears that the logic already accommodates scenarios where switch zone mappings are missing.
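
As a rough illustration of the resolution behavior described above (not the actual Nexus code), the sketch below probes every switch zone address found in DNS, logs an error for any that cannot be reached, and returns only the mappings that resolved. `SwitchSlot`, `resolve_switch_slots`, and the `query_slot` probe are hypothetical names introduced for this example, and `eprintln!` stands in for the real logger.

```rust
use std::collections::HashMap;
use std::net::Ipv6Addr;

/// Hypothetical stand-in for a switch slot identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum SwitchSlot {
    Switch0,
    Switch1,
}

/// Ask each switch zone address found in DNS which slot it manages.
///
/// `query_slot` is a hypothetical probe; an `Err` models a communication
/// failure with a zone that still has a DNS entry. Failures are logged and
/// skipped rather than retried, so the result may be a partial mapping.
pub fn resolve_switch_slots<F>(
    dns_addrs: &[Ipv6Addr],
    mut query_slot: F,
) -> HashMap<SwitchSlot, Ipv6Addr>
where
    F: FnMut(&Ipv6Addr) -> Result<SwitchSlot, String>,
{
    let mut mappings = HashMap::new();
    for addr in dns_addrs {
        match query_slot(addr) {
            Ok(slot) => {
                mappings.insert(slot, *addr);
            }
            Err(e) => {
                // Logged as an error (not a warning): DNS says the zone
                // exists, but we could not talk to it.
                eprintln!("error: switch zone {addr} is in DNS but unreachable: {e}");
            }
        }
    }
    // The caller inspects the map and decides whether to retry or proceed.
    mappings
}
```

A caller that needs both switches can check for a missing slot in the returned map and retry on its own schedule, while a caller that can make progress with one switch simply proceeds.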

This PR makes the following changes to the behavior of the NAT configuration steps for instance-related sagas:

Do not bail if we fail to notify a dendrite service of NAT changes. Log a warning and rely on the NAT RPW to catch us back up whenever the switch zone becomes available again. Still return a Result from the notify function so that callers can decide if they want to make a subsequent decision based on the error.
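
The following is a minimal sketch of that pattern under stated assumptions, not the code in this PR. The notify function attempts every switch, logs a warning on failure, and still returns a `Result`; the saga step chooses to tolerate the error. `NotifyError`, `notify_dendrite_nat`, `saga_nat_step`, and the `notify` callback are hypothetical names, and `eprintln!` stands in for the real logger.

```rust
/// Hypothetical error type listing the switches that could not be notified.
#[derive(Debug, PartialEq)]
pub struct NotifyError {
    pub failed: Vec<String>,
}

/// Notify every switch of a NAT change. A failure on one switch does not
/// stop the loop; failures are collected and returned so the caller can
/// decide what to do with them.
pub fn notify_dendrite_nat<F>(switches: &[&str], mut notify: F) -> Result<(), NotifyError>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut failed = Vec::new();
    for &sw in switches {
        if let Err(e) = notify(sw) {
            eprintln!("warning: failed to notify dendrite on {sw}: {e}");
            failed.push(sw.to_string());
        }
    }
    if failed.is_empty() {
        Ok(())
    } else {
        Err(NotifyError { failed })
    }
}

/// A saga step that treats notification as best effort: it logs the
/// failure and succeeds anyway, relying on the NAT RPW to catch up later.
pub fn saga_nat_step<F>(switches: &[&str], notify: F) -> Result<(), NotifyError>
where
    F: FnMut(&str) -> Result<(), String>,
{
    if let Err(e) = notify_dendrite_nat(switches, notify) {
        eprintln!(
            "warning: NAT entry creation may be delayed for {:?}",
            e.failed
        );
    }
    Ok(())
}
```

Keeping the `Result` on the notify function means a stricter caller can still propagate the error instead of swallowing it.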

When resolving which switch slot is managed by which switch zone, we would continuously retry whenever there was a communication error. This causes RPWs to stop running / Sagas to get stuck whenever a switch zone becomes unavailable but still has entries in DNS. These changes make the following adjustments to the resolution behavior:

* Log an error instead of a warning whenever a DNS entry exists for a switch zone service but communication fails
* Return the mappings of the Slot -> Address for the switch zones that we were successfully able to resolve
* Let the caller decide whether or not they should retry or proceed. After checking all of the current call sites, it appears that the logic already accommodates scenarios where switch zone mappings are missing.

Previously in `dpd_ensure` we were bailing out if we encountered any errors while notifying dpd of NAT changes. This means that if one of our switch zones is unavailable, we no longer allow users to start instances, which is probably a bit too strict. Also, since the dendrite daemons have a NAT reconciliation RPW, we can rely on it to catch us up instead of hard failing whenever there is a failure to notify them of updates.

hawkw · 2025-08-26T23:57:02Z

nexus/src/app/instance_network.rs

+    {
+        warn!(
+            log,
+            "error encountered when notifying dendrite, NAT entry creation may be delayed";


nit: mind wrapping this line?

internet-diglett added 2 commits August 26, 2025 18:03

hawkw reviewed Aug 26, 2025
