@internet-diglett internet-diglett commented Aug 26, 2025

This PR addresses two issues that affect users when a switch zone is unavailable:

  1. When resolving which switch slot is managed by which switch zone, we would continuously retry whenever there was a communication error. This causes RPWs to stop running / Sagas to get stuck whenever a switch zone becomes unavailable but still has entries in DNS.

  2. In the dpd_ensure node of the instance_start saga, we were bailing out if we encountered any errors while notifying dpd of NAT changes. This means that if one of our switch zones is unavailable we no longer allow users to start instances, which is probably a bit too strict.

This PR makes the following adjustments to the switch slot resolution behavior:

  • Log an error instead of a warning whenever a DNS entry exists for a switch zone service but communication fails
  • Return the mappings of the Slot -> Address for the switch zones that we were successfully able to resolve
  • Let the caller decide whether or not they should retry or proceed. After checking all of the current call sites, it appears that the logic already accommodates scenarios where switch zone mappings are missing.
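
As a rough illustration of the resolution behavior described above, here is a minimal sketch using only the standard library. The names `SwitchSlot`, `resolve_switch_slots`, and the `query_slot` closure are hypothetical stand-ins, not the actual types or functions touched by this PR, and `eprintln!` stands in for the real error-level logging.

```rust
use std::collections::HashMap;
use std::net::Ipv6Addr;

/// Hypothetical stand-in for the switch slot identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum SwitchSlot {
    Switch0,
    Switch1,
}

/// Resolve Slot -> Address for every switch zone address found in DNS.
///
/// `query_slot` is a hypothetical callback that asks the switch zone at the
/// given address which slot it manages. Zones that fail to respond are
/// logged and skipped instead of being retried forever, so the returned map
/// may be missing entries.
fn resolve_switch_slots<F>(
    zone_addrs: &[Ipv6Addr],
    query_slot: F,
) -> HashMap<SwitchSlot, Ipv6Addr>
where
    F: Fn(&Ipv6Addr) -> Result<SwitchSlot, String>,
{
    let mut mappings = HashMap::new();
    for addr in zone_addrs {
        match query_slot(addr) {
            Ok(slot) => {
                mappings.insert(slot, *addr);
            }
            Err(e) => {
                // Stand-in for an error-level log: the zone has a DNS entry
                // but we could not communicate with it.
                eprintln!(
                    "error: switch zone {} is in DNS but unreachable: {}",
                    addr, e
                );
            }
        }
    }
    // Partial result: the caller decides whether to retry or proceed.
    mappings
}
```

A caller that needs both switches can check for the missing slot in the returned map and retry on its own schedule, while a caller that can make progress with one switch simply proceeds with what it got.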

This PR makes the following changes to the behavior of the NAT configuration steps for instance-related sagas:

  • Do not bail if we fail to notify a dendrite service of NAT changes. Log a warning and rely on the NAT RPW to catch us back up whenever the switch zone becomes available again. Still return a Result from the notify function so that callers can decide if they want to make a subsequent decision based on the error.
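
The shape of that change can be sketched as follows. This is an assumption-laden illustration, not the code in this PR: `notify_dendrite_nat`, `dpd_ensure_step`, and the `notify` closure are hypothetical names, switch zones are represented as plain strings, and `eprintln!` stands in for the `warn!` log call.

```rust
/// Notify every dendrite (dpd) instance of a NAT change.
///
/// `notify` is a hypothetical callback that performs the request against one
/// switch zone. Failures are logged and collected rather than aborting the
/// loop, and are still returned so callers can act on them if they want to.
fn notify_dendrite_nat<F>(
    switch_zones: &[&str],
    notify: F,
) -> Result<(), Vec<String>>
where
    F: Fn(&str) -> Result<(), String>,
{
    let mut failures = Vec::new();
    for &zone in switch_zones {
        if let Err(e) = notify(zone) {
            // Stand-in for a warning-level log.
            eprintln!(
                "warn: error encountered when notifying dendrite at {}: {}; \
                 NAT entry creation may be delayed",
                zone, e
            );
            failures.push(format!("{}: {}", zone, e));
        }
    }
    if failures.is_empty() {
        Ok(())
    } else {
        Err(failures)
    }
}

/// Sketch of a saga step that no longer fails when a notification fails.
/// It assumes a separate NAT reconciliation task catches the switch up later.
fn dpd_ensure_step<F>(switch_zones: &[&str], notify: F) -> Result<(), String>
where
    F: Fn(&str) -> Result<(), String>,
{
    if let Err(failures) = notify_dendrite_nat(switch_zones, notify) {
        eprintln!(
            "warn: {} dendrite notification(s) failed; relying on NAT RPW",
            failures.len()
        );
    }
    Ok(())
}
```

Keeping the `Result` on the notify function means a stricter caller could still choose to fail or retry, while the saga step deliberately swallows the error.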

Related

#8206
#6896

When resolving which switch slot is managed by
which switch zone, we would continuously retry
whenever there was a communication error. This
causes RPWs to stop running / Sagas to get stuck
whenever a switch zone becomes unavailable but
still has entries in DNS.

These changes make the following adjustments to
the resolution behavior:

* Log an error instead of a warning whenever
  a DNS entry exists for a switch zone service
  but communication fails
* Return the mappings of the Slot -> Address
  for the switch zones that we were successfully
  able to resolve
* Let the caller decide whether or not they
  should retry or proceed. After checking all
  of the current call sites, it appears that
  the logic already accommodates scenarios where
  switch zone mappings are missing.

Previously in `dpd_ensure` we were bailing out if we encountered
any errors while notifying dpd of NAT changes. This means that
in the event of one of our switch zones being unavailable
we would no longer allow users to start instances, which is
probably a bit too strict.

Also, since the dendrite daemons have a NAT reconciliation RPW, we can
rely on it to catch us up instead of hard failing whenever there is
a failure to notify them of updates.
{
warn!(
log,
"error encountered when notifying dendrite, NAT entry creation may be delayed";

nit: mind wrapping this line?
