
Enhancement Request: implement Retry with Backoff pattern for network I/O requests #137

@klmcwhirter

Description


Overview

Relying solely on network-online.target is problematic, as documented in several places.

The uupd.{timer,service} units rely on network-online.target. But that target does not mean what it sounds like: it does not guarantee that the network is ready for traffic. See https://systemd.io/NETWORK_ONLINE/.

I should mention that I use Wi-Fi exclusively, not wired Ethernet.

When opening the laptop lid (Wake from Sleep), I keep seeing errors like this:

Jan 15 04:57:11 nitro5 uupd[157219]: {"level":"ERROR","msg":"Hardware checks failed","error":"Network, returned error: network not online"}

Hence the need to implement the Retry with Backoff pattern in the uupd code itself.

Note that the Circuit Breaker pattern might also be beneficial for situations like Cloudflare introducing a change that broke connectivity to dependent assets.

Example Use Case

I have written software that is similarly scheduled by systemd, where the solution was to implement the Retry with Backoff pattern to prevent false-positive reporting.

Those systemd units also rely on network-online.target and, because my requirements allowed for it, also set the BindsTo=gnome-session.target property, as I was advised in a forum somewhere; although I was not able to make that advice work for me. I do realize that the uupd requirements do not allow for assuming an authenticated GNOME session.

Once I implemented the Retry with Backoff pattern in the rule (written in Python) that needs to use ssh reliably, I started to see log entries like the following, and resilience improved.

Feb 02 05:24:36 nitro5 upd_monitor.sh[390387]: 2026-02-02 05:24:36,796 - steps - WARNING - process_rule - Worker-04: <class '__mp_main__.ProcessNonZeroRetcodeError'>, Retrying in 3 seconds...
Feb 02 05:24:36 nitro5 upd_monitor.sh[390387]: <class '__mp_main__.ProcessNonZeroRetcodeError'>, Retrying in 6 seconds...
Feb 02 05:24:36 nitro5 upd_monitor.sh[390387]: 2026-02-02 05:24:36,796 - steps - DEBUG - process_rule - Worker-04: Processing pi-cluster-health ... done
Feb 02 05:24:36 nitro5 upd_monitor.sh[390326]: 2026-02-02 05:24:36,796 - steps - INFO - process - Received result from Worker-04
Feb 02 05:24:36 nitro5 upd_monitor.sh[390326]: 2026-02-02 05:24:36,831 - utils - DEBUG - _ - process done
Feb 02 05:24:36 nitro5 systemd[3708]: upd-indicator-monitor.service: Consumed 3.901s CPU time, 656.2M memory peak.

Questions

  1. Does this resonate with the team's sensibilities regarding industry best practices?
  2. Do you agree that the uupd golang code is the correct architectural location for implementing these kinds of reliability / resilience policies?
  3. Is the Wake from Sleep event something you want to officially support? As I mentioned elsewhere, using ujust update as a manual workaround has worked fine for me throughout 2025.

Disclaimer

Although I have quite a bit of experience in my day job guiding teams to implement reliability and resilience policies in their application architecture, I will not pretend to claim a similar level of familiarity here. That is, my experience tells me this is needed, but not how to implement it in a Universal Blue system component.

My hope is to encourage discussion, not to complain.

Current Version

$ sudo bootc status
● Booted image: ghcr.io/ublue-os/bluefin-dx-nvidia-open:stable
        Digest: sha256:43279a9dfad55057c3f977a33b5d0b64e3e659fb6565c371f449af190dfc20ca (amd64)
       Version: 43.20260127 (2026-01-27T01:52:09Z)

Related Issue

Note that part of the problem was resolved with #126.
