
Conversation


@joshua-holmes joshua-holmes commented Sep 3, 2025

Problem

As described in #17472, the package manager does not retry fetching when communication with the server fails in a way that could plausibly be resolved by retrying the fetch.

Solution

Automatically retry the fetch in these scenarios during http fetches:

  1. Connecting to the server fails
  2. Sending the payload to the server fails
  3. Server returns a non-200 status code

and these scenarios during git+http fetches:

  1. Discovering remote git server capabilities fails
  2. Creating a fetch stream fails
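
Roughly, each of these call sites gets wrapped in a retry loop. A minimal sketch of the shape, with hypothetical names (`fetchWithRetry` and `doFetch` are illustrative, not the actual code) and `retry_count`/`retry_delay_ms` mirroring the new `Fetch` fields:

    const std = @import("std");

    // Illustrative only: `doFetch` stands in for one of the failing
    // operations listed above (connect, send payload, etc.).
    fn fetchWithRetry(
        retry_count: u16,
        retry_delay_ms: u64,
        doFetch: *const fn () anyerror!void,
    ) anyerror!void {
        var attempt: u16 = 0;
        while (true) {
            doFetch() catch |err| {
                if (attempt >= retry_count) return err; // retries exhausted
                attempt += 1;
                std.Thread.sleep(retry_delay_ms * std.time.ns_per_ms);
                continue;
            };
            return;
        }
    }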

This PR does not expose the fetch config to the end user, for reasons covered in the next section. That is definitely worth discussing, but shipping this PR as is seems reasonable to me.

Major Considerations

  • I thought about how I could expose configuration of the retry behavior to the user (retry_count and retry_delay_ms). However, this proved difficult, so it is not part of this PR. Here are some thoughts I had, though:
    • I cannot expose config in the build.zig file, because (as far as I can tell) the build.zig file is not run until all the packages are fetched and ready to be built with the source code.
    • I do not want to include this config per dep (dependency), because how would we handle deps of deps? We don't want a library maintainer to determine how many times the end user refetches that library's deps; that's the end user's decision. Deps of deps could inherit the config, but then the end user isn't getting the control they are "promised" by having the config attached to each dep.
    • I do not want to include the config in the body of the manifest. It feels out of place, since that body isn't supposed to be about fetch config for dependencies; it's supposed to define the current package and point at its dependencies.
    • Maybe we allow config to be passed in the body of .dependencies, like below, so that .retry_count and other props apply to all dependencies while still living inside the .dependencies object. This is also easy to parse by checking the type of each prop: if it's a struct, it's a real dep; if it's anything else, it's config, not a real dep (see the sketch after this list). Fun idea, but it's not part of this PR and shouldn't be. Here is an example of the idea:
      .dependencies = .{
          .retry_count = 3,
          .retry_delay_ms = 1000,
          .zf = .{
              .url = "https://github.com/natecraddock/zf/archive/7aacbe6d155d64d15937ca95ca6c014905eb531f.tar.gz",
              .hash = "zf-0.10.3-OIRy8aiIAACLrBllz0zjxaH0aOe5oNm3KtEMyCntST-9",
              .lazy = true,
          },
          .vaxis = .{
              .url = "git+https://github.com/rockorager/libvaxis#1f41c121e8fc153d9ce8c6eb64b2bbab68ad7d23",
              .hash = "vaxis-0.1.0-BWNV_FUICQAFZnTCL11TUvnUr1Y0_ZdqtXHhd51d76Rn",
              .lazy = true,
          },
      },
  • It would be cool to show the retry progress like this after the first failure. But it might be worth extending std.Progress.Node first, so this is also not implemented in this PR:
    Compile Build Script
    └─ [1/2] Fetch Packages
       ├─ vaxis (retry 1/2)
       └─ zf
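
As a rough sketch of the type-based parsing idea from the .dependencies bullet above (purely hypothetical; nothing like this is in the PR), comptime reflection could split scalar config fields from struct-typed deps:

    const std = @import("std");

    // Hypothetical: scalar fields are fetch config, struct fields are real deps.
    const dependencies = .{
        .retry_count = 3,
        .retry_delay_ms = 1000,
        .zf = .{ .url = "https://example.com/zf.tar.gz", .lazy = true },
    };

    pub fn main() void {
        inline for (std.meta.fields(@TypeOf(dependencies))) |field| {
            if (@typeInfo(field.type) == .@"struct") {
                std.debug.print("dep: {s}\n", .{field.name});
            } else {
                std.debug.print("config: {s}\n", .{field.name});
            }
        }
    }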
    

Minor Considerations

  • I intentionally did not retry on failures that do not seem resolvable by simply refetching, e.g. while receiving the HTTP header; looking through the possible errors there, they mostly stem from invalid responses, redirects, or write failures.
  • There are some locations where memory is now freed that wasn't freed before. This is because, if the user configures Fetch to retry many times, I don't want each refetch of the network data to allocate without freeing, resulting in a memory leak (a sketch of this follows the list).
  • The Fetch.initResource() method needs to be refactored. But this PR is big enough, so it made sense to me to merge this, then do a refactor in a future PR.
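
To illustrate the cleanup pattern from the memory-freeing bullet (names here are made up, not the actual `Fetch` code), each attempt's allocations are freed before the next attempt or before giving up:

    const std = @import("std");

    // Stand-in for a transient network read failure.
    fn readBody(buf: []u8) !void {
        _ = buf;
        return error.ConnectionResetByPeer;
    }

    fn fetchBody(gpa: std.mem.Allocator, retry_count: usize) ![]u8 {
        var attempt: usize = 0;
        while (true) : (attempt += 1) {
            const buf = try gpa.alloc(u8, 4096);
            readBody(buf) catch |err| {
                // Free this attempt's buffer before retrying or giving up,
                // so a large retry count cannot leak one buffer per attempt.
                gpa.free(buf);
                if (attempt >= retry_count) return err;
                continue;
            };
            return buf;
        }
    }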

@joshua-holmes joshua-holmes marked this pull request as ready for review September 3, 2025 01:06
@@ -90,6 +90,12 @@ latest_commit: ?git.Oid,
/// the root source file.
module: ?*Package.Module,

/// The number of times an HTTP request will retry if it fails
retry_count: u32 = 2,
Collaborator

u32 for a retry counter is wasteful. The HTTP client's redirect counter uses a u16, which still seems on the higher side, but I suppose u8 might not be enough in some particular cases. No one is going to wait for 255 retries though.

Collaborator

For reference:

From zig/lib/std/http/Client.zig, lines 836 to 844 at d51d18c:

/// Any value other than `not_allowed` or `unhandled` means that integer represents
/// how many remaining redirects are allowed.
pub const RedirectBehavior = enum(u16) {
/// The next redirect will cause an error.
not_allowed = 0,
/// Redirects are passed to the client to analyze the redirect response
/// directly.
unhandled = std.math.maxInt(u16),
_,

Author

@joshua-holmes joshua-holmes Sep 3, 2025


Yeah, it was initially u8, but I should have bumped it up to u16 instead of u32. Fixed!

Author

My thought with making it larger than u8, though, is that someone might choose an arbitrarily "large" number like 1000 just so it keeps retrying for a long time. Seems silly, but not so silly that we should say "you can't do that", at least in my opinion.

.{ response.head.status, response.head.status.phrase() orelse "" },
));
if (response.head.status != .ok) {
// We only need to retry if we run into server-side errors (5xx)
Collaborator

That's not true; package fetching could easily run into a 429, for example. #17472 even mentions spurious 404s.

Author

@joshua-holmes joshua-holmes Sep 3, 2025

Oh that's a good point. Fixed!

@ehaas
Contributor

ehaas commented Sep 4, 2025

It might be useful to look at the defaults of other package managers for ideas - the cargo source has some good info about retries, including links to other resources: https://doc.rust-lang.org/nightly/nightly-rustc/src/cargo/util/network/retry.rs.html

Some notes:

  • Default retries is 3 (total of 4 attempts)
  • For HTTP failures, try parsing the "Retry-After" header, which could be a number of seconds or an HTTP date string referring to a date in the future
  • Otherwise, first delay is a randomized value between 500ms and 1500ms
  • Subsequent delays are linear backoff: 3500ms, 6500ms, etc., up to a max of 10 seconds
  • Classification of spurious (retryable) errors in maybe_spurious:
    • Various network-level errors like DNS failure or TCP errors or network timeouts
    • HTTP 5xx or HTTP 429
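
For example, a rough sketch of that classification in Zig (helper names are made up, and only the seconds form of Retry-After is handled here):

    const std = @import("std");

    // Retry on server errors (5xx) and 429, per the notes above.
    fn isRetryable(status: std.http.Status) bool {
        return status.class() == .server_error or status == .too_many_requests;
    }

    // Parses the seconds form of Retry-After; the HTTP-date form is omitted.
    fn retryAfterMs(header: ?[]const u8) ?u64 {
        const value = header orelse return null;
        const secs = std.fmt.parseInt(u64, value, 10) catch return null;
        return secs * std.time.ms_per_s;
    }

    // Linear backoff roughly matching cargo: jittered 500-1500ms first,
    // then +3000ms per attempt, capped at 10 seconds.
    fn backoffMs(random: std.Random, attempt: u64) u64 {
        const jitter = random.intRangeAtMost(u64, 500, 1500);
        return @min(jitter + attempt * 3000, 10 * std.time.ms_per_s);
    }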

@joshua-holmes
Author

joshua-holmes commented Sep 5, 2025

@ehaas Yeah, I also took a look at the uv Python package manager and found very similar behavior. I created this PR with the intention of keeping it really simple to start with and adding complexity if requested. Better that than the reverse, and I'm new here, so I don't know where the tolerances are.

I like that list of notes you jotted down though, so I'll start there.
