
Conversation


@joshua-holmes joshua-holmes commented Sep 3, 2025

Problem

As described in #17472, the package manager does not retry fetching when communication with the server fails in a way that could plausibly be resolved by retrying the fetch.

Solution

Automatically retry the fetch in these scenarios during http fetches:

  1. Connecting to the server fails
  2. Sending the payload to the server fails
  3. Server returns a non-200 status code

and these scenarios during git+http fetches:

  1. Discovering remote git server capabilities fails
  2. Creating a fetch stream fails
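
Roughly, each of these call sites gets wrapped in a retry loop. A minimal sketch of the shape, with hypothetical names (`fetchWithRetry` and `doFetch` are illustrative, not the actual code) and `retry_count`/`retry_delay_ms` mirroring the new `Fetch` fields:

    const std = @import("std");

    // Illustrative only: `doFetch` stands in for one of the failing
    // operations listed above (connect, send payload, etc.).
    fn fetchWithRetry(
        retry_count: u16,
        retry_delay_ms: u64,
        doFetch: *const fn () anyerror!void,
    ) anyerror!void {
        var attempt: u16 = 0;
        while (true) {
            doFetch() catch |err| {
                if (attempt >= retry_count) return err; // retries exhausted
                attempt += 1;
                std.Thread.sleep(retry_delay_ms * std.time.ns_per_ms);
                continue;
            };
            return;
        }
    }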

This PR does not expose the fetch config to the end user, for reasons covered in the next section. That is definitely worth discussing, but shipping this PR as is seems reasonable to me.

Major Considerations

  • I thought about how I could expose configuration of the retry behavior to the user (retry_count and retry_delay_ms). However, this proved difficult, so it is not part of this PR. Here are some thoughts I had, though:
    • I cannot expose config in the build.zig file, because (as far as I can tell) the build.zig file is not run until all the packages are fetched and ready to be built with the source code.
    • I do not want to include this config per dep (dependency), because how would we handle deps of deps? We don't want a library maintainer to determine how many times the end user refetches that library's deps; that's the end user's decision. Deps of deps could inherit the config, but then the end user isn't getting the control they are "promised" by having the config attached to each dep.
    • I do not want to include the config in the body of the manifest. It feels out of place, since that body isn't supposed to be about fetch config for dependencies; it's supposed to define the current package and point at its dependencies.
    • Maybe we allow config to be passed in the body of .dependencies, like below, so that .retry_count and other props apply to all dependencies while still living inside the .dependencies object. This is also easy to parse by checking the type of each prop: if it's a struct, it's a real dep; if it's anything else, it's config, not a real dep (see the sketch after this list). Fun idea, but it's not part of this PR and shouldn't be. Here is an example of the idea:
      .dependencies = .{
          .retry_count = 3,
          .retry_delay_ms = 1000,
          .zf = .{
              .url = "https://github.com/natecraddock/zf/archive/7aacbe6d155d64d15937ca95ca6c014905eb531f.tar.gz",
              .hash = "zf-0.10.3-OIRy8aiIAACLrBllz0zjxaH0aOe5oNm3KtEMyCntST-9",
              .lazy = true,
          },
          .vaxis = .{
              .url = "git+https://github.com/rockorager/libvaxis#1f41c121e8fc153d9ce8c6eb64b2bbab68ad7d23",
              .hash = "vaxis-0.1.0-BWNV_FUICQAFZnTCL11TUvnUr1Y0_ZdqtXHhd51d76Rn",
              .lazy = true,
          },
      },
  • It would be cool to show the retry progress like this after the first failure. But it might be worth extending std.Progress.Node first, so this is also not implemented in this PR:
    Compile Build Script
    └─ [1/2] Fetch Packages
       ├─ vaxis (retry 1/2)
       └─ zf
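
As a rough sketch of the type-based parsing idea from the .dependencies bullet above (purely hypothetical; nothing like this is in the PR), comptime reflection could split scalar config fields from struct-typed deps:

    const std = @import("std");

    // Hypothetical: scalar fields are fetch config, struct fields are real deps.
    const dependencies = .{
        .retry_count = 3,
        .retry_delay_ms = 1000,
        .zf = .{ .url = "https://example.com/zf.tar.gz", .lazy = true },
    };

    pub fn main() void {
        inline for (std.meta.fields(@TypeOf(dependencies))) |field| {
            if (@typeInfo(field.type) == .@"struct") {
                std.debug.print("dep: {s}\n", .{field.name});
            } else {
                std.debug.print("config: {s}\n", .{field.name});
            }
        }
    }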
    

Minor Considerations

  • I intentionally did not retry on failures that do not seem resolvable by simply refetching, e.g. while receiving the HTTP header; looking through the possible errors there, they mostly stem from invalid responses, redirects, or write failures.
  • There are some locations where memory is now freed that wasn't freed before. This is because, if the user configures Fetch to retry many times, I don't want each refetch of the network data to allocate without freeing, resulting in a memory leak (a sketch of this follows the list).
  • The Fetch.initResource() method needs to be refactored. But this PR is big enough, so it made sense to me to merge this, then do a refactor in a future PR.
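
To illustrate the cleanup pattern from the memory-freeing bullet (names here are made up, not the actual `Fetch` code), each attempt's allocations are freed before the next attempt or before giving up:

    const std = @import("std");

    // Stand-in for a transient network read failure.
    fn readBody(buf: []u8) !void {
        _ = buf;
        return error.ConnectionResetByPeer;
    }

    fn fetchBody(gpa: std.mem.Allocator, retry_count: usize) ![]u8 {
        var attempt: usize = 0;
        while (true) : (attempt += 1) {
            const buf = try gpa.alloc(u8, 4096);
            readBody(buf) catch |err| {
                // Free this attempt's buffer before retrying or giving up,
                // so a large retry count cannot leak one buffer per attempt.
                gpa.free(buf);
                if (attempt >= retry_count) return err;
                continue;
            };
            return buf;
        }
    }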

@joshua-holmes joshua-holmes marked this pull request as ready for review September 3, 2025 01:06
@@ -90,6 +90,12 @@ latest_commit: ?git.Oid,
/// the root source file.
module: ?*Package.Module,

/// The number of times an HTTP request will retry if it fails
retry_count: u32 = 2,
Collaborator

u32 for a retry counter is wasteful. The HTTP client's redirect counter uses a u16, which still seems on the higher side, but I suppose u8 might not be enough in some particular cases. No one is going to wait for 255 retries though.

Collaborator

For reference:

From zig/lib/std/http/Client.zig, lines 836 to 844 at d51d18c:

/// Any value other than `not_allowed` or `unhandled` means that integer represents
/// how many remaining redirects are allowed.
pub const RedirectBehavior = enum(u16) {
/// The next redirect will cause an error.
not_allowed = 0,
/// Redirects are passed to the client to analyze the redirect response
/// directly.
unhandled = std.math.maxInt(u16),
_,

Author

@joshua-holmes joshua-holmes Sep 3, 2025


Yeah, it was initially u8, but I should have bumped it up to u16 instead of u32. Fixed!

Author

My thought with making it larger than u8, though, is that someone might choose an arbitrarily "large" number like 1000 just so it keeps retrying for a long time. Seems silly, but not so silly that we should say "you can't do that", at least in my opinion.

.{ response.head.status, response.head.status.phrase() orelse "" },
));
if (response.head.status != .ok) {
// We only need to retry if we run into server-side errors (5xx)
Collaborator

That's not true; package fetching could easily run into a 429, for example. #17472 even mentions spurious 404s.

Author

@joshua-holmes joshua-holmes Sep 3, 2025

Oh that's a good point. Fixed!

@ehaas
Contributor

ehaas commented Sep 4, 2025

It might be useful to look at the defaults of other package managers for ideas - the cargo source has some good info about retries, including links to other resources: https://doc.rust-lang.org/nightly/nightly-rustc/src/cargo/util/network/retry.rs.html

Some notes:

  • Default retries is 3 (total of 4 attempts)
  • For HTTP failures, try parsing the "Retry-After" header, which could be a number of seconds or an HTTP date string referring to a date in the future
  • Otherwise, first delay is a randomized value between 500ms and 1500ms
  • Subsequent delays are linear backoff: 3500ms, 6500ms, etc., up to a max of 10 seconds
  • Classification of spurious (retryable) errors in maybe_spurious:
    • Various network-level errors like DNS failure or TCP errors or network timeouts
    • HTTP 5xx or HTTP 429
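
For example, a rough sketch of that classification in Zig (helper names are made up, and only the seconds form of Retry-After is handled here):

    const std = @import("std");

    // Retry on server errors (5xx) and 429, per the notes above.
    fn isRetryable(status: std.http.Status) bool {
        return status.class() == .server_error or status == .too_many_requests;
    }

    // Parses the seconds form of Retry-After; the HTTP-date form is omitted.
    fn retryAfterMs(header: ?[]const u8) ?u64 {
        const value = header orelse return null;
        const secs = std.fmt.parseInt(u64, value, 10) catch return null;
        return secs * std.time.ms_per_s;
    }

    // Linear backoff roughly matching cargo: jittered 500-1500ms first,
    // then +3000ms per attempt, capped at 10 seconds.
    fn backoffMs(random: std.Random, attempt: u64) u64 {
        const jitter = random.intRangeAtMost(u64, 500, 1500);
        return @min(jitter + attempt * 3000, 10 * std.time.ms_per_s);
    }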

@joshua-holmes
Author

joshua-holmes commented Sep 5, 2025

@ehaas Yeah, I also took a look at the uv Python package manager and found very similar behavior. I created this PR with the intention of keeping it really simple to start with and adding complexity if requested. Better that than the reverse, and I'm new here, so I don't know where the tolerances are.

I like that list of notes you jotted down though, so I'll start there.
