Skip to content

Conversation

OrangeFlag
Copy link

Hi,
First of all, thank you for this excellent library! It has been incredibly useful in my projects.

I’m submitting this pull request to add a feature that enables dynamic backoff support. In some cases, the called service may return a Retry-After header and expects us to wait for the specified time before retrying. However, this header is not always present. Additionally, there are scenarios where I’d like to set a zero backoff for certain errors, allowing an immediate retry.

This pull request introduces the following modifications:

  • Added a new error constructor, ErrWithBackoff, to specify dynamic backoff durations.
  • Updated RunFn to handle ErrWithBackoff by dynamically setting the backoff duration based on the error field.
  • Added a test to verify the dynamic backoff logic.

I believe this feature will make the library more flexible and useful in various retry scenarios.
Thank you for considering this contribution!

Copy link
Owner

@eapache eapache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! I've left some minor comments in the code.

It's been a while at this point since I've done much work in go, so I'm not as familiar with conventions around error-handling and typing anymore. Is it usual to define a new wrapping error class to handle cases like this? My first instinct would have been to define an interface for types that have a Backoff() time.Duration method instead, and let user errors implement that interface if they wish. But I don't know what actually ends up being easier, or what the current conventions are.

t.Error("run wrong number of times")
}

if time.Since(st) < 500*time.Millisecond {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried to avoid doing time-based assertions in the test suite, they tend to be quite flaky.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the test results may be flaky, but I couldn't think of any other way to check 🤔

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I don't assert on how long it actually sleeps (I trust go's timer to work) and I simply unit-test the return value of calcSleep (and then the bigger tests just assert on the number of times that the method gets retried, which should be deterministic).

Perhaps for this PR you could pass err into calcSleep and move the logic into that method, so you can test that logic in a unit test of that method without any time-based assertions?

r := New([]time.Duration{0, 10 * time.Millisecond}, nil)
st := time.Now()

err := r.Run(genWork([]error{ErrWithBackoff(errFoo, 500*time.Millisecond)}))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer to swap these times (so the retrier defaults to 500ms and the error overrides it to 10ms), so the test runs that much faster.

backoff time.Duration
}

func ErrWithBackoff(err error, backoff time.Duration) error {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add godocs for any new public types/methods/etc.

var err *errWithBackoff
var backoff time.Duration
if errors.As(ret, &err) {
backoff = err.backoff
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the response gives us a backoff value, does it still make sense to add some jitter to that? I'm not honestly sure...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with eapache's point.
I don't think Jitter should be unnecessarily considered by the client when given by Retry-After.
The client should add the Jitter because it is the same program or could be executed at the same time, and the Retry-After reverses the timing decision.

@tukeJonny
Copy link
Contributor

@OrangeFlag @eapache

Sorry to jump in but let me explain further.
As a fan of go-resiliency's retry packages, I am curious to offer my opinion.

I think it's strange to assume that the use of Retry-After, etc. will be based on the existing mechanism for generating backoffs.

What I mean is that in cases where an error occurs and the Retry-After returns, it is unlikely that there will be a mix of cases where it returns and cases where it does not return, so I think the point is that Retry-After (like, dynamically generated values) should be clearly segregated as a compliant back-off.
Also, the current backoffs are mismatched with the basic design of pre-generated values.

The current design you are proposing confuses the two and is quite implicit, making the existing backoff a design that “lies”.

The work functions that Run and RunCtx receive would be very difficult to modify in the current situation.
So, why not have two functions, RunWithDynamicBackoff and RunCtxWithDynamicBackoff?
In this case, we would rely on a “dynamic” backoff to be executed in the RunX function that would generate a backoff for each call.

I would like to solicit input from others on the name, but I think it is a reasonably significant change to have an inherently dynamic treatment enter into a static treatment.

I wouldn't be interrupting PR like this if I could discuss it first in an Issue, but I'm sorry if I'm stirring up the discussion.

@tukeJonny
Copy link
Contributor

Thanks for your contribution! I've left some minor comments in the code.

It's been a while at this point since I've done much work in go, so I'm not as familiar with conventions around error-handling and typing anymore. Is it usual to define a new wrapping error class to handle cases like this? My first instinct would have been to define an interface for types that have a Backoff() time.Duration method instead, and let user errors implement that interface if they wish. But I don't know what actually ends up being easier, or what the current conventions are.

I also think that this current code is just screwing with errors and should ideally be in the form of defining a Backoff interface.

Errors should be used for handling as values, but using them as a substitute for messaging seems pretty forced.

The above my suggestions were made to avoid destructive changes.
Also, if static and dynamic behaviors are mixed, it will be easy to introduce bugs and also difficult for users to understand, so why not prepare them in a way that they can be segregated? This is the reason why i suggested to create a separate form for the two.

@eapache
Copy link
Owner

eapache commented Feb 23, 2025

Hmm, to be honest it's the constructor (New) which takes the static list of backoff values. If you wanted to design a proper dynamic retry interface, it would involve a new constructor as well (either taking a function reference in the constructor, or leaving it up to the work function). But if you're baking more logic into the work function so it can return the backoff value, then does the classifier make sense either, or should that also just be a value the work function returns directly? Or should the classifier be responsible for calculating the dynamic backoff? Lots of questions / lots of options, maybe best designed as a fresh type.

@OrangeFlag I'm still open to some version of this feature, but I think it is a good point to consider the interface a little more carefully.

@tukeJonny
Copy link
Contributor

@eapache

Thanks for the reply.
It seems like a reasonable idea to me as well, that the backoff should be known at the constructor.
The “processing” is passed by function reference (work) when processing, but the “how to define the backoff value” is something that the retry executor must know.

On the other hand, I recognize that the Classifier is there to determine success or failure and to determine the next action to take.
In this PR proposal, we are talking about how to generate backoffs, assuming a situation where the client is not in a position to determine backoffs, so isn't this a different story? I believe that this is the case.
Even if the work function were to return a backoff value, would it not change whether the next action taken is a retry or not, and would it not only take longer to wait?

I believe it is excessive for the Classifier to have the responsibility of generating backoffs.
If we do that, users will always have to pay attention to the Classifier when they are concerned about handling backoffs or extending them, and we will have a situation where there are multiple concerns.
The reverse is also true.

Against the scale, this PR seems a little too brave.
How about we reopen the discussion with an Issue?
In particular, what is needed in the design to separate constructors and how to incorporate them.

@eapache
Copy link
Owner

eapache commented Feb 23, 2025

Yes, if somebody is interested in working on it, then that larger discussion should be happening in a different issue or PR at this point - the ideas being floated here are much bigger than Ilya initially proposed.

@eapache
Copy link
Owner

eapache commented Feb 23, 2025

I'm going to leave this PR open for the time being - this is a reasonably small extension of the existing implementation that doesn't really get in the way of existing users. While an ideal dynamic retrier probably has a fairly different interface, I don't want to let the perfect be the enemy of the good; If somebody wants to tackle the bigger problem, that's great, but if they just want to address my existing more specific comments, I'm not opposed to merging this anyway.

@yandzee
Copy link

yandzee commented Mar 23, 2025

Hey everyone!
I got here randomly while was checking if rate limiting module is there and read the entire discussion :)

Thx @eapache for this library, I'm currently using it.

My understanding on the topic, is that its not the responsibility of running function to decide on backoff value, in the same way as its not the responsibility of Retrier to calcSleep. So I would move all such logic into the interface like:

type Backoff interface {
    Duration(attempt int, err error) time.Duration
}

This approach would allow one to build their own backoff by chaining the existing one with custom error checking function. I could make a PR for that if you want guys.

wdyt?

@eapache
Copy link
Owner

eapache commented Mar 23, 2025

I think something like that interface could make sense, yes. As discussed though, switching from []time.Duration to something like that starts to impact the constructor and overall design of the type, which makes it not a very straight-forward addition, perhaps even warranting a whole new type/implementation. Happy to look at PRs if there is interest.

This PR is no longer the right place to be having this discussion though. Please file a ticket for any more discussions if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants