Commit 5953b40

Add retries design doc
1 parent 5afe217 commit 5953b40

2 files changed

+176
-1
lines changed

designs/exceptions.md

Lines changed: 6 additions & 1 deletion
@@ -37,7 +37,6 @@ retryability properties will be standardized as a `Protocol` that exceptions MAY
 implement.
 
 ```python
-@dataclass(kw_only=True)
 @runtime_checkable
 class ErrorRetryInfo(Protocol):
     is_retry_safe: bool | None = None
@@ -64,6 +63,8 @@ If an exception with `ErrorRetryInfo` is received while attempting to send a
 serialized request to the server, the contained information will be used to
 inform the next retry.
 
+See the retry design for more details on how this information is used.
+
 ### Service Errors
 
 Errors returned by the service MUST be a `CallError`. `CallError`s include a
@@ -82,6 +83,10 @@ type Fault = Literal["client", "server"] | None
 If None, then there was not enough information to determine fault.
 """
 
+@runtime_checkable
+class HasFault(Protocol):
+    fault: Fault
+
 
 @dataclass(kw_only=True)
 class CallError(SmithyError, ErrorRetryInfo):

designs/retries.md

Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
# Retries

Operation requests might fail for a number of reasons that are unrelated to the
input parameters, such as a transient network issue or excessive load on the
service. This document describes how Smithy clients will automatically retry in
those cases, and how the retry system can be modified.

## Specification

Retry behavior will be determined by a `RetryStrategy`. Implementations of the
`RetryStrategy` will produce `RetryToken`s that carry metadata about the
invocation, notably the number of attempts that have occurred and the amount of
time that must pass before the next attempt. Passing state through tokens in
this way allows the `RetryStrategy` itself to be isolated from the state of an
individual request.

```python
@dataclass(kw_only=True)
class RetryToken(Protocol):
    retry_count: int
    """Retry count is the total number of attempts minus the initial attempt."""

    retry_delay: float
    """Delay in seconds to wait before the retry attempt."""


class RetryStrategy(Protocol):
    backoff_strategy: RetryBackoffStrategy
    """The strategy used by returned tokens to compute delay duration values."""

    max_attempts: int
    """Upper limit on total attempt count (initial attempt plus retries)."""

    def acquire_initial_retry_token(
        self, *, token_scope: str | None = None
    ) -> RetryToken:
        """Called before any retries (for the first attempt at the operation).

        :param token_scope: An arbitrary string accepted by the retry strategy to
            separate tokens into scopes.
        :returns: A retry token, to be used for determining the retry delay, refreshing
            the token after a failure, and recording success after success.
        :raises RetryError: If the retry strategy has no available tokens.
        """
        ...

    def refresh_retry_token_for_retry(
        self, *, token_to_renew: RetryToken, error: Exception
    ) -> RetryToken:
        """Replace an existing retry token from a failed attempt with a new token.

        :param token_to_renew: The token used for the previous failed attempt.
        :param error: The error that triggered the need for a retry.
        :raises RetryError: If no further retry attempts are allowed.
        """
        ...

    def record_success(self, *, token: RetryToken) -> None:
        """Return token after successful completion of an operation.

        :param token: The token used for the previous successful attempt.
        """
        ...
```

A request using a `RetryStrategy` would look something like the following
example:

```python
try:
    retry_token = retry_strategy.acquire_initial_retry_token()
except RetryError:
    # No retry token is available, so make a single attempt without retries.
    transport_response = transport_client.send(serialized_request)
    return self._deserialize(transport_response)

while True:
    await asyncio.sleep(retry_token.retry_delay)
    try:
        transport_response = transport_client.send(serialized_request)
        response = self._deserialize(transport_response)
    except Exception as e:
        response = e

    if isinstance(response, Exception):
        try:
            # `e` is unbound once its `except` block exits, so pass `response`.
            retry_token = retry_strategy.refresh_retry_token_for_retry(
                token_to_renew=retry_token,
                error=response,
            )
            continue
        except RetryError as retry_error:
            raise retry_error from response

    retry_strategy.record_success(token=retry_token)
    return response
```

### Error Classification

Different types of exceptions may require different amounts of delay or may not
be retryable at all. To facilitate passing important information around,
exceptions may implement the `ErrorRetryInfo` and/or `HasFault` protocols. These
are defined in the exceptions design, but are reproduced here for ease of
reading:

```python
@runtime_checkable
class ErrorRetryInfo(Protocol):
    """A protocol for errors that have retry information embedded."""

    is_retry_safe: bool | None = None
    """Whether the error is safe to retry.

    A value of True does not mean a retry will occur, but rather that a retry is allowed
    to occur.

    A value of None indicates that there is not enough information available to
    determine if a retry is safe.
    """

    retry_after: float | None = None
    """The amount of time that should pass before a retry.

    Retry strategies MAY choose to wait longer.
    """

    is_throttling_error: bool = False
    """Whether the error is a throttling error."""


type Fault = Literal["client", "server"] | None
"""Whether the client or server is at fault.

If None, then there was not enough information to determine fault.
"""


@runtime_checkable
class HasFault(Protocol):
    fault: Fault
```

`RetryStrategy` implementations MUST raise a `RetryError` if they receive an
exception where `is_retry_safe` is `False` and SHOULD raise a `RetryError` if it
is `None`. `RetryStrategy` implementations SHOULD use a delay that is at least
as long as `retry_after` but MAY choose to wait longer.

### Backoff Strategy

Each `RetryStrategy` has a configurable `RetryBackoffStrategy`. This is a
stateless class that computes the next backoff delay based solely on the number
of retry attempts.

```python
class RetryBackoffStrategy(Protocol):
    def compute_next_backoff_delay(self, retry_attempt: int) -> float:
        ...
```

Backoff strategies can be as simple as waiting a number of seconds equal to the
number of retry attempts, but the resulting one-second delay before the first
retry would be unacceptably long. A default backoff strategy called
`ExponentialRetryBackoffStrategy` is available that uses exponential backoff
with configurable jitter.
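
To make the idea concrete, here is a minimal sketch of exponential backoff with full jitter. It is not the `ExponentialRetryBackoffStrategy` itself: the class name `ExponentialBackoffSketch`, its parameters, and their default values are assumptions made for illustration. Only the `compute_next_backoff_delay` signature comes from the protocol above.

```python
import random
from collections.abc import Callable


class ExponentialBackoffSketch:
    """Hypothetical stateless backoff: delay depends only on the retry attempt."""

    def __init__(
        self,
        *,
        base_delay: float = 0.025,
        max_delay: float = 20.0,
        rng: Callable[[], float] = random.random,
    ) -> None:
        self._base_delay = base_delay
        self._max_delay = max_delay
        # rng returns a float in [0, 1); injectable so tests can be deterministic.
        self._rng = rng

    def compute_next_backoff_delay(self, retry_attempt: int) -> float:
        if retry_attempt <= 0:
            # The initial attempt is not delayed.
            return 0.0
        # Double the ceiling on every retry; cap the exponent to avoid overflow.
        exponent = min(retry_attempt - 1, 30)
        ceiling = min(self._base_delay * 2.0**exponent, self._max_delay)
        # Full jitter: pick a random point between 0 and the ceiling.
        return self._rng() * ceiling
```

Because the object keeps no per-request state, one instance can be shared by every request a client makes.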

Having the backoff calculation be stateless and separate allows the
`RetryStrategy` to handle any extra context that may have wider scope. For
example, a `RetryStrategy` could use a token bucket to limit retries
client-wide so that the client can limit the amount of load it is placing on the
server. Decoupling this logic from the straightforward math of delay computation
allows both components to be evolved separately.
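
As an illustration of the token bucket mentioned above, the following sketch shows a client-wide quota that the retry machinery could consult before granting a retry. The class `RetryQuota`, its method names, and the capacity and cost values are all hypothetical. This design does not specify them.

```python
import threading


class RetryQuota:
    """Hypothetical client-wide token bucket shared by all requests."""

    def __init__(self, *, capacity: int = 500) -> None:
        self._capacity = capacity
        self._available = capacity
        # Requests may run concurrently, so guard the shared counter.
        self._lock = threading.Lock()

    def try_acquire(self, cost: int) -> bool:
        """Take `cost` tokens for a retry; return False if the bucket is too low."""
        with self._lock:
            if cost > self._available:
                return False
            self._available -= cost
            return True

    def release(self, amount: int) -> None:
        """Return tokens after a success, never exceeding the capacity."""
        with self._lock:
            self._available = min(self._capacity, self._available + amount)
```

In this sketch, refreshing a token would call `try_acquire` and raise a `RetryError` when it returns `False`, while recording a success would call `release`. The delay computation stays untouched in the backoff strategy.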
