-
Notifications
You must be signed in to change notification settings - Fork 55
Retry webhook on DB errors #3528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3528 +/- ##
==========================================
+ Coverage 89.86% 89.89% +0.03%
==========================================
Files 380 381 +1
Lines 15469 15609 +140
==========================================
+ Hits 13901 14032 +131
- Misses 1568 1577 +9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…etry) + tests Introduce with_retry/2 and with_webhook_retry/2 with exponential backoff, optional jitter, and DBConnection.ConnectionError default predicate. Emit telemetry (:lightning, :retry, ...).
…ormalization Implement API.webhook_retry/0,/1 with defaults (attempts, delays, backoff, timeout, jitter) and value clamping. Add tests that delegate via Lightning.MockConfig.
…imeout from retry timeout Load WEBHOOK_RETRY_* into :webhook_retry when present. Set LightningWeb.Endpoint http.protocol_options.idle_timeout to max(60_000, timeout_ms + 15_000). Tests stub Lightning.Config via Mox and assert idle_timeout behaviors.
Add 'Webhook Retry Configuration' to deployment docs and sample WEBHOOK_RETRY_* vars to .env.example with guidance.
…n exhaustion Wrap WorkOrders.create_for with Retry.with_webhook_retry and include context telemetry. On DBConnection.ConnectionError exhaustion, respond 503 with Retry-After (timeout_ms/1000). Update/extend controller tests for success-after-retry and final 503.
91ccd4b
to
70f71b2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @elias-ba , great job, I mostly have questions here.
- Did you consider that if
max_retries=5
then we could potentially retry10 times
since we're retrying twice? - Now that we're capturing the
DBConnection
exception, will our monitoring services ever pick it up? (sentry, prometheus ..) - I'm surprised that the
WebhookAuth
plug was placed afterPlug.Parsers
all along. Great catch, I wonder if there was a reason it was moved later. Here is the original commit that you did which placed it beforePlug.Parsers
. Could you please double check as to why this was changed? - I see you've updated the error response payload, I have a feeling this might break workflows else where. I'm okay with it but could you just check with the implementation team to see if they ever match on the error response?
- Did you verify that these changes work okay on the billing app?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to do the manual tests (and look at some of the tests around bootstrap and config) but I am afraid I have run out of brain today and I don't want to take the chance that GH eats today's work overnight.
I flagged some cases in Lightning.Retry
where some transformations of configuration values do not have test coverage (e.g. in next_base_delay
). That stuff is tricky to test so you need to decide if it is worth the effort. If you do think so, I would suggest putting all of that sort of stuff into a module of its own - that way you can measure the changes made to config settings by looking directly at the output of the transformation rather than having to try and figure it out from retry behaviour.
defp build_config(opts) do | ||
merged = Keyword.merge(@default_opts, opts) | ||
|
||
%{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
end | ||
|
||
defp calculate_next_delay(base_delay, %{jitter: true}) when base_delay > 0 do | ||
max_jitter = div(base_delay, 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delay | ||
|> Kernel.*(config.backoff_factor) | ||
|> trunc() | ||
|> min(config.max_delay_ms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
| **Variable** | **Description** | **Default** | | ||
| -------------------------------- | ------------------------------------------------------------------------------------------------ | ----------: | | ||
| `WEBHOOK_RETRY_MAX_ATTEMPTS` | Maximum number of attempts (the first attempt runs immediately; backoffs occur between retries). | `5` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the first implementation of this, it feels like we have a lot of knobs that we can fiddle with - my bias is towards wondering if there is space for a simpler implementation that allows us to iterate towards the additional complexity as we see what is required?
@@ -487,16 +491,20 @@ defmodule Lightning.Config.Bootstrap do | |||
|
|||
url_scheme = env!("URL_SCHEME", :string, "https") | |||
|
|||
retry_timeout_ms = Lightning.Config.webhook_retry(:timeout_ms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to decide why this feels weird. I am not sure if you are familiar with the concept of "layering" as suggested by Eric Evans but this feels like it might reversing a pattern, here.
I.e. in my head, I think of the "layers" within how config gets accessed as follows:
Other application code
|_ Lightning.Config
|_ stuff in bootstrap
And by referring to Lightning.Config inside it bootstrap, it feels like we are reversing that 'layering' (quotes because it has been a long time since I read the DDD book so I may be mangling it :). Especially as Lightning.Config overrides the 'default_webhook_retry` values unconditionally?
As a rule of thumb I try to avoid this kind of stuff as it can get me into trouble, but it is subjective.
@impl true | ||
def webhook_retry do | ||
default_webhook_retry() | ||
|> Keyword.merge(Application.get_env(:lightning, :webhook_retry, [])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this happens unconditionally, could we not just set defaults in bootstrap and do away with default_webhook_retry
? That seems to be a pattern that is quite common?
Mimic.copy(Lightning.Retry) | ||
|
||
Mimic.expect(Lightning.Retry, :with_webhook_retry, fn _fun, _opts -> | ||
{:error, %DBConnection.ConnectionError{message: "db down"}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Mimic allow matching of arguments passed to the method being mocked? If not, I imagine you could perhaps do pattern matching in the function that handles the response?
For me, when I am writing a test where the code is calling another function, the two questions I want answered are:
- Does the code I am testing call the method correctly?
- Does the code I am testing do the right thing with the method response?
I think we may be missing coverage of the former, so it would be great if we could use Mimic for that.
|> Repo.preload([:workflow, :edges, :webhook_auth_methods]) | ||
|
||
refute conn.halted | ||
assert conn.assigns[:trigger] == expected_trigger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In cases where I am testing code that is performing a lookup, I like to have "negative" examples if it is cheap to do so - e.g. at least one other instance of trigger (in this case) that I am not interested to try and offer some reassurance that my code will still work when there is more than one instance of the thing that I am looking for (i.e. the way things would be in a non-test env).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Jambaar! Round 2 done - more silly questions. I just need to run the manual tests and then I am done :).
@@ -30,7 +33,16 @@ defmodule Lightning.Config.BootstrapTest do | |||
|
|||
describe "prod" do | |||
setup do | |||
Process.put(@opts_key, {:prod, ""}) | |||
Process.put({Config, :opts}, {:prod, ""}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an advantage to using {Config, opts}
instead of @opts_key
? Or @config_key
, @import_key
?
@@ -471,4 +530,18 @@ defmodule Lightning.Config.BootstrapTest do | |||
nil -> nil | |||
end | |||
end | |||
|
|||
defp reconfigure(envs) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Given that it these helpers re only used in a single block of tests, how would you feel about moving the method definitions to be closer to the tests that use them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is it worth extending this method a little bit so that we can also use it in the setup
block?
timeout_ms: env!("WEBHOOK_RETRY_TIMEOUT_MS", :integer, nil), | ||
jitter: env!("WEBHOOK_RETRY_JITTER", &Utils.ensure_boolean/1, nil) | ||
] | ||
|> Enum.reject(fn {_, value} -> is_nil(value) end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice - although someone should add a compact
method for Elixir! :P
Is there any readability value to code-golfing this to extend the pipeline into a cond
to replace the if
?
] | ||
end | ||
|
||
defp normalize_retry(opts) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a lot of overlap between this and Retry.build_config
? Is there a case to be made for deferring the normalisation to build_config
?
result = | ||
Retry.with_retry( | ||
fn -> | ||
:counters.add(attempts, 1, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL - thanks, very cool!
end | ||
|
||
@spec retriable_error?(term()) :: boolean() | ||
def retriable_error?({:error, %DBConnection.ConnectionError{}}), do: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be a duplicate of default_retry_check
- what if we made retriable_error?
the default for retriable_on
, then you would need one less argument in the two current calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jerejef @elias-ba - manual tests done - all looks good. Great job - none of my comments are showstoppers, so feel free to implement any/none of the suggestions :) - I am going to approve in the meantime so that I do not block you, but happy to do follow-ups if you require!
@elias-ba Almost, forgot! This: |
Hey @midigofrank I handled most of your change requests / questions. Thanks a lot for your eagle eyes on this. This is really great. Here are few responses to your general questions
|
Hey @rorymckinley sorry that commit about telemetry never got into this PR, I am happy you asked. You can find all the telemetry events now in this commit |
Description
Adds resilient webhook processing:
WebhooksController#create
now retries transient database connection errors using a newLightning.Retry
helper with exponential backoff and optional jitter. Retry behavior is configurable viaLightning.Config.webhook_retry
(optionally set byWEBHOOK_RETRY_*
env vars). If retries are exhausted, the endpoint returns 503 Service Unavailable with a Retry-After header based on the configuredtimeout_ms
.Closes #3097
Validation steps
Create a workflow and copy its webhook URL
(Any workflow with a webhook trigger is fine.)
Happy path still works
Expected:
200 OK
with body like:.env
and restart the appCreate or edit
.env
in the project root:Reload env and start the server (bash/zsh):
env $(cat .env | grep -v "#" | xargs) iex -S mix phx.server
Stop Postgres (cmd above).
Expected:
503 Service Unavailable
Retry-After: 5
Hit the same endpoint again (or any valid
/i/:id
):Expected (from
WebhookAuth
plug):503 Service Unavailable
Retry-After: 5
timeout_ms
).200
and awork_order_id
(no 503).GET <webhook_url>
→200
with “Make a POST request…” message.Expected:
415 Unsupported Media Type
with{"error":"Unsupported Media Type"}
.Run the POST from step 6 with the DB down and check the app logs. You should see lines like:
retry sleeping attempt=... delay_ms=...
retry exhausted attempts=...
retry succeeded after ... attempts
Additional notes for the reviewer
Idle timeout: Default is now
max(60_000, retry_timeout_ms + 15_000)
to avoid the HTTP connection closing while webhook DB retries are in progress.Error shape (limits): Rate/usage limit responses now include an error code and message.
402 → {"error":"runs_hard_limit","message":"Runs limit exceeded"}
429 → {"error":"too_many_requests","message":"Too many runs in the last minute"}
Telemetry:
Lightning.Retry
emits:start
,:attempt
,:stop
,:exhausted
,:timeout
under[:lightning, :retry, ...]
.Docs & envs:
DEPLOYMENT.md
and.env.example
documentWEBHOOK_RETRY_*
.Backwards compatibility: If no
WEBHOOK_RETRY_*
envs are set, sensible defaults apply; existing behavior remains unchanged.AI Usage
Please tick what applies for this PR:
You can read more details in our Responsible AI Policy
Pre-submission checklist