
Conversation

@elias-ba (Contributor) commented Aug 19, 2025

Description

Adds resilient webhook processing: WebhooksController#create now retries transient database connection errors using a new Lightning.Retry helper with exponential backoff and optional jitter. Retry behavior is configurable via Lightning.Config.webhook_retry (optionally set by WEBHOOK_RETRY_* env vars). If retries are exhausted, the endpoint returns 503 Service Unavailable with a Retry-After header based on the configured timeout_ms.
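For orientation, a rough sketch of the controller path this describes (not the exact implementation: the WorkOrders.create_for arguments and the error message text are illustrative, while with_webhook_retry, webhook_retry(:timeout_ms), and the Retry-After = timeout_ms/1000 rule come from the notes further down):

case Lightning.Retry.with_webhook_retry(fn ->
       WorkOrders.create_for(trigger, workflow: workflow, params: params)
     end) do
  {:ok, work_order} ->
    json(conn, %{work_order_id: work_order.id})

  {:error, %DBConnection.ConnectionError{}} ->
    # Retry budget exhausted: return 503 with Retry-After derived from timeout_ms.
    retry_after = div(Lightning.Config.webhook_retry(:timeout_ms), 1000)

    conn
    |> put_resp_header("retry-after", Integer.to_string(retry_after))
    |> put_status(:service_unavailable)
    |> json(%{
      error: "service_unavailable",
      message:
        "Unable to process request due to temporary database issues. " <>
          "Please try again in #{retry_after}s.",
      retry_after: retry_after
    })
end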

Closes #3097

Validation steps

Postgres controls I used (macOS packaged Postgres 17):

# start
sudo -u postgres /Library/PostgreSQL/17/bin/pg_ctl \
  -D /Library/PostgreSQL/17/data start -m fast

# stop
sudo -u postgres /Library/PostgreSQL/17/bin/pg_ctl \
  -D /Library/PostgreSQL/17/data stop -m fast

  1. Start app and DB
  • Start Postgres (cmd above)
  • Start the Phoenix app

  2. Create a workflow and copy its webhook URL
    (Any workflow with a webhook trigger is fine.)

  3. Happy path still works

curl -i -X POST <webhook_url> \
  -H 'content-type: application/json' \
  -d '{}'

Expected: 200 OK with body like:

{"work_order_id":"<uuid>"}

  4. Configure retry via .env and restart the app

Create or edit .env in the project root:

WEBHOOK_RETRY_MAX_ATTEMPTS=3
WEBHOOK_RETRY_INITIAL_DELAY_MS=100
WEBHOOK_RETRY_TIMEOUT_MS=5000   # 5s total retry budget → Retry-After=5
WEBHOOK_RETRY_JITTER=false      # optional; keeps timings deterministic

Reload env and start the server (bash/zsh):

env $(cat .env | grep -v "#" | xargs) iex -S mix phx.server
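
With the values above, the runtime config that bootstrap is described as building (see the commit notes further down) should look roughly like this; the keyword names are inferred from the env var names and may differ slightly:

# runtime-config sketch; `import Config` is assumed at the top of the file
config :lightning, :webhook_retry,
  max_attempts: 3,
  initial_delay_ms: 100,
  timeout_ms: 5_000,
  jitter: false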

  5. Simulate DB outage

Stop Postgres (cmd above).

  6. POST while DB is down → controller returns 503 + Retry-After

curl -i -X POST <webhook_url> \
  -H 'content-type: application/json' \
  -d '{}'

Expected:

  • Status: 503 Service Unavailable
  • Header: Retry-After: 5
  • Body:
{
  "error": "service_unavailable",
  "message": "Unable to process request due to temporary database issues. Please try again in 5s.",
  "retry_after": 5
}

  7. Plug path also returns 503 when lookup fails (DB still down)

Hit the same endpoint again (or any valid /i/:id):

curl -i -X POST <webhook_url> -H 'content-type: application/json' -d '{}'

Expected (from WebhookAuth plug):

  • Status: 503 Service Unavailable
  • Header: Retry-After: 5
  • Body:
{
  "error": "service_unavailable",
  "message": "Temporary database issue during webhook lookup. Please retry in 5s.",
  "retry_after": 5
}

  8. Recovery: request eventually succeeds if DB comes back before timeout
  • With DB still stopped, run the POST from step 3 in one terminal.
  • Quickly start Postgres in another terminal within 5s (the configured timeout_ms).
  • The in-flight request should complete with 200 and a work_order_id (no 503).

  9. Regression checks (unchanged behavior)
  • GET <webhook_url> → 200 with “Make a POST request…” message.
  • Send unsupported media type:
curl -i -X POST <webhook_url> \
  -H 'content-type: text/xml' -d '{}'

Expected: 415 Unsupported Media Type with {"error":"Unsupported Media Type"}.

  10. (Optional) Observe logs

Run the POST from step 6 with the DB down and check the app logs. You should see lines like:

  • retry sleeping attempt=... delay_ms=...
  • retry exhausted attempts=...
  • or, on success after a retry: retry succeeded after ... attempts

Additional notes for the reviewer

  • Idle timeout: Default is now max(60_000, retry_timeout_ms + 15_000) to avoid the HTTP connection closing while webhook DB retries are in progress.

  • Error shape (limits): Rate/usage limit responses now include an error code and message.

    • 402 → {"error":"runs_hard_limit","message":"Runs limit exceeded"}
    • 429 → {"error":"too_many_requests","message":"Too many runs in the last minute"}
  • Telemetry: Lightning.Retry emits :start, :attempt, :stop, :exhausted, :timeout under [:lightning, :retry, ...] (a small handler sketch follows these notes).

  • Docs & envs: DEPLOYMENT.md and .env.example document WEBHOOK_RETRY_*.

  • Backwards compatibility: If no WEBHOOK_RETRY_* envs are set, sensible defaults apply; existing behavior remains unchanged.
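
As referenced in the telemetry note above, a minimal handler for the [:lightning, :retry, ...] events could look like this; the event names come from that note, while the measurement and metadata shapes are assumptions:

require Logger

:telemetry.attach_many(
  "lightning-retry-logger",
  [
    [:lightning, :retry, :start],
    [:lightning, :retry, :attempt],
    [:lightning, :retry, :stop],
    [:lightning, :retry, :exhausted],
    [:lightning, :retry, :timeout]
  ],
  fn event, measurements, metadata, _config ->
    # Log every retry lifecycle event; a PromEx/Grafana integration would hang
    # off the same event names instead of logging.
    Logger.info("#{inspect(event)} #{inspect(measurements)} #{inspect(metadata)}")
  end,
  nil
)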

AI Usage

Please tick what applies for this PR:

  • Code generation (copilot but not intellisense)
  • Learning or fact checking
  • Strategy / design
  • Optimisation / refactoring
  • Translation / spellchecking / doc gen
  • Other
  • I have not used AI

You can read more details in our Responsible AI Policy

Pre-submission checklist

  • I have performed a self-review of my code.
  • I have implemented and tested all related authorization policies (N/A for this change; controller path already guarded).
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR.

@github-project-automation github-project-automation bot moved this to New Issues in v2 Aug 19, 2025
@elias-ba elias-ba changed the title 3097 retry webhook Retry webhook on DB errors Aug 19, 2025

codecov bot commented Aug 22, 2025

Codecov Report

❌ Patch coverage is 94.83871% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.89%. Comparing base (ecbe7d0) to head (dd9687f).
⚠️ Report is 1 commits behind head on main.

Files with missing lines       Patch %   Lines
lib/lightning/retry.ex         92.04%    7 Missing ⚠️
lib/lightning_web/utils.ex     93.33%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3528      +/-   ##
==========================================
+ Coverage   89.86%   89.89%   +0.03%     
==========================================
  Files         380      381       +1     
  Lines       15469    15609     +140     
==========================================
+ Hits        13901    14032     +131     
- Misses       1568     1577       +9     

☔ View full report in Codecov by Sentry.

…etry) + tests

Introduce with_retry/2 and with_webhook_retry/2 with exponential backoff, optional jitter, and DBConnection.ConnectionError default predicate. Emit telemetry (:lightning, :retry, ...).
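
To make the backoff concrete: with the initial delay used in the validation steps (100 ms), a hypothetical backoff factor of 2, and a max-delay cap, the sleeps between attempts would follow a schedule like the one below. The factor and cap values here are assumptions; the real defaults live in Retry.

defmodule BackoffSketch do
  # Sleeps between attempts: `attempts` tries means `attempts - 1` delays.
  def delays(initial_delay_ms, attempts, factor \\ 2, max_delay_ms \\ 10_000) do
    initial_delay_ms
    |> Stream.iterate(fn delay -> min(trunc(delay * factor), max_delay_ms) end)
    |> Enum.take(attempts - 1)
  end
end

# BackoffSketch.delays(100, 5) #=> [100, 200, 400, 800]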
…ormalization

Implement API.webhook_retry/0,/1 with defaults (attempts, delays, backoff, timeout, jitter) and value clamping. Add tests that delegate via Lightning.MockConfig.
…imeout from retry timeout

Load WEBHOOK_RETRY_* into :webhook_retry when present. Set LightningWeb.Endpoint http.protocol_options.idle_timeout to max(60_000, timeout_ms + 15_000). Tests stub Lightning.Config via Mox and assert idle_timeout behaviors.
Add 'Webhook Retry Configuration' to deployment docs and sample WEBHOOK_RETRY_* vars to .env.example with guidance.
…n exhaustion

Wrap WorkOrders.create_for with Retry.with_webhook_retry and include context telemetry. On DBConnection.ConnectionError exhaustion, respond 503 with Retry-After (timeout_ms/1000). Update/extend controller tests for success-after-retry and final 503.
@midigofrank (Collaborator) left a comment

Hey @elias-ba , great job, I mostly have questions here.

  1. Did you consider that if max_retries=5 then we could potentially retry 10 times since we're retrying twice?
  2. Now that we're capturing the DBConnection exception, will our monitoring services ever pick it up? (sentry, prometheus ..)
  3. I'm surprised that the WebhookAuth plug was placed after Plug.Parsers all along. Great catch, I wonder if there was a reason it was moved later. Here is the original commit that you did which placed it before Plug.Parsers. Could you please double-check why this was changed?
  4. I see you've updated the error response payload, I have a feeling this might break workflows elsewhere. I'm okay with it but could you just check with the implementation team to see if they ever match on the error response?
  5. Did you verify that these changes work okay on the billing app?

@github-project-automation github-project-automation bot moved this from New Issues to In review in v2 Aug 25, 2025
@elias-ba elias-ba requested a review from rorymckinley August 25, 2025 09:10
@rorymckinley (Collaborator) left a comment

I still need to do the manual tests (and look at some of the tests around bootstrap and config) but I am afraid I have run out of brain today and I don't want to take the chance that GH eats today's work overnight.

I flagged some cases in Lightning.Retry where some transformations of configuration values do not have test coverage (e.g. in next_base_delay). That stuff is tricky to test so you need to decide if it is worth the effort. If you do think so, I would suggest putting all of that sort of stuff into a module of its own - that way you can measure the changes made to config settings by looking directly at the output of the transformation rather than having to try and figure it out from retry behaviour.

defp build_config(opts) do
  merged = Keyword.merge(@default_opts, opts)

  %{
Collaborator

Some of these transformations do not appear to have test coverage, e.g.

[screenshot of the relevant code]

end

defp calculate_next_delay(base_delay, %{jitter: true}) when base_delay > 0 do
  max_jitter = div(base_delay, 4)
Collaborator

The "normalisation" of the jitter does not appear to have test coverage:

[screenshot of the relevant code]

delay
|> Kernel.*(config.backoff_factor)
|> trunc()
|> min(config.max_delay_ms)
Collaborator

The min and application of config.backoff_factor do not appear to have test coverage.

[screenshots of the relevant code]


| **Variable** | **Description** | **Default** |
| -------------------------------- | ------------------------------------------------------------------------------------------------ | ----------: |
| `WEBHOOK_RETRY_MAX_ATTEMPTS` | Maximum number of attempts (the first attempt runs immediately; backoffs occur between retries). | `5` |
Collaborator

For the first implementation of this, it feels like we have a lot of knobs that we can fiddle with - my bias is towards wondering if there is space for a simpler implementation that allows us to iterate towards the additional complexity as we see what is required?

@@ -487,16 +491,20 @@ defmodule Lightning.Config.Bootstrap do

url_scheme = env!("URL_SCHEME", :string, "https")

retry_timeout_ms = Lightning.Config.webhook_retry(:timeout_ms)
Collaborator

I am trying to decide why this feels weird. I am not sure if you are familiar with the concept of "layering" as suggested by Eric Evans, but this feels like it might be reversing a pattern here.

I.e. in my head, I think of the "layers" within how config gets accessed as follows:

Other application code
|_ Lightning.Config
|_ stuff in bootstrap

And by referring to Lightning.Config inside bootstrap, it feels like we are reversing that 'layering' (quotes because it has been a long time since I read the DDD book so I may be mangling it :). Especially as Lightning.Config overrides the `default_webhook_retry` values unconditionally?

As a rule of thumb I try to avoid this kind of stuff as it can get me into trouble, but it is subjective.

@impl true
def webhook_retry do
  default_webhook_retry()
  |> Keyword.merge(Application.get_env(:lightning, :webhook_retry, []))
Collaborator

Given that this happens unconditionally, could we not just set defaults in bootstrap and do away with default_webhook_retry? That seems to be a pattern that is quite common?

Mimic.copy(Lightning.Retry)

Mimic.expect(Lightning.Retry, :with_webhook_retry, fn _fun, _opts ->
  {:error, %DBConnection.ConnectionError{message: "db down"}}
Collaborator

Does Mimic allow matching of arguments passed to the method being mocked? If not, I imagine you could perhaps do pattern matching in the function that handles the response?

For me, when I am writing a test where the code is calling another function, the two questions I want answered are:

  • Does the code I am testing call the method correctly?
  • Does the code I am testing do the right thing with the method response?

I think we may be missing coverage of the former, so it would be great if we could use Mimic for that.
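
As far as I know, Mimic does not ship a separate argument-matcher API; the usual approach is to pattern match in the stub itself so a wrong call shape fails the test with a FunctionClauseError. A sketch along those lines (test code; the :context key mentioned in the comment is a hypothetical example of something to assert on):

Mimic.expect(Lightning.Retry, :with_webhook_retry, fn fun, opts when is_function(fun, 0) ->
  # The guard already checks the first argument; assert on opts as needed, e.g.
  # assert opts[:context] == :webhook (":context" is a hypothetical key name).
  assert Keyword.keyword?(opts)

  {:error, %DBConnection.ConnectionError{message: "db down"}}
end)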

|> Repo.preload([:workflow, :edges, :webhook_auth_methods])

refute conn.halted
assert conn.assigns[:trigger] == expected_trigger
Collaborator

In cases where I am testing code that is performing a lookup, I like to have "negative" examples if it is cheap to do so - e.g. at least one other instance of trigger (in this case) that I am not interested in, to offer some reassurance that my code will still work when there is more than one instance of the thing that I am looking for (i.e. the way things would be in a non-test env).
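
A minimal sketch of that idea for this trigger lookup test, assuming the suite's existing factory helper (insert/2 here is a stand-in) and leaving the rest of the test above unchanged:

# A second webhook trigger that the request should NOT resolve to,
# created before the plug runs.
_other_trigger = insert(:trigger, type: :webhook)

# ...run the plug exactly as in the test above, then:
refute conn.halted
assert conn.assigns[:trigger] == expected_trigger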

@rorymckinley (Collaborator) left a comment

Hey Jambaar! Round 2 done - more silly questions. I just need to run the manual tests and then I am done :).

@@ -30,7 +33,16 @@ defmodule Lightning.Config.BootstrapTest do

describe "prod" do
setup do
Process.put(@opts_key, {:prod, ""})
Process.put({Config, :opts}, {:prod, ""})
Collaborator

Is there an advantage to using {Config, opts} instead of @opts_key? Or @config_key, @import_key?

@@ -471,4 +530,18 @@ defmodule Lightning.Config.BootstrapTest do
nil -> nil
end
end

defp reconfigure(envs) do
Collaborator

Nice! Given that these helpers are only used in a single block of tests, how would you feel about moving the method definitions to be closer to the tests that use them?

Collaborator

Also, is it worth extending this method a little bit so that we can also use it in the setup block?

timeout_ms: env!("WEBHOOK_RETRY_TIMEOUT_MS", :integer, nil),
jitter: env!("WEBHOOK_RETRY_JITTER", &Utils.ensure_boolean/1, nil)
]
|> Enum.reject(fn {_, value} -> is_nil(value) end)
Collaborator

Nice - although someone should add a compact method for Elixir! :P

Is there any readability value to code-golfing this to extend the pipeline into a cond to replace the if?

]
end

defp normalize_retry(opts) do
Collaborator

There seems to be a lot of overlap between this and Retry.build_config? Is there a case to be made for deferring the normalisation to build_config?

result =
  Retry.with_retry(
    fn ->
      :counters.add(attempts, 1, 1)
Collaborator

TIL - thanks, very cool!

end

@spec retriable_error?(term()) :: boolean()
def retriable_error?({:error, %DBConnection.ConnectionError{}}), do: true
Collaborator

This appears to be a duplicate of default_retry_check - what if we made retriable_error? the default for retriable_on, then you would need one less argument in the two current calls?
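
A sketch of that suggestion, assuming the @default_opts keyword seen in build_config/1 earlier in this review; the plug and controller callers could then drop the explicit predicate argument:

@default_opts [
  # ...existing attempt/delay/backoff/timeout/jitter defaults stay as they are...
  retriable_on: &__MODULE__.retriable_error?/1
]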

@theroinaochieng theroinaochieng removed the request for review from taylordowns2000 August 27, 2025 09:11
@rorymckinley (Collaborator) left a comment

Jerejef @elias-ba - manual tests done - all looks good. Great job - none of my comments are showstoppers, so feel free to implement any/none of the suggestions :) - I am going to approve in the meantime so that I do not block you, but happy to do follow-ups if you require!

@rorymckinley (Collaborator) commented

@elias-ba Almost forgot! This: Telemetry: Lightning.Retry emits :start, :attempt, :stop, :exhausted, :timeout under [:lightning, :retry, ...]. is awesome! Could you point me to the code where this is happening, I would love to surface this in Grafana :)

@elias-ba (Contributor, Author) commented Aug 28, 2025

Hey @midigofrank, I handled most of your change requests / questions. Thanks a lot for your eagle eyes on this. This is really great. Here are a few responses to your general questions:

  1. Each call to Retry.with_webhook_retry/2 has its own cap. We run two separate retry loops in a request: one for auth lookup in the plug, one for create_workorder in the controller. So the same DB operation is never retried twice; we just have two different operations that can each retry up to max_attempts (default 5). Worst case per request you’ll see up to 5 failed auth lookups and up to 5 failed work-order creations, not 10 for the same step. The DB driver’s own reconnects don’t change that cap; they just influence how a single attempt behaves internally.

  2. Thanks for the catch, you’re right, we weren’t surfacing those because Retry rescues DBConnection.ConnectionError, so PlugCapture never saw them. I’ve pushed a change: we now explicitly capture exhausted retries to Sentry in the 503 path, and emit telemetry for the retry lifecycle plus a webhook.db_unavailable event for PromEx. This restores monitoring without spamming (only on exhaustion) and keeps payloads minimal (a rough sketch of this shape follows these responses). cc @rorymckinley

  3. Did a quick check, and this has no impact. We can safely place WebhookAuth before Replug. The switch noted in "Replace use of Application.get_env in Endpoint module" (#1541) was probably accidental.

  4. Nice catch, I restored the "Webhook not found" error.

  5. Verified, and all good on the billing app, as it should be. These changes don't touch anything that would break the billing app.
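
A hypothetical sketch of the exhaustion path described in point 2, using the standard Sentry and :telemetry APIs; the event name, the metadata, and the report_exhaustion/2 wrapper are assumptions, not the committed code:

defp report_exhaustion(conn, %DBConnection.ConnectionError{} = error) do
  # Report only when retries are exhausted, so Sentry is not spammed per attempt.
  Sentry.capture_exception(error, extra: %{path: conn.request_path})

  # Counter-style event that a PromEx plugin could subscribe to.
  :telemetry.execute([:lightning, :webhook, :db_unavailable], %{count: 1}, %{
    path: conn.request_path
  })
end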

@elias-ba (Contributor, Author) commented

> @elias-ba Almost forgot! This: Telemetry: Lightning.Retry emits :start, :attempt, :stop, :exhausted, :timeout under [:lightning, :retry, ...]. is awesome! Could you point me to the code where this is happening, I would love to surface this in Grafana :)

Hey @rorymckinley, sorry, that commit about telemetry never got into this PR. I'm happy you asked. You can find all the telemetry events now in this commit.

@elias-ba elias-ba requested a review from midigofrank August 28, 2025 03:56
Status: In review
Successfully merging this pull request may close: Retry DB inserts on inbox endpoint
3 participants