
Conversation

@elias-ba (Contributor) commented Aug 19, 2025

Description

Adds resilient webhook processing: WebhooksController#create now retries transient database connection errors using a new Lightning.Retry helper with exponential backoff and optional jitter. Retry behavior is configurable via Lightning.Config.webhook_retry (optionally set by WEBHOOK_RETRY_* env vars). If retries are exhausted, the endpoint returns 503 Service Unavailable with a Retry-After header based on the configured timeout_ms.
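For orientation, a rough sketch of the controller path this describes (not the exact implementation: the WorkOrders.create_for arguments and the error message text are illustrative, while with_webhook_retry, webhook_retry(:timeout_ms), and the Retry-After = timeout_ms/1000 rule come from the notes further down):

case Lightning.Retry.with_webhook_retry(fn ->
       WorkOrders.create_for(trigger, workflow: workflow, params: params)
     end) do
  {:ok, work_order} ->
    json(conn, %{work_order_id: work_order.id})

  {:error, %DBConnection.ConnectionError{}} ->
    # Retry budget exhausted: return 503 with Retry-After derived from timeout_ms.
    retry_after = div(Lightning.Config.webhook_retry(:timeout_ms), 1000)

    conn
    |> put_resp_header("retry-after", Integer.to_string(retry_after))
    |> put_status(:service_unavailable)
    |> json(%{
      error: "service_unavailable",
      message:
        "Unable to process request due to temporary database issues. " <>
          "Please try again in #{retry_after}s.",
      retry_after: retry_after
    })
end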

Closes #3097

Validation steps

Postgres controls I used (macOS packaged Postgres 17):

# start
sudo -u postgres /Library/PostgreSQL/17/bin/pg_ctl \
  -D /Library/PostgreSQL/17/data start -m fast

# stop
sudo -u postgres /Library/PostgreSQL/17/bin/pg_ctl \
  -D /Library/PostgreSQL/17/data stop -m fast

  1. Start app and DB
  • Start Postgres (cmd above)
  • Start the Phoenix app

  2. Create a workflow and copy its webhook URL
    (Any workflow with a webhook trigger is fine.)

  3. Happy path still works

curl -i -X POST <webhook_url> \
  -H 'content-type: application/json' \
  -d '{}'

Expected: 200 OK with body like:

{"work_order_id":"<uuid>"}

  4. Configure retry via .env and restart the app

Create or edit .env in the project root:

WEBHOOK_RETRY_MAX_ATTEMPTS=3
WEBHOOK_RETRY_INITIAL_DELAY_MS=100
WEBHOOK_RETRY_TIMEOUT_MS=5000   # 5s total retry budget → Retry-After=5
WEBHOOK_RETRY_JITTER=false      # optional; keeps timings deterministic

Reload env and start the server (bash/zsh):

env $(cat .env | grep -v "#" | xargs) iex -S mix phx.server
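
With the values above, the runtime config that bootstrap is described as building (see the commit notes further down) should look roughly like this; the keyword names are inferred from the env var names and may differ slightly:

# runtime-config sketch; `import Config` is assumed at the top of the file
config :lightning, :webhook_retry,
  max_attempts: 3,
  initial_delay_ms: 100,
  timeout_ms: 5_000,
  jitter: false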

  5. Simulate DB outage

Stop Postgres (cmd above).

  6. POST while DB is down → controller returns 503 + Retry-After

curl -i -X POST <webhook_url> \
  -H 'content-type: application/json' \
  -d '{}'

Expected:

  • Status: 503 Service Unavailable
  • Header: Retry-After: 5
  • Body:
{
  "error": "service_unavailable",
  "message": "Unable to process request due to temporary database issues. Please try again in 5s.",
  "retry_after": 5
}

  7. Plug path also returns 503 when lookup fails (DB still down)

Hit the same endpoint again (or any valid /i/:id):

curl -i -X POST <webhook_url> -H 'content-type: application/json' -d '{}'

Expected (from WebhookAuth plug):

  • Status: 503 Service Unavailable
  • Header: Retry-After: 5
  • Body:
{
  "error": "service_unavailable",
  "message": "Temporary database issue during webhook lookup. Please retry in 5s.",
  "retry_after": 5
}

  8. Recovery: request eventually succeeds if DB comes back before timeout
  • With DB still stopped, run the POST from step 3 in one terminal.
  • Quickly start Postgres in another terminal within 5s (the configured timeout_ms).
  • The in-flight request should complete with 200 and a work_order_id (no 503).

  9. Regression checks (unchanged behavior)
  • GET <webhook_url> → 200 with “Make a POST request…” message.
  • Send unsupported media type:
curl -i -X POST <webhook_url> \
  -H 'content-type: text/xml' -d '{}'

Expected: 415 Unsupported Media Type with {"error":"Unsupported Media Type"}.

  10. (Optional) Observe logs

Run the POST from step 6 with the DB down and check the app logs. You should see lines like:

  • retry sleeping attempt=... delay_ms=...
  • retry exhausted attempts=...
  • or, on success after a retry: retry succeeded after ... attempts

Additional notes for the reviewer

  • Idle timeout: Default is now max(60_000, retry_timeout_ms + 15_000) to avoid the HTTP connection closing while webhook DB retries are in progress.

  • Error shape (limits): Rate/usage limit responses now include an error code and message.

    • 402 → {"error":"runs_hard_limit","message":"Runs limit exceeded"}
    • 429 → {"error":"too_many_requests","message":"Too many runs in the last minute"}
  • Telemetry: Lightning.Retry emits :start, :attempt, :stop, :exhausted, :timeout under [:lightning, :retry, ...] (a small handler sketch follows these notes).

  • Docs & envs: DEPLOYMENT.md and .env.example document WEBHOOK_RETRY_*.

  • Backwards compatibility: If no WEBHOOK_RETRY_* envs are set, sensible defaults apply; existing behavior remains unchanged.
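
As referenced in the telemetry note above, a minimal handler for the [:lightning, :retry, ...] events could look like this; the event names come from that note, while the measurement and metadata shapes are assumptions:

require Logger

:telemetry.attach_many(
  "lightning-retry-logger",
  [
    [:lightning, :retry, :start],
    [:lightning, :retry, :attempt],
    [:lightning, :retry, :stop],
    [:lightning, :retry, :exhausted],
    [:lightning, :retry, :timeout]
  ],
  fn event, measurements, metadata, _config ->
    # Log every retry lifecycle event; a PromEx/Grafana integration would hang
    # off the same event names instead of logging.
    Logger.info("#{inspect(event)} #{inspect(measurements)} #{inspect(metadata)}")
  end,
  nil
)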

AI Usage

Please tick what applies for this PR:

  • Code generation (copilot but not intellisense)
  • Learning or fact checking
  • Strategy / design
  • Optimisation / refactoring
  • Translation / spellchecking / doc gen
  • Other
  • I have not used AI

You can read more details in our Responsible AI Policy

Pre-submission checklist

  • I have performed a self-review of my code.
  • I have implemented and tested all related authorization policies (N/A for this change; controller path already guarded).
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR.

@github-project-automation github-project-automation bot moved this to New Issues in v2 Aug 19, 2025
@elias-ba elias-ba changed the title 3097 retry webhook Retry webhook on DB errors Aug 19, 2025

codecov bot commented Aug 22, 2025

Codecov Report

❌ Patch coverage is 94.83871% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.89%. Comparing base (ecbe7d0) to head (dd9687f).
⚠️ Report is 1 commits behind head on main.

Files with missing lines       Patch %   Lines
lib/lightning/retry.ex         92.04%    7 Missing ⚠️
lib/lightning_web/utils.ex     93.33%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3528      +/-   ##
==========================================
+ Coverage   89.86%   89.89%   +0.03%     
==========================================
  Files         380      381       +1     
  Lines       15469    15609     +140     
==========================================
+ Hits        13901    14032     +131     
- Misses       1568     1577       +9     

☔ View full report in Codecov by Sentry.

…etry) + tests

Introduce with_retry/2 and with_webhook_retry/2 with exponential backoff, optional jitter, and DBConnection.ConnectionError default predicate. Emit telemetry (:lightning, :retry, ...).
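
To make the backoff concrete: with the initial delay used in the validation steps (100 ms), a hypothetical backoff factor of 2, and a max-delay cap, the sleeps between attempts would follow a schedule like the one below. The factor and cap values here are assumptions; the real defaults live in Retry.

defmodule BackoffSketch do
  # Sleeps between attempts: `attempts` tries means `attempts - 1` delays.
  def delays(initial_delay_ms, attempts, factor \\ 2, max_delay_ms \\ 10_000) do
    initial_delay_ms
    |> Stream.iterate(fn delay -> min(trunc(delay * factor), max_delay_ms) end)
    |> Enum.take(attempts - 1)
  end
end

# BackoffSketch.delays(100, 5) #=> [100, 200, 400, 800]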
…ormalization

Implement API.webhook_retry/0,/1 with defaults (attempts, delays, backoff, timeout, jitter) and value clamping. Add tests that delegate via Lightning.MockConfig.
…imeout from retry timeout

Load WEBHOOK_RETRY_* into :webhook_retry when present. Set LightningWeb.Endpoint http.protocol_options.idle_timeout to max(60_000, timeout_ms + 15_000). Tests stub Lightning.Config via Mox and assert idle_timeout behaviors.
Add 'Webhook Retry Configuration' to deployment docs and sample WEBHOOK_RETRY_* vars to .env.example with guidance.
…n exhaustion

Wrap WorkOrders.create_for with Retry.with_webhook_retry and include context telemetry. On DBConnection.ConnectionError exhaustion, respond 503 with Retry-After (timeout_ms/1000). Update/extend controller tests for success-after-retry and final 503.
@midigofrank (Collaborator) left a comment

Hey @elias-ba , great job, I mostly have questions here.

  1. Did you consider that if max_retries=5 then we could potentially retry 10 times since we're retrying twice?
  2. Now that we're capturing the DBConnection exception, will our monitoring services ever pick it up? (sentry, prometheus ..)
  3. I'm surprised that the WebhookAuth plug was placed after Plug.Parsers all along. Great catch, I wonder if there was a reason it was moved later. Here is the original commit that you did which placed it before Plug.Parsers. Could you please double-check why this was changed?
  4. I see you've updated the error response payload, I have a feeling this might break workflows elsewhere. I'm okay with it but could you just check with the implementation team to see if they ever match on the error response?
  5. Did you verify that these changes work okay on the billing app?

@github-project-automation github-project-automation bot moved this from New Issues to In review in v2 Aug 25, 2025
@elias-ba elias-ba requested a review from rorymckinley August 25, 2025 09:10
@rorymckinley (Collaborator) left a comment

I still need to do the manual tests (and look at some of the tests around bootstrap and config) but I am afraid I have run out of brain today and I don't want to take the chance that GH eats today's work overnight.

I flagged some cases in Lightning.Retry where some transformations of configuration values do not have test coverage (e.g. in next_base_delay). That stuff is tricky to test so you need to decide if it is worth the effort. If you do think so, I would suggest putting all of that sort of stuff into a module of its own - that way you can measure the changes made to config settings by looking directly at the output of the transformation rather than having to try and figure it out from retry behaviour.

defp build_config(opts) do
  merged = Keyword.merge(@default_opts, opts)

  %{
Collaborator

Some of these transformations do not appear to have test coverage, e.g.

[screenshot of the relevant code]

end

defp calculate_next_delay(base_delay, %{jitter: true}) when base_delay > 0 do
  max_jitter = div(base_delay, 4)
Collaborator

The "normalisation" of the jitter does not appear to have test coverage:

[screenshot of the relevant code]

delay
|> Kernel.*(config.backoff_factor)
|> trunc()
|> min(config.max_delay_ms)
Collaborator

The min and application of config.backoff_factor do not appear to have test coverage.

[screenshots of the relevant code]


| **Variable** | **Description** | **Default** |
| -------------------------------- | ------------------------------------------------------------------------------------------------ | ----------: |
| `WEBHOOK_RETRY_MAX_ATTEMPTS` | Maximum number of attempts (the first attempt runs immediately; backoffs occur between retries). | `5` |
Collaborator

For the first implementation of this, it feels like we have a lot of knobs that we can fiddle with - my bias is towards wondering if there is space for a simpler implementation that allows us to iterate towards the additional complexity as we see what is required?

@@ -487,16 +491,20 @@ defmodule Lightning.Config.Bootstrap do

url_scheme = env!("URL_SCHEME", :string, "https")

retry_timeout_ms = Lightning.Config.webhook_retry(:timeout_ms)
Collaborator

I am trying to decide why this feels weird. I am not sure if you are familiar with the concept of "layering" as suggested by Eric Evans, but this feels like it might be reversing a pattern here.

I.e. in my head, I think of the "layers" within how config gets accessed as follows:

Other application code
|_ Lightning.Config
|_ stuff in bootstrap

And by referring to Lightning.Config inside bootstrap, it feels like we are reversing that 'layering' (quotes because it has been a long time since I read the DDD book so I may be mangling it :). Especially as Lightning.Config overrides the `default_webhook_retry` values unconditionally?

As a rule of thumb I try to avoid this kind of stuff as it can get me into trouble, but it is subjective.

@impl true
def webhook_retry do
  default_webhook_retry()
  |> Keyword.merge(Application.get_env(:lightning, :webhook_retry, []))
Collaborator

Given that this happens unconditionally, could we not just set defaults in bootstrap and do away with default_webhook_retry? That seems to be a pattern that is quite common?

Mimic.copy(Lightning.Retry)

Mimic.expect(Lightning.Retry, :with_webhook_retry, fn _fun, _opts ->
  {:error, %DBConnection.ConnectionError{message: "db down"}}
Collaborator

Does Mimic allow matching of arguments passed to the method being mocked? If not, I imagine you could perhaps do pattern matching in the function that handles the response?

For me, when I am writing a test where the code is calling another function, the two questions I want answered are:

  • Does the code I am testing call the method correctly?
  • Does the code I am testing do the right thing with the method response?

I think we may be missing coverage of the former, so it would be great if we could use Mimic for that.
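
As far as I know, Mimic does not ship a separate argument-matcher API; the usual approach is to pattern match in the stub itself so a wrong call shape fails the test with a FunctionClauseError. A sketch along those lines (test code; the :context key mentioned in the comment is a hypothetical example of something to assert on):

Mimic.expect(Lightning.Retry, :with_webhook_retry, fn fun, opts when is_function(fun, 0) ->
  # The guard already checks the first argument; assert on opts as needed, e.g.
  # assert opts[:context] == :webhook (":context" is a hypothetical key name).
  assert Keyword.keyword?(opts)

  {:error, %DBConnection.ConnectionError{message: "db down"}}
end)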

|> Repo.preload([:workflow, :edges, :webhook_auth_methods])

refute conn.halted
assert conn.assigns[:trigger] == expected_trigger
Collaborator

In cases where I am testing code that is performing a lookup, I like to have "negative" examples if it is cheap to do so - e.g. at least one other instance of trigger (in this case) that I am not interested in, to offer some reassurance that my code will still work when there is more than one instance of the thing that I am looking for (i.e. the way things would be in a non-test env).
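
A minimal sketch of that idea for this trigger lookup test, assuming the suite's existing factory helper (insert/2 here is a stand-in) and leaving the rest of the test above unchanged:

# A second webhook trigger that the request should NOT resolve to,
# created before the plug runs.
_other_trigger = insert(:trigger, type: :webhook)

# ...run the plug exactly as in the test above, then:
refute conn.halted
assert conn.assigns[:trigger] == expected_trigger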

@rorymckinley (Collaborator) left a comment

Hey Jambaar! Round 2 done - more silly questions. I just need to run the manual tests and then I am done :).

@@ -30,7 +33,16 @@ defmodule Lightning.Config.BootstrapTest do

describe "prod" do
setup do
Process.put(@opts_key, {:prod, ""})
Process.put({Config, :opts}, {:prod, ""})
Collaborator

Is there an advantage to using {Config, opts} instead of @opts_key? Or @config_key, @import_key?

@@ -471,4 +530,18 @@ defmodule Lightning.Config.BootstrapTest do
nil -> nil
end
end

defp reconfigure(envs) do
Collaborator

Nice! Given that these helpers are only used in a single block of tests, how would you feel about moving the method definitions to be closer to the tests that use them?

Collaborator

Also, is it worth extending this method a little bit so that we can also use it in the setup block?

timeout_ms: env!("WEBHOOK_RETRY_TIMEOUT_MS", :integer, nil),
jitter: env!("WEBHOOK_RETRY_JITTER", &Utils.ensure_boolean/1, nil)
]
|> Enum.reject(fn {_, value} -> is_nil(value) end)
Collaborator

Nice - although someone should add a compact method for Elixir! :P

Is there any readability value to code-golfing this to extend the pipeline into a cond to replace the if?

]
end

defp normalize_retry(opts) do
Collaborator

There seems to be a lot of overlap between this and Retry.build_config? Is there a case to be made for deferring the normalisation to build_config?

result =
  Retry.with_retry(
    fn ->
      :counters.add(attempts, 1, 1)
Collaborator

TIL - thanks, very cool!

end

@spec retriable_error?(term()) :: boolean()
def retriable_error?({:error, %DBConnection.ConnectionError{}}), do: true
Collaborator

This appears to be a duplicate of default_retry_check - what if we made retriable_error? the default for retriable_on, then you would need one less argument in the two current calls?
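
A sketch of that suggestion, assuming the @default_opts keyword seen in build_config/1 earlier in this review; the plug and controller callers could then drop the explicit predicate argument:

@default_opts [
  # ...existing attempt/delay/backoff/timeout/jitter defaults stay as they are...
  retriable_on: &__MODULE__.retriable_error?/1
]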

@theroinaochieng theroinaochieng removed the request for review from taylordowns2000 August 27, 2025 09:11
@rorymckinley (Collaborator) left a comment

Jerejef @elias-ba - manual tests done - all looks good. Great job - none of my comments are showstoppers, so feel free to implement any/none of the suggestions :) - I am going to approve in the meantime so that I do not block you, but happy to do follow-ups if you require!

@rorymckinley (Collaborator) commented

@elias-ba Almost forgot! This: Telemetry: Lightning.Retry emits :start, :attempt, :stop, :exhausted, :timeout under [:lightning, :retry, ...]. is awesome! Could you point me to the code where this is happening, I would love to surface this in Grafana :)

@elias-ba (Contributor, Author) commented Aug 28, 2025

Hey @midigofrank, I handled most of your change requests / questions. Thanks a lot for your eagle eyes on this. This is really great. Here are a few responses to your general questions:

  1. Each call to Retry.with_webhook_retry/2 has its own cap. We run two separate retry loops in a request: one for auth lookup in the plug, one for create_workorder in the controller. So the same DB operation is never retried twice; we just have two different operations that can each retry up to max_attempts (default 5). Worst case per request you’ll see up to 5 failed auth lookups and up to 5 failed work-order creations, not 10 for the same step. The DB driver’s own reconnects don’t change that cap; they just influence how a single attempt behaves internally.

  2. Thanks for the catch, you’re right, we weren’t surfacing those because Retry rescues DBConnection.ConnectionError, so PlugCapture never saw them. I’ve pushed a change: we now explicitly capture exhausted retries to Sentry in the 503 path, and emit telemetry for the retry lifecycle plus a webhook.db_unavailable event for PromEx. This restores monitoring without spamming (only on exhaustion) and keeps payloads minimal (a rough sketch of this shape follows these responses). cc @rorymckinley

  3. Did a quick check, and this has no impact. We can safely place WebhookAuth before Replug. The switch noted in "Replace use of Application.get_env in Endpoint module" (#1541) was probably accidental.

  4. Nice catch, I restored the "Webhook not found" error.

  5. Verified, and all good on the billing app, as it should be. These changes don't touch anything that would break the billing app.
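
A hypothetical sketch of the exhaustion path described in point 2, using the standard Sentry and :telemetry APIs; the event name, the metadata, and the report_exhaustion/2 wrapper are assumptions, not the committed code:

defp report_exhaustion(conn, %DBConnection.ConnectionError{} = error) do
  # Report only when retries are exhausted, so Sentry is not spammed per attempt.
  Sentry.capture_exception(error, extra: %{path: conn.request_path})

  # Counter-style event that a PromEx plugin could subscribe to.
  :telemetry.execute([:lightning, :webhook, :db_unavailable], %{count: 1}, %{
    path: conn.request_path
  })
end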

@elias-ba (Contributor, Author) commented

> @elias-ba Almost forgot! This: Telemetry: Lightning.Retry emits :start, :attempt, :stop, :exhausted, :timeout under [:lightning, :retry, ...]. is awesome! Could you point me to the code where this is happening, I would love to surface this in Grafana :)

Hey @rorymckinley, sorry, that commit about telemetry never got into this PR. I'm happy you asked. You can find all the telemetry events now in this commit.

@elias-ba elias-ba requested a review from midigofrank August 28, 2025 03:56
Status: In review
Successfully merging this pull request may close: Retry DB inserts on inbox endpoint
3 participants