
Catch Request Errors #225

Open
zandre-eng wants to merge 4 commits into main from ze/resolve-time-out

Conversation

Contributor

@zandre-eng zandre-eng commented Apr 30, 2026

Technical Summary

Link to ticket here

This is a release path 1 feature — Improvements to existing features & quick wins.

Fixes the SystemExit cascade behind Sentry issue CONNECT-ID-3F. When the synchronous inter-service call from /users/start_configuration to connect.dimagi.com/users/invited_user/ hung, gunicorn aborted the worker with SIGABRT (handled in gunicorn.workers.base.handle_abort) before requests could raise Timeout. CommCare Android then received an empty or malformed body and crashed with an NPE. The root cause was a timeout misalignment: the call used requests.get(..., timeout=60) while gunicorn's worker timeout defaults to 30s, so gunicorn always won the race. The fix lowers the per-call timeout below gunicorn's worker timeout so that requests.exceptions.Timeout actually fires, and converts it (along with ConnectionError and other RequestException subclasses) into a structured 503 JSON response that CommCare can parse.

  • Per-call timeout lowered to 15s (utils/connect.py) — both check_number_for_existing_invites and get_connect_toggles now use a named CONNECT_REQUEST_TIMEOUT = 15 constant. With gunicorn's 30s worker default, this leaves ~15s of headroom for the rest of the request.
  • New CONFIGURATION_TEMPORARILY_UNAVAILABLE error code in utils/app_integrity/const.py. Surfaced to CommCare with HTTP 503 so the client can show a graceful "configuration unavailable, please try again" message instead of attempting to parse an empty body.
  • Decorator-level error handling (utils/app_integrity/decorators.py) — the require_app_integrity wrapper now catches requests.exceptions.RequestException from check_number_for_existing_invites and returns the 503 + structured JSON. We deliberately catch RequestException (the base class, covering Timeout, ConnectionError, JSONDecodeError, etc.) rather than Exception, so genuine programming bugs still surface to Sentry rather than being masked as "upstream unavailable."
  • Graceful degradation for toggles (utils/connect.py) — get_connect_toggles now catches RequestException and returns {}. Toggles are non-essential to start_configuration completing; failing the whole request because flag fetch failed would be the wrong tradeoff.
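As a rough illustration of the decorator-level handling described above, the following is a minimal, framework-free sketch and not the PR's actual code. The names `require_invite_check`, `slow_check`, and `start_configuration`, as well as the `(status, body)` tuple return shape, are hypothetical stand-ins for the Django view and `require_app_integrity`; the fallback exception classes exist only so the sketch runs without `requests` installed.

```python
import functools
import logging

try:
    from requests.exceptions import RequestException, Timeout
except ImportError:
    # Stand-ins so the sketch runs without requests installed.
    class RequestException(IOError):
        """Stand-in for requests.exceptions.RequestException."""

    class Timeout(RequestException):
        """Stand-in for requests.exceptions.Timeout."""

logger = logging.getLogger(__name__)

GUNICORN_WORKER_TIMEOUT = 30  # seconds; gunicorn's default worker timeout
CONNECT_REQUEST_TIMEOUT = 15  # seconds; must stay below the worker timeout
CONFIGURATION_TEMPORARILY_UNAVAILABLE = "CONFIGURATION_TEMPORARILY_UNAVAILABLE"


def require_invite_check(check_invites):
    """Hypothetical decorator: run the invite check, map request failures to 503."""

    def decorator(view):
        @functools.wraps(view)
        def wrapper(phone_number):
            try:
                invited = check_invites(phone_number)
            except RequestException:
                # Only the RequestException family is caught; a TypeError or
                # AttributeError raised by the check still propagates.
                logger.exception("Failed to reach upstream to check existing invites")
                return 503, {"error_code": CONFIGURATION_TEMPORARILY_UNAVAILABLE}
            return view(phone_number, invited)

        return wrapper

    return decorator


def slow_check(phone_number):
    """Simulates the upstream call timing out after CONNECT_REQUEST_TIMEOUT."""
    raise Timeout("upstream did not answer in time")


@require_invite_check(slow_check)
def start_configuration(phone_number, invited):
    return 200, {"invited": invited}


status, body = start_configuration("+74260001234")
headroom = GUNICORN_WORKER_TIMEOUT - CONNECT_REQUEST_TIMEOUT
```

With the simulated timeout, the wrapper returns the 503 payload instead of letting the exception escape, and `headroom` is the 15 seconds left for the rest of the request.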

Investigated and explicitly chose not to add retry logic. The 13-event Sentry pattern (clustered around 3 dates) is consistent with upstream incidents on connect.dimagi.com, and (a) commcare-connect already has a Traefik retry middleware (commcare-connect-web-production-retry@docker) in front of its two EC2 backends, so a ConnectionError reaching us means Traefik already failed both, and (b) the failure mode in the Sentry stack is Timeout (worker SIGABRT), which we wouldn't retry on anyway since retrying a slow upstream amplifies cascade load. If post-deploy data shows ConnectionError events not handled by upstream Traefik retry, retry can be added as a follow-up.

Logging and monitoring

  • Both new failure paths use logger.exception(...), which fans out to Sentry with full stack trace, so post-deploy we can monitor:
    • CONNECT-ID-3F SystemExit rate — should drop to ~zero. Continued occurrences would indicate a different code path is also exceeding gunicorn's 30s budget.
    • New "Failed to reach connect.dimagi.com to check existing invites" log line — counts the 503s we now return. The phone-number suffix is included (last 6 digits) so we can correlate with specific affected users when needed without logging full PII.
    • New "Failed to fetch toggles from connect.dimagi.com" log line — counts toggle-fetch failures that were silently degraded to {}. Watching this for spikes will tell us if upstream is generally degraded or only the invite endpoint is slow.
  • 503 response rate on /users/start_configuration (via existing infrastructure metrics) is now a meaningful upstream-health signal where it previously surfaced as worker death.
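The phone-number suffix logging mentioned above could look roughly like the sketch below. The helper names `phone_suffix` and `log_invite_check_failure` are hypothetical and do not appear in the PR; the `...` prefix is an assumption based on the log format quoted in the QA plan.

```python
import logging

logger = logging.getLogger(__name__)


def phone_suffix(phone_number: str, visible: int = 6) -> str:
    """Return only the last `visible` digits, prefixed with an ellipsis."""
    digits = "".join(ch for ch in phone_number if ch.isdigit())
    return "..." + digits[-visible:]


def log_invite_check_failure(phone_number: str) -> None:
    # The PR uses logger.exception(...) inside an except block; logger.error
    # is used here so the sketch also runs outside of one.
    logger.error(
        "Failed to reach connect.dimagi.com to check existing invites for %s",
        phone_suffix(phone_number),
    )


log_invite_check_failure("+74260001234")
```

Only the suffix reaches the log record, so events can be correlated with affected users without storing the full number.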

Safety Assurance

Safety story

  • The change is strictly defensive: it converts an existing crash mode (worker SIGABRT, empty body to client) into a structured error response. There is no new request path, no new model field, no DB migration.

  • The 15s timeout is conservative — connect.dimagi.com/users/invited_user/ does a single UserInvite.objects.filter(...).exists() query and normally returns in <1s. 15s only fires when the upstream is genuinely degraded, in which case we want to fail fast.

  • RequestException is the precise scope for the catch — it covers every network/timeout/decode failure mode that requests.get(...) plus response.json() can produce, while leaving programmer bugs (TypeError, AttributeError, etc.) to surface to Sentry as before.

  • For toggles, returning {} on failure preserves existing semantics in the "no Connect toggles configured" case — the response shape is unchanged. The view does not branch on toggle content.

  • No data is read or written that wasn't read or written before. Reversible by revert.

  • I am confident that this change will not break current and/or previous versions of CommCare apps
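The toggle fallback described in the safety story can be sketched as follows. This is an assumption-laden illustration, not the PR's `get_connect_toggles`: the function name `get_toggles_or_empty` and the injected `fetch` callable are hypothetical, and the fallback exception classes exist only so the sketch runs without `requests` installed.

```python
import logging

try:
    from requests.exceptions import RequestException, Timeout
except ImportError:
    # Stand-ins so the sketch runs without requests installed.
    class RequestException(IOError):
        """Stand-in for requests.exceptions.RequestException."""

    class Timeout(RequestException):
        """Stand-in for requests.exceptions.Timeout."""

logger = logging.getLogger(__name__)

CONNECT_REQUEST_TIMEOUT = 15  # seconds


def get_toggles_or_empty(fetch):
    """Return the fetched toggles, or {} if the upstream request fails."""
    try:
        return fetch(timeout=CONNECT_REQUEST_TIMEOUT)
    except RequestException:
        logger.exception("Failed to fetch toggles from connect.dimagi.com")
        return {}


def healthy_fetch(timeout):
    return {"example_toggle": True}  # hypothetical toggle name


def failing_fetch(timeout):
    raise Timeout("toggle endpoint did not answer in time")


ok = get_toggles_or_empty(healthy_fetch)
degraded = get_toggles_or_empty(failing_fetch)
```

Note that the caller cannot tell `degraded` apart from a genuine "no toggles configured" response, which is the tradeoff this design accepts.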

Automated test coverage

  • users.tests.test_views.TestStartConfigurationView.test_returns_503_when_invite_check_fails — parametrized over requests.exceptions.Timeout and requests.exceptions.ConnectionError, asserts start_configuration returns 503 with {"error_code": "CONFIGURATION_TEMPORARILY_UNAVAILABLE"} instead of crashing the worker.
  • utils.tests.test_connect.TestCheckNumberForExistingInvites — happy path returns the parsed invited value; RequestException propagates so the decorator can translate it.
  • utils.tests.test_connect.TestGetConnectToggles.test_returns_parsed_toggles_on_success — happy path.
  • utils.tests.test_connect.TestGetConnectToggles.test_returns_empty_dict_when_upstream_fails — parametrized over Timeout and ConnectionError, asserts {} is returned (graceful degradation).
  • All existing TestStartConfigurationView tests still pass without modification, confirming the structured-503 path is additive rather than replacing the success/integrity paths.
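A failure-path test of the kind listed above might be structured like this sketch. It uses `unittest` with `subTest` and `unittest.mock` in place of whatever parametrization the PR's test suite actually uses, and `get_toggles_or_empty` is a hypothetical stand-in for the function under test; the fallback exception classes exist only so the sketch runs without `requests` installed.

```python
import unittest
from unittest import mock

try:
    from requests.exceptions import ConnectionError as RequestsConnectionError
    from requests.exceptions import RequestException, Timeout
except ImportError:
    # Stand-ins so the sketch runs without requests installed.
    class RequestException(IOError):
        """Stand-in for requests.exceptions.RequestException."""

    class Timeout(RequestException):
        """Stand-in for requests.exceptions.Timeout."""

    class RequestsConnectionError(RequestException):
        """Stand-in for requests.exceptions.ConnectionError."""


def get_toggles_or_empty(fetch):
    """Hypothetical function under test: degrade to {} on request failure."""
    try:
        return fetch()
    except RequestException:
        return {}


class TestTogglesFallback(unittest.TestCase):
    def test_returns_parsed_toggles_on_success(self):
        fetch = mock.Mock(return_value={"example_toggle": True})
        self.assertEqual(get_toggles_or_empty(fetch), {"example_toggle": True})

    def test_returns_empty_dict_when_upstream_fails(self):
        # One sub-test per exception type, mirroring a parametrized test.
        for exc_type in (Timeout, RequestsConnectionError):
            with self.subTest(exc=exc_type.__name__):
                fetch = mock.Mock(side_effect=exc_type("upstream failure"))
                self.assertEqual(get_toggles_or_empty(fetch), {})
                fetch.assert_called_once()

    def test_programming_errors_still_propagate(self):
        fetch = mock.Mock(side_effect=TypeError("bug"))
        with self.assertRaises(TypeError):
            get_toggles_or_empty(fetch)
```

Saved as a module, the tests can be run with `python -m unittest`.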

QA Plan

QA will not be performed for this change. Below is the testing plan for reference:

  • In a staging environment, force connect.dimagi.com/users/invited_user/ to be slow or unreachable (e.g., point CONNECT_INVITED_USER_URL to a non-routable host or a sleep endpoint).
  • Hit POST /users/start_configuration with a fresh +7426 phone number from the CommCare Android PersonalID flow (or via curl / Postman with the required integrity headers).
  • Verify the response is HTTP 503 with body {"error_code": "CONFIGURATION_TEMPORARILY_UNAVAILABLE"}, not an empty body, connection reset, or 5xx HTML page.
  • Verify the gunicorn worker remains alive (no SystemExit in CloudWatch connect-id log group, no new CONNECT-ID-3F events in Sentry).
  • Restore connect.dimagi.com connectivity. Re-issue the request and verify it now returns 200 OK.
  • Force connect.dimagi.com/api/toggles/ to be slow or unreachable while invited_user/ works normally. Verify start_configuration still returns 200 OK with "toggles": {} (or just the local Django Switch toggles, no Connect toggles).
  • Verify Sentry receives the new Failed to reach connect.dimagi.com to check existing invites for ...XXXXXX log entries with phone-number suffix only (no full PII).
  • Confirm the existing happy-path: a phone number with an existing Connect invite returns 200 OK with "demo_user": true (or the expected payload), unchanged from before.

Labels & Review

  • The set of people pinged as reviewers is appropriate for the level of risk of the change

Comment thread utils/app_integrity/decorators.py Fixed

coderabbitai Bot commented Apr 30, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1bb39679-259b-4e9a-ace1-5019d1ea65de

📥 Commits

Reviewing files that changed from the base of the PR and between 3d4d09a and f7b608e.

📒 Files selected for processing (1)
  • utils/app_integrity/decorators.py

Walkthrough

This pull request adds network failure resilience to invite-checking and configuration functions. A new error code constant CONFIGURATION_TEMPORARILY_UNAVAILABLE is introduced. The require_app_integrity decorator now catches network request exceptions during invite verification and returns a 503 response instead of failing. A shared timeout constant is defined in utils/connect.py and applied to both check_number_for_existing_invites and get_connect_toggles. The latter function gains error handling to return an empty dict on network failures. Comprehensive test coverage is added for both success and failure paths across all modified utilities.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: adding error handling for network/request failures across the codebase. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the problem, solution, implementation details, testing, and safety rationale. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
utils/connect.py (1)

27-27: 💤 Low value

Consider adding timeout to resend_connect_invite for consistency.

This requests.post call has no timeout, which means it could block indefinitely if connect.dimagi.com is slow or unreachable. While this isn't in the critical start_configuration path, applying the same CONNECT_REQUEST_TIMEOUT would provide consistent resilience across all external calls in this module.

♻️ Suggested fix
 def resend_connect_invite(user):
     url = settings.CONNECT_RESEND_INVITES_URL
     auth = (settings.COMMCARE_CONNECT_CLIENT_ID, settings.COMMCARE_CONNECT_CLIENT_SECRET)
     data = {
         "phone_number": user.phone_number.as_e164,
         "username": user.username,
         "name": user.name,
     }
-    requests.post(url, auth=auth, data=data)
+    requests.post(url, auth=auth, data=data, timeout=CONNECT_REQUEST_TIMEOUT)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@utils/connect.py` at line 27, The requests.post call inside
resend_connect_invite currently has no timeout; update the call to pass the
module constant CONNECT_REQUEST_TIMEOUT (e.g. requests.post(url, auth=auth,
data=data, timeout=CONNECT_REQUEST_TIMEOUT)) so the function uses the same
timeout as other external calls in this module (resend_connect_invite and
CONNECT_REQUEST_TIMEOUT are the symbols to change/verify).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7a63ea32-e073-450c-96c3-ad79f255f132

📥 Commits

Reviewing files that changed from the base of the PR and between 1d4f968 and 3d4d09a.

📒 Files selected for processing (5)
  • users/tests/test_views.py
  • utils/app_integrity/const.py
  • utils/app_integrity/decorators.py
  • utils/connect.py
  • utils/tests/test_connect.py

Collaborator

@calellowitz calellowitz left a comment


I think the change related to signup is fine, but I think the toggle change is pretty dangerous

@calellowitz
Collaborator

(this didn't save before so adding outside the review)

I think returning an empty dict on the toggle call is pretty dangerous. If a toggle is meant to define behavior on mobile (or here in connectid), returning empty toggles could unpredictably alter user behavior from one request to the next, especially if the default behavior in the absence of a toggle differs from what was otherwise set for the user.


4 participants