Catch Request Errors#225
Conversation
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThis pull request adds network failure resilience to invite-checking and configuration functions. A new error code constant Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 52 minutes and 41 seconds.Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
utils/connect.py (1)
27-27: 💤 Low valueConsider adding timeout to
resend_connect_invitefor consistency.This
requests.postcall has no timeout, which means it could block indefinitely ifconnect.dimagi.comis slow or unreachable. While this isn't in the criticalstart_configurationpath, applying the sameCONNECT_REQUEST_TIMEOUTwould provide consistent resilience across all external calls in this module.♻️ Suggested fix
def resend_connect_invite(user): url = settings.CONNECT_RESEND_INVITES_URL auth = (settings.COMMCARE_CONNECT_CLIENT_ID, settings.COMMCARE_CONNECT_CLIENT_SECRET) data = { "phone_number": user.phone_number.as_e164, "username": user.username, "name": user.name, } - requests.post(url, auth=auth, data=data) + requests.post(url, auth=auth, data=data, timeout=CONNECT_REQUEST_TIMEOUT)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@utils/connect.py` at line 27, The requests.post call inside resend_connect_invite currently has no timeout; update the call to pass the module constant CONNECT_REQUEST_TIMEOUT (e.g. requests.post(url, auth=auth, data=data, timeout=CONNECT_REQUEST_TIMEOUT)) so the function uses the same timeout as other external calls in this module (resend_connect_invite and CONNECT_REQUEST_TIMEOUT are the symbols to change/verify).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@utils/connect.py`:
- Line 27: The requests.post call inside resend_connect_invite currently has no
timeout; update the call to pass the module constant CONNECT_REQUEST_TIMEOUT
(e.g. requests.post(url, auth=auth, data=data, timeout=CONNECT_REQUEST_TIMEOUT))
so the function uses the same timeout as other external calls in this module
(resend_connect_invite and CONNECT_REQUEST_TIMEOUT are the symbols to
change/verify).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7a63ea32-e073-450c-96c3-ad79f255f132
📒 Files selected for processing (5)
users/tests/test_views.pyutils/app_integrity/const.pyutils/app_integrity/decorators.pyutils/connect.pyutils/tests/test_connect.py
calellowitz
left a comment
There was a problem hiding this comment.
I think the change related to signup is fine, but I think the toggle change is pretty dangerous
|
(this didn't save before so adding outside the review) I think returning an empty dict on the toggle call is pretty dangerous. If the toggle is meant to define behavior on the mobile (or here in connectid), returning empty toggles could unpredictably alter user behavior from one request to the next, especially if default behavior in the absence of a toggle is different from what was otherwise set for the user. |
Technical Summary
Link to ticket here
This is a release path 1 feature — Improvements to existing features & quick wins.
Fixes the SystemExit cascade behind Sentry issue CONNECT-ID-3F . When the synchronous inter-service call from
/users/start_configurationtoconnect.dimagi.com/users/invited_user/hung, the gunicorn worker was SIGABRT'd viagunicorn.workers.base.handle_abortbeforerequestscould raiseTimeout, which sent CommCare Android an empty/malformed body and crashed the app with NPE. The root cause was a timeout misalignment:requests.get(..., timeout=60)while gunicorn's worker timeout defaults to 30s, so gunicorn always won the race. The fix lowers the per-call timeout below gunicorn's worker timeout sorequests.exceptions.Timeoutactually fires, and converts it (along withConnectionErrorand otherRequestExceptionsubclasses) into a structured 503 JSON response that CommCare can parse.check_number_for_existing_invitesandget_connect_togglesnow use a namedCONNECT_REQUEST_TIMEOUT = 15constant. With gunicorn's 30s worker default, this leaves ~15s of headroom for the rest of the request.CONFIGURATION_TEMPORARILY_UNAVAILABLEerror code inutils/app_integrity/const.py. Surfaced to CommCare with HTTP 503 so the client can show a graceful "configuration unavailable, please try again" message instead of attempting to parse an empty body.require_app_integritywrapper now catchesrequests.exceptions.RequestExceptionfromcheck_number_for_existing_invitesand returns the 503 + structured JSON. We deliberately catchRequestException(the base class, coveringTimeout,ConnectionError,JSONDecodeError, etc.) rather thanException, so genuine programming bugs still surface to Sentry rather than being masked as "upstream unavailable."get_connect_togglesnow catchesRequestExceptionand returns{}. Toggles are non-essential tostart_configurationcompleting; failing the whole request because flag fetch failed would be the wrong tradeoff.Investigated and explicitly chose not to add retry logic. The 13-event Sentry pattern (clustered around 3 dates) is consistent with upstream incidents on
connect.dimagi.com, and (a) commcare-connect already has a Traefik retry middleware (commcare-connect-web-production-retry@docker) in front of its two EC2 backends, so aConnectionErrorreaching us means Traefik already failed both, and (b) the failure mode in the Sentry stack isTimeout(worker SIGABRT), which we wouldn't retry on anyway since retrying a slow upstream amplifies cascade load. If post-deploy data showsConnectionErrorevents not handled by upstream Traefik retry, retry can be added as a follow-up.Logging and monitoring
logger.exception(...), which fans out to Sentry with full stack trace, so post-deploy we can monitor:{}. Watching this for spikes will tell us if upstream is generally degraded or only the invite endpoint is slow./users/start_configuration(via existing infrastructure metrics) is now a meaningful upstream-health signal where it previously surfaced as worker death.Safety Assurance
Safety story
The change is strictly defensive: it converts an existing crash mode (worker SIGABRT, empty body to client) into a structured error response. There is no new request path, no new model field, no DB migration.
The 15s timeout is conservative —
connect.dimagi.com/users/invited_user/does a singleUserInvite.objects.filter(...).exists()query and normally returns in <1s. 15s only fires when the upstream is genuinely degraded, in which case we want to fail fast.RequestExceptionis the precise scope for the catch — it covers every network/timeout/decode failure mode thatrequests.get(...)plusresponse.json()can produce, while leaving programmer bugs (TypeError,AttributeError, etc.) to surface to Sentry as before.For toggles, returning
{}on failure preserves existing semantics in the "no Connect toggles configured" case — the response shape is unchanged. The view does not branch on toggle content.No data is read or written that wasn't read or written before. Reversible by revert.
I am confident that this change will not break current and/or previous versions of CommCare apps
Automated test coverage
users.tests.test_views.TestStartConfigurationView.test_returns_503_when_invite_check_fails— parametrized overrequests.exceptions.Timeoutandrequests.exceptions.ConnectionError, assertsstart_configurationreturns 503 with{"error_code": "CONFIGURATION_TEMPORARILY_UNAVAILABLE"}instead of crashing the worker.utils.tests.test_connect.TestCheckNumberForExistingInvites— happy path returns the parsedinvitedvalue;RequestExceptionpropagates so the decorator can translate it.utils.tests.test_connect.TestGetConnectToggles.test_returns_parsed_toggles_on_success— happy path.utils.tests.test_connect.TestGetConnectToggles.test_returns_empty_dict_when_upstream_fails— parametrized overTimeoutandConnectionError, asserts{}is returned (graceful degradation).TestStartConfigurationViewtests still pass without modification, confirming the structured-503 path is additive rather than replacing the success/integrity paths.QA Plan
QA will not be performed for this change. Below is the testing plan for reference:
connect.dimagi.com/users/invited_user/to be slow or unreachable (e.g., pointCONNECT_INVITED_USER_URLto a non-routable host or a sleep endpoint).POST /users/start_configurationwith a fresh+7426phone number from the CommCare Android PersonalID flow (or viacurl/ Postman with the required integrity headers).{"error_code": "CONFIGURATION_TEMPORARILY_UNAVAILABLE"}— not an empty body, connection reset, or 5xx HTML page.connect-idlog group, no new CONNECT-ID-3F events in Sentry).connect.dimagi.comconnectivity. Re-issue the request and verify it now returns 200 OK.connect.dimagi.com/api/toggles/to be slow or unreachable whileinvited_user/works normally. Verifystart_configurationstill returns 200 OK with"toggles": {}(or just the local Django Switch toggles, no Connect toggles).Failed to reach connect.dimagi.com to check existing invites for ...XXXXXXlog entries with phone-number suffix only (no full PII)."demo_user": true(or the expected payload), unchanged from before.Labels & Review