
fix: add --forwarded-allow-ips to gunicorn so Django sees HTTPS scheme#1109

Open
jjackson wants to merge 1 commit into dimagi:main from jjackson:fix/gunicorn-forwarded-allow-ips

Conversation

@jjackson
Member

@jjackson jjackson commented Apr 8, 2026

Summary

One-line fix: add --forwarded-allow-ips="*" to the gunicorn command in docker/start so Django's request.scheme correctly reflects HTTPS when the request comes in through Traefik.

The bug

Gunicorn defaults --forwarded-allow-ips to 127.0.0.1, meaning it only trusts X-Forwarded-* headers from loopback. In our containerized Kamal + Traefik deployment, Traefik runs in a separate container so its source IP is NOT 127.0.0.1, and gunicorn silently strips the X-Forwarded-Proto: https header before handing the request to Django.

Downstream effects inside Django:

  • request.scheme returns 'http' even though the client request was HTTPS
  • request.is_secure() returns False
  • request.build_absolute_uri() emits URLs with http:// scheme
  • Any code path that constructs canonical URLs from the request is broken
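The trust check can be illustrated with a small standalone sketch. This is a simplified model of gunicorn's forwarded-header filtering, not its actual source; effective_scheme is a made-up helper:

```python
# Simplified model (NOT gunicorn's real code) of how
# --forwarded-allow-ips gates the X-Forwarded-Proto header.
def effective_scheme(client_ip, headers, forwarded_allow_ips="127.0.0.1"):
    """Return the scheme the WSGI app sees after header filtering."""
    trusted = (forwarded_allow_ips == "*"
               or client_ip in forwarded_allow_ips.split(","))
    if trusted and headers.get("X-Forwarded-Proto") == "https":
        return "https"
    return "http"  # header from an untrusted source is ignored

headers = {"X-Forwarded-Proto": "https"}
# Traefik in a sibling container, default setting: header is dropped.
print(effective_scheme("172.17.0.3", headers))       # http
# With --forwarded-allow-ips="*": header is honored.
print(effective_scheme("172.17.0.3", headers, "*"))  # https
```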

How we found it

During the CommCare Connect Labs v2 paginated-JSON export migration (replacing the v1 streaming CSV exports), we ran an end-to-end parity check against connect.dimagi.com and hit the bug immediately. IdKeysetPagination.get_next_link() in commcare_connect/data_export/pagination.py uses request.build_absolute_uri() to construct the next URL, and production was returning paginated responses with http:// in the next field:

import httpx
r = httpx.get(
    'https://connect.dimagi.com/export/opportunity/<id>/user_visits/',
    headers={
        'Authorization': 'Bearer ...',
        'Accept': 'application/json; version=2.0',
    },
)
# Status: 200
# Response body:
#   {
#     "next": "http://connect.dimagi.com/export/opportunity/<id>/user_visits/?last_id=...",
#     "results": [...]
#   }
#            ^^^^ should be https://

The server accepted the HTTPS request correctly but built the next URL with http:// because Django thought the original request was HTTP. Clients following the link hit the edge's canonical HTTPS redirect and received a 301, which broke any HTTP client that didn't enable redirect following (httpx's follow_redirects defaults to False).

The Labs client worked around it with redirect following, but the root cause is here — and this affects any code path that uses build_absolute_uri() or is_secure(), not just pagination. Latent examples I spotted while investigating:

  • Password-reset / email-verification URLs
  • Webhook callback URL construction
  • Any view that serializes absolute media URLs
  • The SECURE_SSL_REDIRECT middleware (currently defaults to False in staging.py, but would infinite-loop if ever flipped to True given the current gunicorn config)

The fix

--- a/docker/start
+++ b/docker/start
@@ -6,4 +6,4 @@ set -o nounset
 
 
-gunicorn config.wsgi --bind 0.0.0.0:8000 --chdir=/app --timeout 450 -w 10
+gunicorn config.wsgi --bind 0.0.0.0:8000 --chdir=/app --timeout 450 -w 10 --forwarded-allow-ips="*"

Tell gunicorn to trust X-Forwarded-* headers from any source.
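For reference, the same settings can also live in a gunicorn config file instead of the CLI. This is a hypothetical gunicorn.conf.py spelling of the same command line; the PR itself keeps the one-line CLI change:

```python
# gunicorn.conf.py: config-file form of the command above.
# All names below are documented gunicorn settings.
bind = "0.0.0.0:8000"
chdir = "/app"
timeout = 450
workers = 10
forwarded_allow_ips = "*"  # trust X-Forwarded-* from any source
```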

Why this is safe

Gunicorn binds to 0.0.0.0:8000 inside the Docker network, where only Traefik can reach it. There is no path for an untrusted source to inject forged forwarded headers — an external attacker would need to first compromise the container network, at which point forwarded-header spoofing is the least of your concerns.

This is also the documented pattern for gunicorn-behind-a-reverse-proxy setups: https://docs.gunicorn.org/en/stable/settings.html#forwarded-allow-ips

Alternative: specify the exact Traefik container's CIDR (e.g., 172.17.0.0/16 for the default Docker bridge network). That's slightly more defensive but brittle — any change to the Docker network layout breaks it. The "*" pattern is the standard recommendation for containerized reverse-proxy setups.

Test plan

After this lands and deploys:

  • Hit any paginated v2 endpoint:

    GET https://connect.dimagi.com/export/opportunity/<id>/user_visits/
    Accept: application/json; version=2.0
    Authorization: Bearer <token>
    
  • Verify the next field in the response body starts with https://, not http://

  • Follow the next URL in a second request and confirm it returns 200 directly (not 301)

  • Spot-check any existing feature that uses build_absolute_uri() to construct URLs (e.g., email templates, webhook URLs) — these should now generate https:// URLs

Related

  • Found during labs v2 export migration work (jjackson/connect-labs#55)
  • The Labs ExportAPIClient uses follow_redirects=True as a defensive workaround; that can stay in place even after this upstream fix lands, but will become a no-op once next URLs are correct.

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between dbe7be7 and ff08698.

📒 Files selected for processing (1)
  • docker/start

Walkthrough

The Docker start script was modified to add the --forwarded-allow-ips="*" flag to the Gunicorn command. This configuration change permits Gunicorn to accept client IP information from X-Forwarded-* headers from any source. The remaining Gunicorn settings—binding address, working directory, worker count, and timeout—remain unchanged.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks: 3 passed

  • Title check ✅ Passed: The title accurately describes the main change: adding --forwarded-allow-ips to gunicorn to fix Django's HTTPS scheme detection.
  • Description check ✅ Passed: The description is comprehensive and directly related to the changeset, explaining the bug, its impact, the fix, safety rationale, and test plan.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


jjackson added a commit to jjackson/scout that referenced this pull request Apr 8, 2026
…ated JSON

CommCare Connect deprecated its streaming CSV export endpoints in favor of
keyset-paginated JSON (Accept: application/json; version=2.0) to avoid the
5-minute gunicorn worker timeout on large UserVisit exports. Scout's seven
list-export loaders now follow the v2 contract: GET, then walk the `next`
URL until null.

Changes
- ConnectBaseLoader: add `_paginate_export_pages()` (yields page lists,
  follows `next`, sends versioned Accept header per-call so the metadata
  loader's non-versioned endpoints stay untouched), plus `stringify` /
  `stringify_record` helpers and `ConnectExportError`. Drop `_get_csv` and
  the csv/io imports.
- ConnectVisitLoader: rewrite `_normalize_visit` for v2 dict input, drop
  `ast.literal_eval` Python-repr fallback (form_json is now a real dict),
  preserve the `id` → `visit_id` rename, fall back to the loader's
  opportunity_id when the serializer omits it.
- 6 simple loaders (users, completed_works, payments, invoices,
  assessments, completed_modules): thin paginate-then-stringify loops.
  Now stream page-by-page rather than buffering the full export.

The downstream `_write_connect_*` writers in materializer.py are unchanged
— scalars are stringified to preserve the v1 CSV-shaped TEXT-column
contract; `form_json` and `images` pass through as native dict/list for
the JSONB columns.

Regression pin for dimagi/commcare-connect#1109
- Production has been observed returning `next` URLs with the `http://`
  scheme even on HTTPS requests (gunicorn `--forwarded-allow-ips` defaults
  to 127.0.0.1, strips `X-Forwarded-Proto`). Until #1109 lands, scout
  relies on requests' default `allow_redirects=True` to follow the edge's
  301 → https, and on `Session.should_strip_auth`'s same-host HTTP→HTTPS
  upgrade exception to preserve the bearer token across the redirect.
- Inline comment in `_paginate_export_pages` documents the dependency.
- New test `test_follows_http_to_https_redirect_on_next_url` simulates the
  bug end-to-end and asserts both row aggregation and bearer-token
  preservation.

Tests
- 56 loader tests pass (12 base + 13 visit + 31 simple-loader matrix).
- Full repo: 797 passed, 15 skipped, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
snopoke pushed a commit to dimagi-rad/scout that referenced this pull request Apr 11, 2026
* feat: add BASE_PATH config and Vite base path support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add BASE_PATH basename to routers and route detection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add popup OAuth flow for embed iframe authentication

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add iframe auto-resize via ResizeObserver and postMessage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add BASE_PATH and nginx config args to frontend Dockerfile

Add VITE_BASE_PATH and NGINX_CONF build args to Dockerfile.frontend
so connect-labs can build with /scout/ prefix. Add nginx.prod.conf
with /scout/ prefix stripping for the production deploy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: skip allauth OAuth confirmation page

Set SOCIALACCOUNT_LOGIN_ON_GET = True so clicking an OAuth button
goes straight to the provider instead of showing an unstyled
intermediate confirmation page.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: skip email requirement for OAuth signups

CommCare Connect's API doesn't return an email, causing allauth to
show an unstyled signup form asking for one. Set
SOCIALACCOUNT_EMAIL_REQUIRED = False so OAuth users are auto-created
without needing an email address.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add connectlabs Django settings module

Required by the ECS task definition which sets
DJANGO_SETTINGS_MODULE=config.settings.connectlabs.
Inherits production settings with ALB-specific overrides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: prefix all API fetch calls with BASE_PATH

The API client, chat transport, public pages, and onboarding wizard
all used bare /api/ and /accounts/ paths. Under a subpath deploy
like /scout/, these 404 because nginx only proxies /scout/api/ to
the backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: prefix OAuth login URLs with BASE_PATH

The providers API returns bare /accounts/... URLs which 404 under
a subpath deploy. Prefix both the href and the next redirect with
BASE_PATH.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: add labs smoke test and embed enhancements plan

- Playwright smoke test for labs.connect.dimagi.com end-to-end flow
- Implementation plan for embed enhancements (reference doc)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR #95 review comments

- Replace RegExp-based BASE_PATH stripping with startsWith/slice to
  avoid regex metacharacter bugs (e.g. dots in /scout.v2)
- Revert SOCIALACCOUNT_LOGIN_ON_GET to False to prevent Login CSRF
  (the popup OAuth flow doesn't need it)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: allow nullable email for OAuth users without email addresses

CommCare Connect users may not have an email. The User model had
email as unique+required, so the second email-less user caused a
UniqueViolation on the empty string. Now email is nullable — empty
values are stored as NULL (PostgreSQL allows multiple NULLs in a
unique column). Includes a data migration to convert existing empty
emails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: re-enable SOCIALACCOUNT_LOGIN_ON_GET for popup OAuth flow

The popup is user-initiated so Login CSRF isn't a concern. With this
set to False, users see an unstyled allauth confirmation page inside
the popup before being redirected to the provider.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove window.opener check for OAuth popup close

Cross-origin OAuth redirects (Scout → CommCare Connect → Scout) can
clear window.opener in modern browsers, preventing the popup from
auto-closing after login.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: prefix popup OAuth URL with BASE_PATH

The popup OAuth URL was missing the BASE_PATH prefix (/scout),
so the popup navigated to the wrong path and fell through to
ConnectLabs instead of Scout's allauth login endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: close OAuth popup instantly via inline script in index.html

The popup was loading the full React app before the popup_close check
ran, causing a visible flash. An inline script in index.html runs
before any CSS/JS loads and closes the popup immediately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use window.name to detect OAuth popup instead of query param

The allauth OAuth flow drops the `next` query param, so
`popup_close=1` never reaches the final redirect URL. Instead,
detect the popup by its window name ("scout-oauth") which persists
across all navigations including cross-origin OAuth redirects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: handle OAuth popup in widget SDK instead of delegating to host

The widget SDK now intercepts scout:auth-required and opens the OAuth
popup directly to the provider login URL with window.name="scout-oauth".
This bypasses the host app's auth overlay which was opening a popup to
the standalone Scout app instead of the OAuth endpoint. After the popup
closes, the iframe reloads to pick up the authenticated session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: optimize backend Dockerfile for code-only deploys + widget SDK OAuth popup

Dockerfile: Separate dependency install from code copy so Docker layer
cache skips the slow uv pip install step when only code changes (not
pyproject.toml/uv.lock). This mirrors the frontend Dockerfile pattern.

widget.js: Handle scout:auth-required directly in the SDK by opening
the OAuth popup to the provider login URL with window.name="scout-oauth".
Previously the host app (ConnectLabs) was opening a popup to standalone
Scout instead of the OAuth endpoint, bypassing all popup close logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: handle OAuth popup in widget SDK instead of delegating to host

Stop forwarding scout:auth-required to the host app (ConnectLabs).
The host was showing its own auth overlay that hid Scout's LoginForm,
then opening a popup to standalone Scout instead of the OAuth endpoint.

Now the widget SDK suppresses the event so Scout's own LoginForm stays
visible in the iframe. The LoginForm's popup OAuth flow handles auth
correctly — opening directly to the provider with window.name="scout-oauth"
and auto-closing via the inline script in index.html.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: set LOGIN_REDIRECT_URL to /scout/ in connectlabs settings

Django doesn't prepend FORCE_SCRIPT_NAME to LOGIN_REDIRECT_URL, so
after OAuth the popup was redirecting to / (ConnectLabs) instead of
/scout/ (Scout). Now it lands on Scout's index.html where the inline
script detects window.name="scout-oauth" and closes the popup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use cookie instead of window.name to detect OAuth popup

CommCare Connect's COOP headers clear window.name during cross-origin
OAuth redirects. Use a cookie instead — set before opening the popup,
checked in index.html when the popup returns to Scout after OAuth.
The cookie is cleared immediately after detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: restore popup_close query param check alongside cookie check

Allauth DOES preserve the next param through the OAuth flow, so the
popup lands at /scout/embed/?popup_close=1. Check both the cookie
and the query param to close the popup — the query param works when
allauth preserves next, the cookie is a fallback for when it doesn't.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: prevent Login CSRF by keeping SOCIALACCOUNT_LOGIN_ON_GET=False

Revert LOGIN_ON_GET to False to prevent Login CSRF attacks (attacker
can craft a link that initiates OAuth and logs victim into attacker's
account). Instead, the popup auto-submits the allauth confirmation
form via JS — the form has a CSRF token so it's protected, and the
user doesn't see the "Continue" button.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: simplify OAuth popup to match original proven approach

Reverts LOGIN_ON_GET to False (prevents Login CSRF) and simplifies
the popup flow back to the original pattern that worked reliably:

1. Embed LoginForm opens popup to standalone Scout (/scout/)
2. User sees normal Scout login form in popup, clicks OAuth provider
3. OAuth completes, popup loads authenticated Scout
4. App.tsx detects cookie + authenticated state -> window.close()
5. Iframe polls for popup close -> fetchMe() -> authenticated

Removes: inline index.html scripts, auto-form-submit hacks,
popup_close query params, window.name detection. Uses a simple
cookie to identify popup windows instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: re-enable SOCIALACCOUNT_LOGIN_ON_GET to skip allauth confirmation

The allauth confirmation page ("Sign In Via CommCare Connect" + Continue
button) adds no value — the user already clicked an OAuth button, and
the OAuth provider has its own authorize screen. Keeping it off caused
the popup to get stuck on an unstyled intermediate page.

Login CSRF risk is mitigated by the provider's own authorization screen
which requires explicit user consent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: always return arrays from artifact live queries and strengthen source_queries prompt (#1)

Two bugs found during artifact walkthrough testing:

1. Single-row query results caused `rows.map is not a function` because
   mergeQueryResults() converted single-row results to objects. Components
   always expect arrays. Now always returns arrays regardless of row count.

2. Agent sometimes embedded static data instead of using source_queries,
   breaking the "always-fresh data" contract. Strengthened the artifact
   prompt: CRITICAL language, removed the "< 5 rows" exception, added
   explicit "NEVER embed" instruction, updated docs to reflect arrays-only.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: embed sidebar navigation 404 — prefix routes with /embed

Sidebar links (/, /knowledge, /artifacts, etc.) didn't account for the
embed router's /embed prefix, causing React Router 404 on every click
in the ConnectLabs iframe. Detect embed mode via useEmbedParams and
prefix all navigation paths accordingly. Also add missing embed routes
(data-dictionary, settings/connections) and a catch-all redirect.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: inject user_id into MCP tool calls to prevent cross-user credential use

When multiple users share a workspace, run_materialization could pick up
a different user's OAuth token. Now user_id is injected server-side (same
as workspace_id) and used to filter the TenantMembership query, ensuring
each user's MCP calls use only their own credentials.

Addresses review feedback from @snopoke on PR #107.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: replace embed OAuth popup with direct OAuth + token relay

The embedded login had three compounding problems: the popup opened
Scout's root (showing the login form twice), no feedback while the
popup was open, and third-party cookie blocking prevented the session
from reaching the iframe.

Replace the cookie-based popup detection approach with the standard
embedded OAuth pattern: popup goes directly to the OAuth provider,
a lightweight callback page sends a signed one-time token via
postMessage, and the iframe exchanges it for a session with
SameSite=None cookies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: popup-complete redirect fallback and spinner persistence

Two fixes for the embed OAuth flow:

1. When allauth doesn't preserve the `next` URL through OAuth and the
   popup loads the full app, App.tsx now detects the same-origin opener
   and redirects to /auth/popup-complete/ automatically.

2. fetchMe() no longer sets authStatus to "loading" during re-checks
   (only on initial idle state). This prevents EmbedPage from unmounting
   LoginForm during visibilitychange, which was losing the popup spinner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: simplify embed OAuth to direct navigation (no popup)

The popup approach was unnecessary — Scout and connect-labs are on the
same domain, so there are no cross-site cookie issues. Replace the
entire popup/postMessage/token-exchange mechanism with a simple
target="_top" link that navigates the full page through OAuth, then
returns to the connect-labs embed page.

This removes ~290 lines: popup-complete view, token-exchange endpoint,
postMessage handling, window.opener detection, spinner state management.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove double BASE_PATH prefix on OAuth login URLs

The providers API was prepending FORCE_SCRIPT_NAME to login_url, but
the frontend already prepends BASE_PATH. This produced /scout/scout/...
URLs that hit the SPA fallback instead of reaching Django's allauth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(loaders): migrate Connect exports from streaming CSV to v2 paginated JSON

CommCare Connect deprecated its streaming CSV export endpoints in favor of
keyset-paginated JSON (Accept: application/json; version=2.0) to avoid the
5-minute gunicorn worker timeout on large UserVisit exports. Scout's seven
list-export loaders now follow the v2 contract: GET, then walk the `next`
URL until null.

Changes
- ConnectBaseLoader: add `_paginate_export_pages()` (yields page lists,
  follows `next`, sends versioned Accept header per-call so the metadata
  loader's non-versioned endpoints stay untouched), plus `stringify` /
  `stringify_record` helpers and `ConnectExportError`. Drop `_get_csv` and
  the csv/io imports.
- ConnectVisitLoader: rewrite `_normalize_visit` for v2 dict input, drop
  `ast.literal_eval` Python-repr fallback (form_json is now a real dict),
  preserve the `id` → `visit_id` rename, fall back to the loader's
  opportunity_id when the serializer omits it.
- 6 simple loaders (users, completed_works, payments, invoices,
  assessments, completed_modules): thin paginate-then-stringify loops.
  Now stream page-by-page rather than buffering the full export.

The downstream `_write_connect_*` writers in materializer.py are unchanged
— scalars are stringified to preserve the v1 CSV-shaped TEXT-column
contract; `form_json` and `images` pass through as native dict/list for
the JSONB columns.

Regression pin for dimagi/commcare-connect#1109
- Production has been observed returning `next` URLs with the `http://`
  scheme even on HTTPS requests (gunicorn `--forwarded-allow-ips` defaults
  to 127.0.0.1, strips `X-Forwarded-Proto`). Until #1109 lands, scout
  relies on requests' default `allow_redirects=True` to follow the edge's
  301 → https, and on `Session.should_strip_auth`'s same-host HTTP→HTTPS
  upgrade exception to preserve the bearer token across the redirect.
- Inline comment in `_paginate_export_pages` documents the dependency.
- New test `test_follows_http_to_https_redirect_on_next_url` simulates the
  bug end-to-end and asserts both row aggregation and bearer-token
  preservation.

Tests
- 56 loader tests pass (12 base + 13 visit + 31 simple-loader matrix).
- Full repo: 797 passed, 15 skipped, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(nginx): raise proxy_*_timeout to 600s on /scout/api/ for long materializations

Symptom: "network error" in the chat UI when running run_materialization
on a ~70k-row opportunity (tenant 765). CloudWatch
/ecs/labs-jj-scout-web 2026-04-08 10:30:35 UTC:

  [error] *3796 upstream timed out (110: Operation timed out)
  while reading upstream, request: "POST /scout/api/chat/",
  upstream: "http://127.0.0.1:8000/api/chat/"

Root cause: The /scout/api/ nginx location in frontend/nginx.prod.conf
sets proxy_buffering off and chunked_transfer_encoding on (correct for
SSE) but does not set proxy_read_timeout or proxy_send_timeout, so nginx
inherits its 60s defaults. The chat SSE stream emits no bytes during
synchronous MCP tool calls, and run_pipeline's step-based progress
callback only fires between sources — the first visits load runs ~2-4
minutes silent for a 70k opp. Nginx sees >60s of upstream silence and
kills the connection at ~10:30:35, right on the default. The backend
materialization keeps running, then crashes downstream with
anyio.ClosedResourceError trying to push MCP progress notifications
through the now-closed stream.

The v2 JSON pagination migration (commit 3ed1364) fixed the downstream
leg (scout → commcare-connect bounded per page), but the upstream leg
(client → nginx → scout) still runs synchronously inside one request.
We only caught this testing the v2 migration on a large opp — the
~8k-visit opp 874 fit inside the 60s window and masked the issue.

Fix: Set proxy_read_timeout and proxy_send_timeout to 600s on the
/scout/api/ location, matching the ALB idle_timeout.timeout_seconds we
already have configured on labs-jj-alb (verified via aws elbv2
describe-load-balancer-attributes). Nginx is no longer the bottleneck
up to ~10 minutes of synchronous materialization, which handles every
realistic opp size including the 70k-row tenant 765 case.

Deliberately NOT fixed here (follow-up work):
  - Gunicorn workers still block for the full materialization duration,
    reducing backend concurrency under load.
  - The chat SSE stream still has no heartbeat during tool calls, so
    users see a stalled UI for minutes with no feedback.
  - For opps that exceed 10 minutes of materialization, the architectural
    fix is to move run_materialization to Celery (scout already has
    REDIS_URL and a get_materialization_status tool) and have the chat
    agent poll for status. Tracked as upstream work.

This is a frontend-only rebuild — NGINX_CONF is baked into the frontend
image at build time via the Dockerfile.frontend NGINX_CONF build arg.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(loaders): native types into typed raw_* columns (address PR #127 feedback from @snopoke)

Simon's review on #127:

> Why is everything being converted to strings? We'll loose all the type
> information by doing this.
>
> The root issue is that the connect tables all have TEXT columns. We
> should update these tables to use the correct data types.

He's right. The original migration (commit 3ed1364) stringified v2 JSON
scalars in the loaders to preserve the existing TEXT-column writer
contract — a scope-discipline choice that kept the blast radius small
but punted on the data-modeling fix. This commit unpunts it.

Because the materializer's writers DROP+CREATE the raw_* tables on every
materialization, there is no schema migration or data backfill: the next
sync for any tenant recreates the tables with the new types automatically
inside the same transaction. (Per user: scout is still in heavy active
development, no migration handling required.)

Changes
- mcp_server/loaders/connect_base.py: delete `stringify` and
  `stringify_record` helpers. `_paginate_export_pages` and the rest of
  the base class are unchanged.
- 6 simple loaders (users, completed_works, payments, invoices,
  assessments, completed_modules): stop calling `stringify_record`, just
  `yield page` directly. Each loader now collapses to a 10-line
  pass-through over `_paginate_export_pages`.
- mcp_server/loaders/connect_visits.py: `_normalize_visit` stops wrapping
  scalars in `stringify()`. Keeps the `id` → `visit_id` rename, the
  `opportunity_id` fallback to the loader's own int, and the defensive
  dict/list coercion for `form_json`/`images`.
- mcp_server/services/materializer.py: DDL + insert tuples updated for
  all 7 raw_* tables. Per-table type choices:

    raw_visits:
      visit_id                       TEXT     -> BIGINT PRIMARY KEY
      opportunity_id                 TEXT     -> BIGINT
      visit_date                     TEXT     -> TIMESTAMPTZ
      flagged                        TEXT     -> BOOLEAN
      status_modified_date           TEXT     -> TIMESTAMPTZ
      review_created_on              TEXT     -> TIMESTAMPTZ
      date_created                   TEXT     -> TIMESTAMPTZ
    raw_users:
      date_learn_started             TEXT     -> TIMESTAMPTZ
      payment_accrued                TEXT     -> NUMERIC(14, 2)
      suspended                      TEXT     -> BOOLEAN
      suspension_date                TEXT     -> TIMESTAMPTZ
      invited_date                   TEXT     -> TIMESTAMPTZ
      completed_learn_date           TEXT     -> TIMESTAMPTZ
      last_active                    TEXT     -> TIMESTAMPTZ
      date_claimed                   TEXT     -> TIMESTAMPTZ
    raw_completed_works:
      opportunity_id                 TEXT     -> BIGINT
      last_modified                  TEXT     -> TIMESTAMPTZ
      status_modified_date           TEXT     -> TIMESTAMPTZ
      payment_date                   TEXT     -> TIMESTAMPTZ
      date_created                   TEXT     -> TIMESTAMPTZ
      saved_completed_count          TEXT     -> INTEGER
      saved_approved_count           TEXT     -> INTEGER
      saved_payment_accrued          TEXT     -> NUMERIC(14, 2)
      saved_payment_accrued_usd      TEXT     -> NUMERIC(14, 2)
      saved_org_payment_accrued      TEXT     -> NUMERIC(14, 2)
      saved_org_payment_accrued_usd  TEXT     -> NUMERIC(14, 2)
    raw_payments:
      opportunity_id                 TEXT     -> BIGINT
      created_at                     TEXT     -> TIMESTAMPTZ
      amount, amount_usd             TEXT     -> NUMERIC(14, 2)
      date_paid, confirmation_date   TEXT     -> TIMESTAMPTZ
      confirmed                      TEXT     -> BOOLEAN
    raw_invoices:
      opportunity_id                 TEXT     -> BIGINT
      amount, amount_usd             TEXT     -> NUMERIC(14, 2)
      date                           TEXT     -> DATE (invoices are day-level)
      exchange_rate                  TEXT     -> NUMERIC(14, 6) for sub-cent precision
    raw_assessments:
      opportunity_id                 TEXT     -> BIGINT
      date                           TEXT     -> TIMESTAMPTZ
      score, passing_score           TEXT     -> INTEGER
      passed                         TEXT     -> BOOLEAN
    raw_completed_modules:
      opportunity_id                 TEXT     -> BIGINT
      date                           TEXT     -> TIMESTAMPTZ
      duration                       TEXT     -> INTEGER

  String-typed columns stay TEXT. Business IDs with ambiguous shape
  (`deliver_unit_id`, `entity_id`, `payment_unit_id`, `completed_work_id`,
  `invoice_id`, `invoice_number`, etc.) stay TEXT — we don't control
  their upstream representation and they might be UUIDs or slugs.

  Insert tuples changed from `r.get("field", "")` to `r.get("field")` for
  typed columns so missing keys bind to SQL NULL via psycopg, rather than
  passing an empty string that would fail to cast. Text columns keep the
  "" default for backwards-compat with downstream code that might rely on
  empty-string-not-null.

- tests/test_connect_data_loaders.py: the stringification assertions
  flip to native-type assertions. `test_flagged_bool_stringified` becomes
  `test_flagged_bool_passes_through` (asserts `rows[0]["flagged"] is True`).
  `test_native_types_stringified` becomes `test_native_types_preserved`
  and verifies each record value round-trips with the same type. Two new
  visit tests pin specific risks:
    - `test_flagged_false_not_coerced_to_none` guards against the
      `False or default` falsy trap when handling nullable booleans.
    - `test_missing_datetime_field_is_none` verifies that a null
      datetime in the JSON becomes Python None (not ""), so psycopg
      binds it to SQL NULL against the TIMESTAMPTZ column.
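
The two behaviors those tests pin — the falsy-boolean trap and None-vs-empty-string binding — can be illustrated with a minimal sketch (field names taken from the visit record; the helper itself is hypothetical):

```python
def normalize(record: dict) -> dict:
    # WRONG: `record.get("flagged") or None` would collapse a legitimate
    # False into None -- the falsy trap the first test guards against.
    # RIGHT: pass values through untouched; only a missing key yields None,
    # which psycopg binds as SQL NULL. An "" default would instead try to
    # cast an empty string against TIMESTAMPTZ and fail.
    return {
        "flagged": record.get("flagged"),        # False stays False
        "visit_date": record.get("visit_date"),  # missing key -> None -> NULL
    }

row = normalize({"flagged": False})  # no visit_date key in this record
```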

Downstream audit: grepped the full codebase for SQL that reads raw_*
assuming TEXT values. No DBT assets, no recipes, no agent prompt
templates, no Python result-processing code had TEXT assumptions. The
chat agent's LLM-generated SQL will actually improve — typed columns
mean it no longer needs to wrap amounts in ::numeric or compare
flagged = 'True' as string literals.
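
A before/after illustration of that simplification (queries are hypothetical; table and column names come from the commit):

```python
# Before: TEXT columns forced the LLM-generated SQL to cast amounts and
# compare booleans as string literals.
before = "SELECT SUM(amount::numeric) FROM raw_payments WHERE confirmed = 'True'"

# After: with NUMERIC and BOOLEAN columns the same query reads naturally.
after = "SELECT SUM(amount) FROM raw_payments WHERE confirmed"
```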

Verification
- 58 loader tests pass (was 56; added the two new visit tests).
- Full repo suite: 800 passed, 15 skipped, 0 regressions (was 798 pre-change).
- ruff check + format clean.
- Manual review of all 7 writer DDLs and insert tuples.
- Cannot unit-test the writer SQL without the test DB executing real
  INSERT statements, so the typed-column binding will be validated by
  the post-deploy production materialization retry on opp 874.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(materializer): correct column types for fields I got wrong in 234d043

Production materialization on tenant 765 crashed with
``psycopg.ProgrammingError: cannot adapt type 'dict' using placeholder '%s'``
(CloudWatch /ecs/labs-jj-scout-web 2026-04-09 20:18:57 UTC). The traceback
points at the executemany call in _write_connect_visits (line 604).

Root cause: when I did the typed-columns refactor in 234d043 I only read
the UserVisit model far enough to see ``flagged`` and ``visit_date`` and
assumed the rest were plain CharFields. Wrong. A full walkthrough of
``commcare_connect/data_export/serializer.py`` and the underlying Django
models reveals 15 column type mismatches across all 7 raw_* tables. Every
one would crash or store wrong data on some tenant. My smoke test on opp
874 only worked because that opp had no flagged visits (so flag_reason
was always None and never hit psycopg's adapt check).

The 15 fixes:

raw_visits (5):
  flag_reason       TEXT -> JSONB     # Django JSONField, DRF returns dict
  deliver_unit      TEXT -> BIGINT    # ForeignKey, DRF renders PK int
  completed_work    TEXT -> BIGINT    # ForeignKey, DRF renders PK int
  completed_work_id TEXT -> BIGINT    # raw FK _id column, int
  deliver_unit_id   TEXT -> BIGINT    # raw FK _id column, int

raw_users (1):
  claim_limits      TEXT -> JSONB     # SerializerMethodField -> list[dict]

raw_completed_works (1):
  payment_unit_id   TEXT -> BIGINT    # raw FK _id column, int

raw_payments (2):
  payment_unit      TEXT -> BIGINT    # ForeignKey, int
  invoice_id        TEXT -> BIGINT    # raw FK _id column, int

raw_invoices (2):
  service_delivery  TEXT -> BOOLEAN   # actually PaymentInvoice.service_delivery
                                      # is a BooleanField, not a text label
  exchange_rate     NUMERIC -> BIGINT # actually a ForeignKey to
                                      # ExchangeRate lookup table, so it's
                                      # the FK PK not the numeric rate. My
                                      # 234d043 NUMERIC(14,6) guess was
                                      # semantically wrong too.

raw_assessments (1):
  app               TEXT -> BIGINT    # ForeignKey to CommCareApp, int

raw_completed_modules (2):
  module            TEXT -> BIGINT    # ForeignKey to LearnModule, int
  duration          INTEGER -> TEXT   # Django DurationField, DRF serializes
                                      # as string "H:MM:SS". Going to TEXT
                                      # for now — INTERVAL is the honest
                                      # type but needs more validation
                                      # against real payloads.
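
For reference, the "H:MM:SS" DurationField string can be parsed back to a timedelta on the Python side when needed. A sketch only — the column itself stays TEXT for now, and this handles the simple hours form from the commit, not multi-day durations:

```python
from datetime import timedelta

def parse_duration(value: str) -> timedelta:
    # DRF serializes a Django DurationField via str(timedelta), e.g. "0:30:00".
    # Multi-day values ("1 day, 0:30:00") are not handled in this sketch.
    hours, minutes, seconds = value.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=int(seconds))
```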

Writer changes
- New helper ``_json_or_none(value)`` in materializer.py for nullable
  JSONB columns. Preserves Python None → SQL NULL (json.dumps(None)
  would produce the string "null" which inserts as JSONB null, not
  SQL NULL — different semantics).
- ``_write_connect_visits`` uses ``_json_or_none`` for ``flag_reason``.
- ``_write_connect_users`` uses ``_json_or_none`` for ``claim_limits``.
- All FK/_id fields drop the ``r.get("field", "")`` empty-string default
  in favor of ``r.get("field")`` so missing keys bind to SQL NULL via
  psycopg (same pattern used for the already-typed columns in 234d043).

Test fixture updates
- ``_visit_record`` in test_connect_data_loaders.py now matches the real
  DRF output: deliver_unit/completed_work/_id fields are ints, flag_reason
  is None by default (populated in the new dict test).
- ``SIMPLE_LOADER_CASES`` matrix updated to match real serializer output
  per a walkthrough of serializer.py. payment_accrued is an IntegerField
  so it's now ``100`` not ``"100.00"``; claim_limits is now a list of
  dicts from get_claim_limits; service_delivery is a bool; exchange_rate
  is an int (FK); duration is the Django DurationField string "0:30:00";
  all FK fields across the matrix are now ints.
- Two new regression tests in TestConnectVisitLoader:
    test_flag_reason_dict_passes_through — pins the tenant 765 bug
    test_fk_ids_are_ints — pins the FK-as-int loader contract

Verification
- 60 loader tests pass (was 58; added the two visit regression pins).
- Full repo suite: 802 passed, 15 skipped, 0 regressions.
- ruff check + format clean.
- Manual walkthrough of every serializer field against the Django model
  + DRF field class to confirm the type mapping for each of the 15 fixes.

Gap I want to own: there's still no integration test that runs the
writer against a real Postgres and round-trips typed data. 234d043
shipped because the loader tests used requests_mock and never exercised
executemany. Adding a writer smoke test suite is the right follow-up so
the next bug like this fails in CI, not in production CloudWatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jonathan Jackson <jjackson@Jonathans-MBP.localdomain>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>