fix: add --forwarded-allow-ips to gunicorn so Django sees HTTPS scheme #1109

jjackson wants to merge 1 commit into dimagi:main
Conversation
Gunicorn defaults `--forwarded-allow-ips` to `127.0.0.1`, meaning it only trusts `X-Forwarded-*` headers from loopback. In the containerized Kamal + Traefik deployment, Traefik runs in a separate container, so its source IP is NOT 127.0.0.1, and gunicorn silently strips the `X-Forwarded-Proto: https` header before handing the request to Django.

Downstream effects:

- `request.scheme` returns `'http'` even though the client request was HTTPS
- `request.is_secure()` returns `False`
- `request.build_absolute_uri()` emits URLs with the `http://` scheme
- Any code path that constructs canonical URLs from the request is broken

This manifested during the Labs v2 paginated-JSON export rollout: the `IdKeysetPagination.get_next_link()` method uses `request.build_absolute_uri()` to construct the `next` URL, and was emitting `http://connect.dimagi.com/...` in paginated responses. Clients following the link hit the edge's canonical HTTPS redirect and received a 301, which broke any HTTP client that didn't have `follow_redirects=True`. The Labs client worked around it with redirect following, but the root cause is here.

This is also latent for any future feature on the Connect side that relies on `build_absolute_uri()`, `is_secure()`, or any middleware that consumes the scheme (e.g., generating password-reset emails with absolute URLs, building webhook callback URLs, or any view that needs a canonical URL to render in a response).

The fix is to tell gunicorn to trust `X-Forwarded-*` headers from any source. This is safe because gunicorn binds to `0.0.0.0:8000` inside a Docker network where only Traefik can reach it — there is no path for an untrusted source to inject forwarded headers. This is also the standard pattern for gunicorn-behind-a-reverse-proxy setups.
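The change itself is one flag on the gunicorn invocation. A sketch of the relevant line in `docker/start` — the WSGI module path and the other flags are placeholders, only `--forwarded-allow-ips` is the change this PR describes:

```shell
# docker/start (sketch; bind address from the description above, other args assumed)
exec gunicorn config.wsgi:application \
    --bind 0.0.0.0:8000 \
    --forwarded-allow-ips="*"
```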
Reference: https://docs.gunicorn.org/en/stable/settings.html#forwarded-allow-ips

## Evidence of the bug

A direct reproduction against production `connect.dimagi.com` with the v2 JSON export API:

```python
import httpx

r = httpx.get(
    'https://connect.dimagi.com/export/opportunity/<id>/user_visits/',
    headers={'Authorization': 'Bearer ...', 'Accept': 'application/json; version=2.0'},
)
# Status: 200
# Response body:
# {"next": "http://connect.dimagi.com/export/opportunity/<id>/user_visits/?last_id=...", ...}
#           ^^^^ should be https
```

The server accepted the HTTPS request correctly but built the `next` URL with `http://` because Django thought the original request was HTTP.

## Verification steps

After this change is deployed:

1. Request any paginated endpoint: `GET https://connect.dimagi.com/export/opportunity/<id>/user_visits/` with the `Accept: application/json; version=2.0` header.
2. Verify the `next` field in the response body starts with `https://`, not `http://`.
3. Follow the `next` URL in a second request and confirm it returns 200 directly (not 301).
4. (Optional) Confirm `request.scheme == 'https'` in a Django view by adding a temporary logging statement, or via the Django admin.
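The header-stripping behavior can be modeled in a few lines. This is a toy model of the trust check, not gunicorn's actual implementation; the peer IP `172.18.0.2` is a made-up stand-in for Traefik's container address:

```python
def effective_scheme(peer_ip, headers, forwarded_allow_ips=("127.0.0.1",)):
    """Toy model: honor X-Forwarded-Proto only from trusted peer IPs,
    mirroring gunicorn's default of trusting loopback alone."""
    trusted = "*" in forwarded_allow_ips or peer_ip in forwarded_allow_ips
    return headers.get("X-Forwarded-Proto", "http") if trusted else "http"

hdrs = {"X-Forwarded-Proto": "https"}
assert effective_scheme("172.18.0.2", hdrs) == "http"           # Traefik's container IP is untrusted
assert effective_scheme("127.0.0.1", hdrs) == "https"           # loopback is trusted by default
assert effective_scheme("172.18.0.2", hdrs, ("*",)) == "https"  # after --forwarded-allow-ips="*"
```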
* feat(loaders): migrate Connect exports from streaming CSV to v2 paginated JSON

  CommCare Connect deprecated its streaming CSV export endpoints in favor of keyset-paginated JSON (Accept: application/json; version=2.0) to avoid the 5-minute gunicorn worker timeout on large UserVisit exports. Scout's seven list-export loaders now follow the v2 contract: GET, then walk the `next` URL until null.

  Changes
  - ConnectBaseLoader: add `_paginate_export_pages()` (yields page lists, follows `next`, sends the versioned Accept header per-call so the metadata loader's non-versioned endpoints stay untouched), plus `stringify` / `stringify_record` helpers and `ConnectExportError`. Drop `_get_csv` and the csv/io imports.
  - ConnectVisitLoader: rewrite `_normalize_visit` for v2 dict input, drop the `ast.literal_eval` Python-repr fallback (form_json is now a real dict), preserve the `id` → `visit_id` rename, fall back to the loader's opportunity_id when the serializer omits it.
  - 6 simple loaders (users, completed_works, payments, invoices, assessments, completed_modules): thin paginate-then-stringify loops. Now stream page-by-page rather than buffering the full export.

  The downstream `_write_connect_*` writers in materializer.py are unchanged — scalars are stringified to preserve the v1 CSV-shaped TEXT-column contract; `form_json` and `images` pass through as native dict/list for the JSONB columns.

  Regression pin for dimagi/commcare-connect#1109
  - Production has been observed returning `next` URLs with the `http://` scheme even on HTTPS requests (gunicorn `--forwarded-allow-ips` defaults to 127.0.0.1, strips `X-Forwarded-Proto`). Until #1109 lands, scout relies on requests' default `allow_redirects=True` to follow the edge's 301 → https, and on `Session.should_strip_auth`'s same-host HTTP→HTTPS upgrade exception to preserve the bearer token across the redirect.
  - Inline comment in `_paginate_export_pages` documents the dependency.
  - New test `test_follows_http_to_https_redirect_on_next_url` simulates the bug end-to-end and asserts both row aggregation and bearer-token preservation.

  Tests
  - 56 loader tests pass (12 base + 13 visit + 31 simple-loader matrix).
  - Full repo: 797 passed, 15 skipped, 0 regressions.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
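The v2 contract the loaders follow ("GET, then walk the `next` URL until null") reduces to a small loop. A sketch under assumptions — the real `_paginate_export_pages` also sends the versioned Accept header and auth; `fetch` here is a stand-in for the HTTP call, and the page shapes are illustrative:

```python
from typing import Callable, Iterator

def paginate_export_pages(url: str, fetch: Callable[[str], dict]) -> Iterator[list]:
    # Follow keyset pagination: yield each page's results,
    # then GET the `next` URL until it comes back null/None.
    while url is not None:
        page = fetch(url)
        yield page["results"]
        url = page.get("next")

# Toy two-page export standing in for the real endpoint:
pages = {
    "/export/": {"results": [{"id": 1}, {"id": 2}], "next": "/export/?last_id=2"},
    "/export/?last_id=2": {"results": [{"id": 3}], "next": None},
}
rows = [r for page in paginate_export_pages("/export/", pages.__getitem__) for r in page]
assert [r["id"] for r in rows] == [1, 2, 3]
```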
* feat: add BASE_PATH config and Vite base path support

* feat: add BASE_PATH basename to routers and route detection

* feat: add popup OAuth flow for embed iframe authentication

* feat: add iframe auto-resize via ResizeObserver and postMessage

* feat: add BASE_PATH and nginx config args to frontend Dockerfile

  Add VITE_BASE_PATH and NGINX_CONF build args to Dockerfile.frontend so connect-labs can build with the /scout/ prefix. Add nginx.prod.conf with /scout/ prefix stripping for the production deploy.

* fix: skip allauth OAuth confirmation page

  Set SOCIALACCOUNT_LOGIN_ON_GET = True so clicking an OAuth button goes straight to the provider instead of showing an unstyled intermediate confirmation page.

* fix: skip email requirement for OAuth signups

  CommCare Connect's API doesn't return an email, causing allauth to show an unstyled signup form asking for one. Set SOCIALACCOUNT_EMAIL_REQUIRED = False so OAuth users are auto-created without needing an email address.

* feat: add connectlabs Django settings module

  Required by the ECS task definition, which sets DJANGO_SETTINGS_MODULE=config.settings.connectlabs. Inherits production settings with ALB-specific overrides.

* fix: prefix all API fetch calls with BASE_PATH

  The API client, chat transport, public pages, and onboarding wizard all used bare /api/ and /accounts/ paths. Under a subpath deploy like /scout/, these 404 because nginx only proxies /scout/api/ to the backend.

* fix: prefix OAuth login URLs with BASE_PATH

  The providers API returns bare /accounts/... URLs which 404 under a subpath deploy. Prefix both the href and the next redirect with BASE_PATH.

* chore: add labs smoke test and embed enhancements plan

  - Playwright smoke test for the labs.connect.dimagi.com end-to-end flow
  - Implementation plan for embed enhancements (reference doc)

* fix: address PR #95 review comments

  - Replace RegExp-based BASE_PATH stripping with startsWith/slice to avoid regex metacharacter bugs (e.g. dots in /scout.v2)
  - Revert SOCIALACCOUNT_LOGIN_ON_GET to False to prevent Login CSRF (the popup OAuth flow doesn't need it)

* fix: allow nullable email for OAuth users without email addresses

  CommCare Connect users may not have an email. The User model had email as unique+required, so the second email-less user caused a UniqueViolation on the empty string. Now email is nullable — empty values are stored as NULL (PostgreSQL allows multiple NULLs in a unique column). Includes a data migration to convert existing empty emails.

* fix: re-enable SOCIALACCOUNT_LOGIN_ON_GET for popup OAuth flow

  The popup is user-initiated, so Login CSRF isn't a concern. With this set to False, users see an unstyled allauth confirmation page inside the popup before being redirected to the provider.

* fix: remove window.opener check for OAuth popup close

  Cross-origin OAuth redirects (Scout → CommCare Connect → Scout) can clear window.opener in modern browsers, preventing the popup from auto-closing after login.

* fix: prefix popup OAuth URL with BASE_PATH

  The popup OAuth URL was missing the BASE_PATH prefix (/scout), so the popup navigated to the wrong path and fell through to ConnectLabs instead of Scout's allauth login endpoint.

* fix: close OAuth popup instantly via inline script in index.html

  The popup was loading the full React app before the popup_close check ran, causing a visible flash. An inline script in index.html runs before any CSS/JS loads and closes the popup immediately.

* fix: use window.name to detect OAuth popup instead of query param

  The allauth OAuth flow drops the `next` query param, so `popup_close=1` never reaches the final redirect URL. Instead, detect the popup by its window name ("scout-oauth"), which persists across all navigations including cross-origin OAuth redirects.

* fix: handle OAuth popup in widget SDK instead of delegating to host

  The widget SDK now intercepts scout:auth-required and opens the OAuth popup directly to the provider login URL with window.name="scout-oauth". This bypasses the host app's auth overlay, which was opening a popup to the standalone Scout app instead of the OAuth endpoint. After the popup closes, the iframe reloads to pick up the authenticated session.

* fix: optimize backend Dockerfile for code-only deploys + widget SDK OAuth popup

  Dockerfile: Separate dependency install from code copy so the Docker layer cache skips the slow uv pip install step when only code changes (not pyproject.toml/uv.lock). This mirrors the frontend Dockerfile pattern.

  widget.js: Handle scout:auth-required directly in the SDK by opening the OAuth popup to the provider login URL with window.name="scout-oauth". Previously the host app (ConnectLabs) was opening a popup to standalone Scout instead of the OAuth endpoint, bypassing all popup close logic.

* fix: handle OAuth popup in widget SDK instead of delegating to host

  Stop forwarding scout:auth-required to the host app (ConnectLabs). The host was showing its own auth overlay that hid Scout's LoginForm, then opening a popup to standalone Scout instead of the OAuth endpoint. Now the widget SDK suppresses the event so Scout's own LoginForm stays visible in the iframe. The LoginForm's popup OAuth flow handles auth correctly — opening directly to the provider with window.name="scout-oauth" and auto-closing via the inline script in index.html.

* fix: set LOGIN_REDIRECT_URL to /scout/ in connectlabs settings

  Django doesn't prepend FORCE_SCRIPT_NAME to LOGIN_REDIRECT_URL, so after OAuth the popup was redirecting to / (ConnectLabs) instead of /scout/ (Scout). Now it lands on Scout's index.html, where the inline script detects window.name="scout-oauth" and closes the popup.

* fix: use cookie instead of window.name to detect OAuth popup

  CommCare Connect's COOP headers clear window.name during cross-origin OAuth redirects. Use a cookie instead — set before opening the popup, checked in index.html when the popup returns to Scout after OAuth. The cookie is cleared immediately after detection.

* fix: restore popup_close query param check alongside cookie check

  Allauth DOES preserve the next param through the OAuth flow, so the popup lands at /scout/embed/?popup_close=1. Check both the cookie and the query param to close the popup — the query param works when allauth preserves next, the cookie is a fallback for when it doesn't.

* fix: prevent Login CSRF by keeping SOCIALACCOUNT_LOGIN_ON_GET=False

  Revert LOGIN_ON_GET to False to prevent Login CSRF attacks (an attacker can craft a link that initiates OAuth and logs the victim into the attacker's account). Instead, the popup auto-submits the allauth confirmation form via JS — the form has a CSRF token so it's protected, and the user doesn't see the "Continue" button.

* fix: simplify OAuth popup to match original proven approach

  Reverts LOGIN_ON_GET to False (prevents Login CSRF) and simplifies the popup flow back to the original pattern that worked reliably:

  1. Embed LoginForm opens popup to standalone Scout (/scout/)
  2. User sees normal Scout login form in popup, clicks OAuth provider
  3. OAuth completes, popup loads authenticated Scout
  4. App.tsx detects cookie + authenticated state -> window.close()
  5. Iframe polls for popup close -> fetchMe() -> authenticated

  Removes: inline index.html scripts, auto-form-submit hacks, popup_close query params, window.name detection. Uses a simple cookie to identify popup windows instead.

* fix: re-enable SOCIALACCOUNT_LOGIN_ON_GET to skip allauth confirmation

  The allauth confirmation page ("Sign In Via CommCare Connect" + Continue button) adds no value — the user already clicked an OAuth button, and the OAuth provider has its own authorize screen. Keeping it off caused the popup to get stuck on an unstyled intermediate page. Login CSRF risk is mitigated by the provider's own authorization screen, which requires explicit user consent.
* fix: always return arrays from artifact live queries and strengthen source_queries prompt (#1)

  Two bugs found during artifact walkthrough testing:

  1. Single-row query results caused `rows.map is not a function` because mergeQueryResults() converted single-row results to objects. Components always expect arrays. Now always returns arrays regardless of row count.
  2. The agent sometimes embedded static data instead of using source_queries, breaking the "always-fresh data" contract. Strengthened the artifact prompt: CRITICAL language, removed the "< 5 rows" exception, added an explicit "NEVER embed" instruction, updated docs to reflect arrays-only.

* fix: embed sidebar navigation 404 — prefix routes with /embed

  Sidebar links (/, /knowledge, /artifacts, etc.) didn't account for the embed router's /embed prefix, causing a React Router 404 on every click in the ConnectLabs iframe. Detect embed mode via useEmbedParams and prefix all navigation paths accordingly. Also add missing embed routes (data-dictionary, settings/connections) and a catch-all redirect.

* fix: inject user_id into MCP tool calls to prevent cross-user credential use

  When multiple users share a workspace, run_materialization could pick up a different user's OAuth token. Now user_id is injected server-side (same as workspace_id) and used to filter the TenantMembership query, ensuring each user's MCP calls use only their own credentials.

  Addresses review feedback from @snopoke on PR #107.

* fix: replace embed OAuth popup with direct OAuth + token relay

  The embedded login had three compounding problems: the popup opened Scout's root (showing the login form twice), there was no feedback while the popup was open, and third-party cookie blocking prevented the session from reaching the iframe. Replace the cookie-based popup detection approach with the standard embedded OAuth pattern: the popup goes directly to the OAuth provider, a lightweight callback page sends a signed one-time token via postMessage, and the iframe exchanges it for a session with SameSite=None cookies.

* fix: popup-complete redirect fallback and spinner persistence

  Two fixes for the embed OAuth flow:

  1. When allauth doesn't preserve the `next` URL through OAuth and the popup loads the full app, App.tsx now detects the same-origin opener and redirects to /auth/popup-complete/ automatically.
  2. fetchMe() no longer sets authStatus to "loading" during re-checks (only on the initial idle state). This prevents EmbedPage from unmounting LoginForm during visibilitychange, which was losing the popup spinner.

* fix: simplify embed OAuth to direct navigation (no popup)

  The popup approach was unnecessary — Scout and connect-labs are on the same domain, so there are no cross-site cookie issues. Replace the entire popup/postMessage/token-exchange mechanism with a simple target="_top" link that navigates the full page through OAuth, then returns to the connect-labs embed page. This removes ~290 lines: popup-complete view, token-exchange endpoint, postMessage handling, window.opener detection, spinner state management.
* fix: remove double BASE_PATH prefix on OAuth login URLs

  The providers API was prepending FORCE_SCRIPT_NAME to login_url, but the frontend already prepends BASE_PATH. This produced /scout/scout/... URLs that hit the SPA fallback instead of reaching Django's allauth.

* feat(loaders): migrate Connect exports from streaming CSV to v2 paginated JSON

  (Full commit message above.)

* fix(nginx): raise proxy_*_timeout to 600s on /scout/api/ for long materializations

  Symptom: "network error" in the chat UI when running run_materialization on a ~70k-row opportunity (tenant 765). CloudWatch /ecs/labs-jj-scout-web 2026-04-08 10:30:35 UTC:

  [error] *3796 upstream timed out (110: Operation timed out) while reading upstream, request: "POST /scout/api/chat/", upstream: "http://127.0.0.1:8000/api/chat/"

  Root cause: The /scout/api/ nginx location in frontend/nginx.prod.conf sets proxy_buffering off and chunked_transfer_encoding on (correct for SSE) but does not set proxy_read_timeout or proxy_send_timeout, so nginx inherits its 60s defaults. The chat SSE stream emits no bytes during synchronous MCP tool calls, and run_pipeline's step-based progress callback only fires between sources — the first visits load runs ~2-4 minutes silent for a 70k opp. Nginx sees >60s of upstream silence and kills the connection at ~10:30:35, right on the default.
  The backend materialization keeps running, then crashes downstream with anyio.ClosedResourceError trying to push MCP progress notifications through the now-closed stream.

  The v2 JSON pagination migration (commit 3ed1364) fixed the downstream leg (scout → commcare-connect, bounded per page), but the upstream leg (client → nginx → scout) still runs synchronously inside one request. We only caught this testing the v2 migration on a large opp — the ~8k-visit opp 874 fit inside the 60s window and masked the issue.

  Fix: Set proxy_read_timeout and proxy_send_timeout to 600s on the /scout/api/ location, matching the ALB idle_timeout.timeout_seconds we already have configured on labs-jj-alb (verified via aws elbv2 describe-load-balancer-attributes). Nginx is no longer the bottleneck up to ~10 minutes of synchronous materialization, which handles every realistic opp size including the 70k-row tenant 765 case.

  Deliberately NOT fixed here (follow-up work):
  - Gunicorn workers still block for the full materialization duration, reducing backend concurrency under load.
  - The chat SSE stream still has no heartbeat during tool calls, so users see a stalled UI for minutes with no feedback.
  - For opps that exceed 10 minutes of materialization, the architectural fix is to move run_materialization to Celery (scout already has REDIS_URL and a get_materialization_status tool) and have the chat agent poll for status. Tracked as upstream work.

  This is a frontend-only rebuild — NGINX_CONF is baked into the frontend image at build time via the Dockerfile.frontend NGINX_CONF build arg.

* refactor(loaders): native types into typed raw_* columns (address PR #127 feedback from @snopoke)

  Simon's review on #127:

  > Why is everything being converted to strings? We'll loose all the type
  > information by doing this.
  >
  > The root issue is that the connect tables all have TEXT columns. We
  > should update these tables to use the correct data types.

  He's right. The original migration (commit 3ed1364) stringified v2 JSON scalars in the loaders to preserve the existing TEXT-column writer contract — a scope-discipline choice that kept the blast radius small but punted on the data-modeling fix. This commit unpunts it.

  Because the materializer's writers DROP+CREATE the raw_* tables on every materialization, there is no schema migration or data backfill: the next sync for any tenant recreates the tables with the new types automatically inside the same transaction. (Per user: scout is still in heavy active development, no migration handling required.)

  Changes
  - mcp_server/loaders/connect_base.py: delete the `stringify` and `stringify_record` helpers. `_paginate_export_pages` and the rest of the base class are unchanged.
  - 6 simple loaders (users, completed_works, payments, invoices, assessments, completed_modules): stop calling `stringify_record`, just `yield page` directly. Each loader now collapses to a 10-line pass-through over `_paginate_export_pages`.
  - mcp_server/loaders/connect_visits.py: `_normalize_visit` stops wrapping scalars in `stringify()`. Keeps the `id` → `visit_id` rename, the `opportunity_id` fallback to the loader's own int, and the defensive dict/list coercion for `form_json`/`images`.
  - mcp_server/services/materializer.py: DDL + insert tuples updated for all 7 raw_* tables.
  Per-table type choices:

  raw_visits:
  - visit_id: TEXT -> BIGINT PRIMARY KEY
  - opportunity_id: TEXT -> BIGINT
  - visit_date: TEXT -> TIMESTAMPTZ
  - flagged: TEXT -> BOOLEAN
  - status_modified_date: TEXT -> TIMESTAMPTZ
  - review_created_on: TEXT -> TIMESTAMPTZ
  - date_created: TEXT -> TIMESTAMPTZ

  raw_users:
  - date_learn_started: TEXT -> TIMESTAMPTZ
  - payment_accrued: TEXT -> NUMERIC(14, 2)
  - suspended: TEXT -> BOOLEAN
  - suspension_date: TEXT -> TIMESTAMPTZ
  - invited_date: TEXT -> TIMESTAMPTZ
  - completed_learn_date: TEXT -> TIMESTAMPTZ
  - last_active: TEXT -> TIMESTAMPTZ
  - date_claimed: TEXT -> TIMESTAMPTZ

  raw_completed_works:
  - opportunity_id: TEXT -> BIGINT
  - last_modified: TEXT -> TIMESTAMPTZ
  - status_modified_date: TEXT -> TIMESTAMPTZ
  - payment_date: TEXT -> TIMESTAMPTZ
  - date_created: TEXT -> TIMESTAMPTZ
  - saved_completed_count: TEXT -> INTEGER
  - saved_approved_count: TEXT -> INTEGER
  - saved_payment_accrued: TEXT -> NUMERIC(14, 2)
  - saved_payment_accrued_usd: TEXT -> NUMERIC(14, 2)
  - saved_org_payment_accrued: TEXT -> NUMERIC(14, 2)
  - saved_org_payment_accrued_usd: TEXT -> NUMERIC(14, 2)

  raw_payments:
  - opportunity_id: TEXT -> BIGINT
  - created_at: TEXT -> TIMESTAMPTZ
  - amount, amount_usd: TEXT -> NUMERIC(14, 2)
  - date_paid, confirmation_date: TEXT -> TIMESTAMPTZ
  - confirmed: TEXT -> BOOLEAN

  raw_invoices:
  - opportunity_id: TEXT -> BIGINT
  - amount, amount_usd: TEXT -> NUMERIC(14, 2)
  - date: TEXT -> DATE (invoices are day-level)
  - exchange_rate: TEXT -> NUMERIC(14, 6) for sub-cent precision

  raw_assessments:
  - opportunity_id: TEXT -> BIGINT
  - date: TEXT -> TIMESTAMPTZ
  - score, passing_score: TEXT -> INTEGER
  - passed: TEXT -> BOOLEAN

  raw_completed_modules:
  - opportunity_id: TEXT -> BIGINT
  - date: TEXT -> TIMESTAMPTZ
  - duration: TEXT -> INTEGER

  String-typed columns stay TEXT. Business IDs with ambiguous shape (`deliver_unit_id`, `entity_id`, `payment_unit_id`, `completed_work_id`, `invoice_id`, `invoice_number`, etc.) stay TEXT — we don't control their upstream representation and they might be UUIDs or slugs.
  Insert tuples changed from `r.get("field", "")` to `r.get("field")` for typed columns so missing keys bind to SQL NULL via psycopg, rather than passing an empty string that would fail to cast. Text columns keep the "" default for backwards-compat with downstream code that might rely on empty-string-not-null.

  - tests/test_connect_data_loaders.py: the stringification assertions flip to native-type assertions. `test_flagged_bool_stringified` becomes `test_flagged_bool_passes_through` (asserts `rows[0]["flagged"] is True`). `test_native_types_stringified` becomes `test_native_types_preserved` and verifies each record value round-trips with the same type.

  Two new visit tests pin specific risks:
  - `test_flagged_false_not_coerced_to_none` guards against the `False or default` falsy trap when handling nullable booleans.
  - `test_missing_datetime_field_is_none` verifies that a null datetime in the JSON becomes Python None (not ""), so psycopg binds it to SQL NULL against the TIMESTAMPTZ column.

  Downstream audit: grepped the full codebase for SQL that reads raw_* assuming TEXT values. No DBT assets, no recipes, no agent prompt templates, no Python result-processing code had TEXT assumptions. The chat agent's LLM-generated SQL will actually improve — typed columns mean it no longer needs to wrap amounts in ::numeric or compare flagged = 'True' as a string literal.

  Verification
  - 58 loader tests pass (was 56; added the two new visit tests).
  - Full repo suite: 800 passed, 15 skipped, 0 regressions (was 798 pre-change).
  - ruff check + format clean.
  - Manual review of all 7 writer DDLs and insert tuples.
  - Cannot unit-test the writer SQL without the test DB executing real INSERT statements, so the typed-column binding will be validated by the post-deploy production materialization retry on opp 874.
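The `r.get("field")` vs `r.get("field", "")` distinction matters because psycopg maps Python None to SQL NULL, while an empty string is sent as a literal `''` that Postgres cannot cast to TIMESTAMPTZ. A minimal illustration, no database required; the record shape is hypothetical:

```python
record = {"visit_id": 42}  # serializer omitted "visit_date" on this row

# typed column: a missing key must surface as None so psycopg binds SQL NULL
typed = (record.get("visit_id"), record.get("visit_date"))
assert typed == (42, None)

# the old empty-string default would have sent '' — a value that fails
# to cast against a TIMESTAMPTZ column at INSERT time
legacy = record.get("visit_date", "")
assert legacy == ""
```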
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(materializer): correct column types for fields I got wrong in 234d043 Production materialization on tenant 765 crashed with ``psycopg.ProgrammingError: cannot adapt type 'dict' using placeholder '%s'`` (CloudWatch /ecs/labs-jj-scout-web 2026-04-09 20:18:57 UTC). The traceback points at _write_connect_visits line 604 executemany. Root cause: when I did the typed-columns refactor in 234d043 I only read the UserVisit model far enough to see ``flagged`` and ``visit_date`` and assumed the rest were plain CharFields. Wrong. A full walkthrough of ``commcare_connect/data_export/serializer.py`` and the underlying Django models reveals 15 column type mismatches across all 7 raw_* tables. Every one would crash or store wrong data on some tenant. My smoke test on opp 874 only worked because that opp had no flagged visits (so flag_reason was always None and never hit psycopg's adapt check). The 15 fixes: raw_visits (5): flag_reason TEXT -> JSONB # Django JSONField, DRF returns dict deliver_unit TEXT -> BIGINT # ForeignKey, DRF renders PK int completed_work TEXT -> BIGINT # ForeignKey, DRF renders PK int completed_work_id TEXT -> BIGINT # raw FK _id column, int deliver_unit_id TEXT -> BIGINT # raw FK _id column, int raw_users (1): claim_limits TEXT -> JSONB # SerializerMethodField -> list[dict] raw_completed_works (1): payment_unit_id TEXT -> BIGINT # raw FK _id column, int raw_payments (2): payment_unit TEXT -> BIGINT # ForeignKey, int invoice_id TEXT -> BIGINT # raw FK _id column, int raw_invoices (2): service_delivery TEXT -> BOOLEAN # actually PaymentInvoice.service_delivery # is a BooleanField, not a text label exchange_rate NUMERIC -> BIGINT # actually a ForeignKey to # ExchangeRate lookup table, so it's # the FK PK not the numeric rate. My # 234d043 NUMERIC(14,6) guess was # semantically wrong too. 
raw_assessments (1): app TEXT -> BIGINT # ForeignKey to CommCareApp, int raw_completed_modules (2): module TEXT -> BIGINT # ForeignKey to LearnModule, int duration INTEGER -> TEXT # Django DurationField, DRF serializes # as string "H:MM:SS". Going to TEXT # for now — INTERVAL is the honest # type but needs more validation # against real payloads. Writer changes - New helper ``_json_or_none(value)`` in materializer.py for nullable JSONB columns. Preserves Python None → SQL NULL (json.dumps(None) would produce the string "null" which inserts as JSONB null, not SQL NULL — different semantics). - ``_write_connect_visits`` uses ``_json_or_none`` for ``flag_reason``. - ``_write_connect_users`` uses ``_json_or_none`` for ``claim_limits``. - All FK/_id fields drop the ``r.get("field", "")`` empty-string default in favor of ``r.get("field")`` so missing keys bind to SQL NULL via psycopg (same pattern used for the already-typed columns in 234d043). Test fixture updates - ``_visit_record`` in test_connect_data_loaders.py now matches the real DRF output: deliver_unit/completed_work/_id fields are ints, flag_reason is None by default (populated in the new dict test). - ``SIMPLE_LOADER_CASES`` matrix updated to match real serializer output per a walkthrough of serializer.py. payment_accrued is an IntegerField so it's now ``100`` not ``"100.00"``; claim_limits is now a list of dicts from get_claim_limits; service_delivery is a bool; exchange_rate is an int (FK); duration is the Django DurationField string "0:30:00"; all FK fields across the matrix are now ints. - Two new regression tests in TestConnectVisitLoader: test_flag_reason_dict_passes_through — pins the tenant 765 bug test_fk_ids_are_ints — pins the FK-as-int loader contract Verification - 60 loader tests pass (was 58; added the two visit regression pins). - Full repo suite: 802 passed, 15 skipped, 0 regressions. - ruff check + format clean. 
- Manual walkthrough of every serializer field against the Django model + DRF
  field class to confirm the type mapping for each of the 15 fixes.

Gap I want to own: there's still no integration test that runs the writer
against a real Postgres and round-trips typed data. 234d043 shipped because
the loader tests used requests_mock and never exercised executemany. Adding a
writer smoke test suite is the right follow-up so the next bug like this
fails in CI, not in production CloudWatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jonathan Jackson <jjackson@Jonathans-MBP.localdomain>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

One-line fix: add `--forwarded-allow-ips="*"` to the gunicorn command in `docker/start` so Django's `request.scheme` correctly reflects HTTPS when the request comes in through Traefik.

## The bug
Gunicorn defaults `--forwarded-allow-ips` to `127.0.0.1`, meaning it only trusts `X-Forwarded-*` headers from loopback. In our containerized Kamal + Traefik deployment, Traefik runs in a separate container, so its source IP is NOT 127.0.0.1, and gunicorn silently strips the `X-Forwarded-Proto: https` header before handing the request to Django.

Downstream effects inside Django:

- `request.scheme` returns `'http'` even though the client request was HTTPS
- `request.is_secure()` returns `False`
- `request.build_absolute_uri()` emits URLs with an `http://` scheme

## How we found it
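The symptom can be sketched with hypothetical names. Here `build_next_link` stands in for what `IdKeysetPagination.get_next_link()` does via `request.build_absolute_uri()`; the host, path, and ids are illustrative:

```python
from urllib.parse import urlencode

def build_next_link(scheme, host, path, last_id):
    # request.build_absolute_uri() derives the scheme from the WSGI environ;
    # when gunicorn strips X-Forwarded-Proto, that scheme falls back to "http".
    return f"{scheme}://{host}{path}?{urlencode({'last_id': last_id})}"

# What Django should emit when the proxy headers are trusted:
print(build_next_link("https", "connect.dimagi.com", "/export/opportunity/123/user_visits/", 4000))
# What it actually emitted before this fix, because the header was stripped:
print(build_next_link("http", "connect.dimagi.com", "/export/opportunity/123/user_visits/", 4000))
```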
During the CommCare Connect Labs v2 paginated-JSON export migration (replacing the v1 streaming CSV exports), we ran an end-to-end parity check against `connect.dimagi.com` and hit the bug immediately. `IdKeysetPagination.get_next_link()` in `commcare_connect/data_export/pagination.py` uses `request.build_absolute_uri()` to construct the `next` URL, and production was returning paginated responses with `http://` in the `next` field.

The server accepted the HTTPS request correctly but built the `next` URL with `http://` because Django thought the original request was HTTP. Clients following the link hit the edge's canonical HTTPS redirect and received a 301, which broke any HTTP client that didn't have `follow_redirects=True` (httpx defaults to `False`).

The Labs client worked around it with redirect following, but the root cause is here — and this affects any code path that uses `build_absolute_uri()` or `is_secure()`, not just pagination. A latent example I spotted while investigating: the `SECURE_SSL_REDIRECT` middleware currently defaults to `False` in `staging.py`, but would infinite-loop if ever flipped to `True` given the current gunicorn config.

## The fix
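A sketch of the resulting gunicorn invocation in `docker/start`; the WSGI module path and the other flags here are assumptions, only the added `--forwarded-allow-ips` flag is the point:

```shell
# docker/start (sketch; the existing bind/worker flags may differ)
gunicorn config.wsgi:application \
    --bind 0.0.0.0:8000 \
    --forwarded-allow-ips="*"
```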
Tell gunicorn to trust `X-Forwarded-*` headers from any source.

## Why this is safe
Gunicorn binds to `0.0.0.0:8000` inside the Docker network, where only Traefik can reach it. There is no path for an untrusted source to inject forged forwarded headers — an external attacker would need to first compromise the container network, at which point forwarded-header spoofing is the least of your concerns.

This is also the documented pattern for gunicorn-behind-a-reverse-proxy setups: https://docs.gunicorn.org/en/stable/settings.html#forwarded-allow-ips
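For intuition, gunicorn's trust decision behaves roughly like this simplified paraphrase. This is not gunicorn's actual code, just an illustration of the allow-list check:

```python
def effective_headers(headers, peer_ip, forwarded_allow_ips="127.0.0.1"):
    """Rough paraphrase of gunicorn's forwarded-header trust decision."""
    trusted = (forwarded_allow_ips == "*"
               or peer_ip in forwarded_allow_ips.split(","))
    if trusted:
        return dict(headers)
    # Untrusted peer: X-Forwarded-* headers never reach the WSGI app
    return {k: v for k, v in headers.items()
            if not k.lower().startswith("x-forwarded-")}

hdrs = {"Host": "connect.dimagi.com", "X-Forwarded-Proto": "https"}

# Default allow list: Traefik's container IP is not loopback, header is dropped
assert "X-Forwarded-Proto" not in effective_headers(hdrs, "172.18.0.2")

# With --forwarded-allow-ips="*": the header survives and Django sees https
assert effective_headers(hdrs, "172.18.0.2", "*")["X-Forwarded-Proto"] == "https"
```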
Alternative: specify the exact Traefik container's CIDR (e.g., `172.17.0.0/16` for the default Docker bridge network). That's slightly more defensive but brittle — any change to the Docker network layout breaks it. The `"*"` pattern is the standard recommendation for containerized reverse-proxy setups.

## Test plan
After this lands and deploys:

1. Hit any paginated v2 endpoint: `GET https://connect.dimagi.com/export/opportunity/<id>/user_visits/` with the `Accept: application/json; version=2.0` header.
2. Verify the `next` field in the response body starts with `https://`, not `http://`.
3. Follow the `next` URL in a second request and confirm it returns 200 directly (not 301).
4. Spot-check any existing feature that uses `build_absolute_uri()` to construct URLs (e.g., email templates, webhook URLs) — these should now generate `https://` URLs.

## Related
`ExportAPIClient` uses `follow_redirects=True` as a defensive workaround; that can stay in place even after this upstream fix lands, but will become a no-op once `next` URLs are correct.

🤖 Generated with Claude Code