Skip to content

fix: disable HTTP/2, raise broadcast timeout to 30s, restore bounded retry for idempotent reads#257

Merged
ety001 merged 4 commits into
nextfrom
fix/disable-http2-and-tune-broadcast
Jun 1, 2026
Merged

fix: disable HTTP/2, raise broadcast timeout to 30s, restore bounded retry for idempotent reads#257
ety001 merged 4 commits into
nextfrom
fix/disable-http2-and-tune-broadcast

Conversation

@ety001

@ety001 ety001 commented May 29, 2026

Copy link
Copy Markdown
Member

Why

Wallets reported "broadcast timeout" after swapping traffic from legacy Python jussi to the Go rewrite. Triage via jaeger + watchtower revealed three independent gaps, addressed by this PR.

What this PR does

1. disable HTTP/2 and add connection rotation for upstream clients (existing commit 04b14b8)

Force HTTP/1.1 for upstream clients, cap MaxConnsPerHost, so the connection pool actually distributes across ALB backend instances instead of pinning to one.

2. enforce 15s minimum upstream timeout for broadcast methods (6172593)

Extracts selectUpstreamTimeout(req) as a pure function. Raises defaultUpstreamTimeout 3s → 15s and introduces broadcastMinimumTimeout = 15s (raised again in commit 3) so a forgotten or zeroed config entry can no longer clip a synchronous broadcast.

3. selective retry, 30s broadcast timeout, upstream observability (31a5ecb)

Three orthogonal improvements:

  • Selective retry: bring back bounded retry (≤2 attempts, 100–500 ms backoff with 25% jitter) for idempotent methods only via HTTPClient.RequestWithRetry + IsRetriableUpstreamError. broadcast_transaction* remains single-attempt — a retry could submit the same tx twice. The retry budget is intentionally tighter than legacy (3 attempts/multi-second backoff) so it cannot reintroduce the expiration-on-retry pattern that motivated 9cf36ea.
  • Broadcast minimum timeout 15s → 30s: production traces show sync broadcast p95 ≈ 3s but ALB-fronted steemd nodes occasionally hang 15+ s during fork/restart windows. 30s covers ~10 blocks of upstream slowness without unbounded waits. The parallel orchestration PR updates the production config to match.
  • Observability:
    • Startup INFO per upstream with parsed URL count + namespace/broadcast timeouts ("the config I shipped is the config that's running" in one glance).
    • Throttled INFO (1/method/min) when selectUpstreamTimeout actually floors a sub-minimum config to the broadcast minimum, surfacing config drift without log-spam.
    • Structured WARN on every broadcast upstream error with method / namespace / upstream_url / jussi_request_id / upstream_message.

Diagnostic evidence

  • Sample jaeger trace of a real broadcast failure during the incident showed total span 11.84 ms with upstream returned error: missing required active authority — not a timeout. Aggregating 500 traces in a 30-min window revealed 67% of broadcast errors were transaction_expiration_exception, all with expiration - jussi_entry_time = -80 to -96 minutes — i.e. wallets had signed those tx ~1.5 h earlier when the upstream was returning a stale head_block_time. Companion get_dynamic_global_properties traces showed the upstream actively returning Bad Cast: object_type to Array during that window.
  • Production broadcast_transaction_synchronous OK p95 = 3.07 s, max = 3.27 s. Sync broadcasts routinely sit just under the 3 s ceiling because they wait one block confirmation — under the previous 3 s default, any upstream slowness clipped them. The 30 s floor protects against that.

Test plan

  • go test ./... -race -count=1 -short — full repo PASS
  • go vet ./... — clean
  • go build ./... — clean
  • TestSelectUpstreamTimeout 8 + 1 cases (incl. 30 s floor sub/equal/above)
  • TestIsRetriableUpstreamError 18 cases (incl. wrapped context errors, 5xx vs 4xx)
  • TestRetryConfigBackoff* (exponential cap + jitter bounded)
  • TestDefaultRetryConfigSafe (asserts worst-case budget ≤ 2 s)
  • TestRequestWithRetry* (httptest: success-after-transient-500, no-retry-on-4xx, give-up-after-max-attempts, context-cancellation-stops-loop)
  • Deploy to staging EB env, swap a fraction of traffic, verify:
    • jaeger broadcast traces using timeout_applied=30s
    • INFO upstream registered lines on startup match production config
    • WARN broadcast upstream returned error only fires on actual upstream error

Out of scope (follow-up)

  • head_block_time upstream-lag detector with Prometheus gauge — needs a small internal/telemetry change, will land in a separate PR.
  • Splitting redis logical DBs between jussi-next (db=1) and legacy (db=0) so get_block cache hits stop being polluted across versions (orchestration-side change).

ety001 added 4 commits May 29, 2026 15:11
- Force HTTP/1.1 by setting NextProtos to ["http/1.1"] and
  ForceAttemptHTTP2=false to prevent HTTP/2 multiplexing
- HTTP/2 multiplexes all requests over a single TCP connection,
  which causes all traffic to hit one ALB target behind the scenes
- Set MaxConnsPerHost=20 to encourage connection rotation across
  multiple backend instances
- This should improve load distribution to steemd instances and
  reduce broadcast_transaction_synchronous failures caused by
  uneven load on a single steemd backend
Extracts upstream timeout resolution into selectUpstreamTimeout(req)
so the policy is explicit and testable:

- defaultUpstreamTimeout (used when the per-URN config sets timeout=0)
  raised from 3s to 15s to match the production upstream config.
- broadcastMinimumTimeout introduced at 15s: any broadcast_transaction*
  request always gets at least 15s regardless of config. A 3s clip on a
  synchronous broadcast surfaces to wallets as
  transaction_expiration_exception once the caller retries with a fresh
  expiration, even though the original tx already made it into a block.

The existing callHTTPUpstream now delegates to selectUpstreamTimeout
instead of inlining the timeout-or-default branch.

Test coverage:
- Table-driven selectUpstreamTimeout cases for non-broadcast / broadcast,
  config 0 / sub-minimum / above-minimum, both broadcast methods.
- Safety assertion that both constants stay above one block-confirmation
  window (10s) so future refactors can't silently shrink them.
Three orthogonal improvements that together close the gap between the Go
rewrite and the legacy Python jussi behaviour observed by wallets during
upstream steemd incidents:

1. Selective upstream retry (internal/upstream/retry.go,
   HTTPClient.RequestWithRetry, processor.callHTTPUpstream)

   Commit 9cf36ea removed all upstream retry to fix a different bug
   (retry-induced expiration), but this also removed legacy's implicit
   tolerance for transient ALB/steemd blips on read paths. Idempotent
   methods now retry up to 2 attempts with 100ms→500ms backoff + 25%
   jitter on retriable failures (connection reset, EOF, i/o timeout,
   HTTP 5xx). Non-retriable errors (context cancel/deadline, 4xx)
   short-circuit immediately. broadcast_transaction* methods are still
   single-attempt — a retry could submit the same tx twice.

   The retry budget is intentionally tighter than legacy (≤2 vs 3
   attempts, ≤500ms vs 5s backoff) so the worst-case retry tail stays
   well under typical wallet timeouts, avoiding the issue 9cf36ea fixed.

2. Broadcast minimum timeout 15s → 30s
   (internal/handlers/processor.go const broadcastMinimumTimeout)

   Production broadcast_synchronous traces show p95 wall time of ~3s
   (one block confirmation), but ALB-fronted steemd nodes can hang
   ≥15s during fork/restart windows. The 15s floor was clipping these
   recoverable cases. 30s covers ~10 blocks of upstream slowness while
   still bounding wallet response time. Production upstream config in
   the orchestration repo is updated in a parallel commit.

3. Observability for upstream config and broadcast errors
   (Router.logConfigSummary, selectUpstreamTimeout throttled log,
   processor.ProcessRequest broadcast-error structured log)

   - At startup each upstream emits one INFO line with its parsed URL
     count, translate_to_appbase flag, namespace timeout, and broadcast
     timeout — operators can verify "the config I shipped is the config
     that's running" in one glance.
   - selectUpstreamTimeout emits a throttled (1/method/minute) INFO log
     when the broadcast floor actually overrides a smaller configured
     value, surfacing config drift without log-spam.
   - Broadcast upstream errors emit a structured WARN log with method,
     namespace, upstream URL, jussi_request_id, and the upstream error
     message — the signal operators need to correlate "wallets seeing
     failed broadcasts" with an upstream incident, without paging
     through individual jaeger traces.

Test coverage
- internal/upstream/retry_test.go: IsRetriableUpstreamError truth table
  (15 cases incl. context cancel, wrapped errors, 5xx vs 4xx), backoff
  exponential-with-cap, jitter bounded-and-positive, default config
  safety assertion (worst-case budget ≤ 2s).
- internal/upstream/http_retry_test.go: httptest-backed RequestWithRetry
  covering success-after-transient-500, no-retry-on-4xx,
  give-up-after-max-attempts, context-cancellation-stops-loop.
- internal/handlers/processor_timeout_test.go: extended for the 30s
  floor (sub-minimum, equal-to-minimum, above-minimum, previous-15s
  raised to new 30s).
…log throttle

- Replace string-matching 5xx detection with UpstreamStatusError type
- Switch broadcast method detection from map to prefix match (broadcast_*)
- Make shouldLogBroadcastFloor concurrency-safe with CompareAndSwap/LoadOrStore
- Add nil-guard for req.URN in selectUpstreamTimeout
- Expand tests for edge cases and concurrent log throttling
@ety001 ety001 merged commit e79cbf2 into next Jun 1, 2026
2 checks passed
@ety001 ety001 deleted the fix/disable-http2-and-tune-broadcast branch June 1, 2026 07:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant