fix: disable HTTP/2, raise broadcast timeout to 30s, restore bounded retry for idempotent reads by ety001 · Pull Request #257 · steemit/jussi

ety001 · 2026-05-29T23:24:32Z

Why

Wallets reported "broadcast timeout" after swapping traffic from legacy Python jussi to the Go rewrite. Triage via jaeger + watchtower revealed three independent gaps, addressed by this PR.

What this PR does

1. `disable HTTP/2 and add connection rotation for upstream clients` (existing commit `04b14b8`)

Force HTTP/1.1 for upstream clients, cap MaxConnsPerHost, so the connection pool actually distributes across ALB backend instances instead of pinning to one.

2. `enforce 15s minimum upstream timeout for broadcast methods` (`6172593`)

Extracts selectUpstreamTimeout(req) as a pure function. Raises defaultUpstreamTimeout 3s → 15s and introduces broadcastMinimumTimeout = 15s (raised again in commit 3) so a forgotten or zeroed config entry can no longer clip a synchronous broadcast.

3. `selective retry, 30s broadcast timeout, upstream observability` (`31a5ecb`)

Three orthogonal improvements:

Selective retry: bring back bounded retry (≤2 attempts, 100–500 ms backoff with 25% jitter) for idempotent methods only via HTTPClient.RequestWithRetry + IsRetriableUpstreamError. broadcast_transaction* remains single-attempt — a retry could submit the same tx twice. The retry budget is intentionally tighter than legacy (3 attempts/multi-second backoff) so it cannot reintroduce the expiration-on-retry pattern that motivated 9cf36ea.
Broadcast minimum timeout 15s → 30s: production traces show sync broadcast p95 ≈ 3s but ALB-fronted steemd nodes occasionally hang 15+ s during fork/restart windows. 30s covers ~10 blocks of upstream slowness without unbounded waits. The parallel orchestration PR updates the production config to match.
Observability:
- Startup INFO per upstream with parsed URL count + namespace/broadcast timeouts ("the config I shipped is the config that's running" in one glance).
- Throttled INFO (1/method/min) when selectUpstreamTimeout actually floors a sub-minimum config to the broadcast minimum, surfacing config drift without log-spam.
- Structured WARN on every broadcast upstream error with method / namespace / upstream_url / jussi_request_id / upstream_message.

Diagnostic evidence

Sample jaeger trace of a real broadcast failure during the incident showed total span 11.84 ms with upstream returned error: missing required active authority — not a timeout. Aggregating 500 traces in a 30-min window revealed 67% of broadcast errors were transaction_expiration_exception, all with expiration - jussi_entry_time = -80 to -96 minutes — i.e. wallets had signed those tx ~1.5 h earlier when the upstream was returning a stale head_block_time. Companion get_dynamic_global_properties traces showed the upstream actively returning Bad Cast: object_type to Array during that window.
Production broadcast_transaction_synchronous OK p95 = 3.07 s, max = 3.27 s. Sync broadcasts routinely sit just under the 3 s ceiling because they wait one block confirmation — under the previous 3 s default, any upstream slowness clipped them. The 30 s floor protects against that.

Test plan

go test ./... -race -count=1 -short — full repo PASS
go vet ./... — clean
go build ./... — clean
TestSelectUpstreamTimeout 8 + 1 cases (incl. 30 s floor sub/equal/above)
TestIsRetriableUpstreamError 18 cases (incl. wrapped context errors, 5xx vs 4xx)
TestRetryConfigBackoff* (exponential cap + jitter bounded)
TestDefaultRetryConfigSafe (asserts worst-case budget ≤ 2 s)
TestRequestWithRetry* (httptest: success-after-transient-500, no-retry-on-4xx, give-up-after-max-attempts, context-cancellation-stops-loop)
Deploy to staging EB env, swap a fraction of traffic, verify:
- jaeger broadcast traces using timeout_applied=30s
- INFO upstream registered lines on startup match production config
- WARN broadcast upstream returned error only fires on actual upstream error

Out of scope (follow-up)

head_block_time upstream-lag detector with Prometheus gauge — needs a small internal/telemetry change, will land in a separate PR.
Splitting redis logical DBs between jussi-next (db=1) and legacy (db=0) so get_block cache hits stop being polluted across versions (orchestration-side change).

- Force HTTP/1.1 by setting NextProtos to ["http/1.1"] and ForceAttemptHTTP2=false to prevent HTTP/2 multiplexing - HTTP/2 multiplexes all requests over a single TCP connection, which causes all traffic to hit one ALB target behind the scenes - Set MaxConnsPerHost=20 to encourage connection rotation across multiple backend instances - This should improve load distribution to steemd instances and reduce broadcast_transaction_synchronous failures caused by uneven load on a single steemd backend

Extracts upstream timeout resolution into selectUpstreamTimeout(req) so the policy is explicit and testable: - defaultUpstreamTimeout (used when the per-URN config sets timeout=0) raised from 3s to 15s to match the production upstream config. - broadcastMinimumTimeout introduced at 15s: any broadcast_transaction* request always gets at least 15s regardless of config. A 3s clip on a synchronous broadcast surfaces to wallets as transaction_expiration_exception once the caller retries with a fresh expiration, even though the original tx already made it into a block. The existing callHTTPUpstream now delegates to selectUpstreamTimeout instead of inlining the timeout-or-default branch. Test coverage: - Table-driven selectUpstreamTimeout cases for non-broadcast / broadcast, config 0 / sub-minimum / above-minimum, both broadcast methods. - Safety assertion that both constants stay above one block-confirmation window (10s) so future refactors can't silently shrink them.

Three orthogonal improvements that together close the gap between the Go rewrite and the legacy Python jussi behaviour observed by wallets during upstream steemd incidents: 1. Selective upstream retry (internal/upstream/retry.go, HTTPClient.RequestWithRetry, processor.callHTTPUpstream) Commit 9cf36ea removed all upstream retry to fix a different bug (retry-induced expiration), but this also removed legacy's implicit tolerance for transient ALB/steemd blips on read paths. Idempotent methods now retry up to 2 attempts with 100ms→500ms backoff + 25% jitter on retriable failures (connection reset, EOF, i/o timeout, HTTP 5xx). Non-retriable errors (context cancel/deadline, 4xx) short-circuit immediately. broadcast_transaction* methods are still single-attempt — a retry could submit the same tx twice. The retry budget is intentionally tighter than legacy (≤2 vs 3 attempts, ≤500ms vs 5s backoff) so the worst-case retry tail stays well under typical wallet timeouts, avoiding the issue 9cf36ea fixed. 2. Broadcast minimum timeout 15s → 30s (internal/handlers/processor.go const broadcastMinimumTimeout) Production broadcast_synchronous traces show p95 wall time of ~3s (one block confirmation), but ALB-fronted steemd nodes can hang ≥15s during fork/restart windows. The 15s floor was clipping these recoverable cases. 30s covers ~10 blocks of upstream slowness while still bounding wallet response time. Production upstream config in the orchestration repo is updated in a parallel commit. 3. Observability for upstream config and broadcast errors (Router.logConfigSummary, selectUpstreamTimeout throttled log, processor.ProcessRequest broadcast-error structured log) - At startup each upstream emits one INFO line with its parsed URL count, translate_to_appbase flag, namespace timeout, and broadcast timeout — operators can verify "the config I shipped is the config that's running" in one glance. - selectUpstreamTimeout emits a throttled (1/method/minute) INFO log when the broadcast floor actually overrides a smaller configured value, surfacing config drift without log-spam. - Broadcast upstream errors emit a structured WARN log with method, namespace, upstream URL, jussi_request_id, and the upstream error message — the signal operators need to correlate "wallets seeing failed broadcasts" with an upstream incident, without paging through individual jaeger traces. Test coverage - internal/upstream/retry_test.go: IsRetriableUpstreamError truth table (15 cases incl. context cancel, wrapped errors, 5xx vs 4xx), backoff exponential-with-cap, jitter bounded-and-positive, default config safety assertion (worst-case budget ≤ 2s). - internal/upstream/http_retry_test.go: httptest-backed RequestWithRetry covering success-after-transient-500, no-retry-on-4xx, give-up-after-max-attempts, context-cancellation-stops-loop. - internal/handlers/processor_timeout_test.go: extended for the 30s floor (sub-minimum, equal-to-minimum, above-minimum, previous-15s raised to new 30s).

…log throttle - Replace string-matching 5xx detection with UpstreamStatusError type - Switch broadcast method detection from map to prefix match (broadcast_*) - Make shouldLogBroadcastFloor concurrency-safe with CompareAndSwap/LoadOrStore - Add nil-guard for req.URN in selectUpstreamTimeout - Expand tests for edge cases and concurrent log throttling

ety001 added 4 commits May 29, 2026 15:11

ety001 merged commit e79cbf2 into next Jun 1, 2026
2 checks passed

ety001 deleted the fix/disable-http2-and-tune-broadcast branch June 1, 2026 07:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: disable HTTP/2, raise broadcast timeout to 30s, restore bounded retry for idempotent reads#257

fix: disable HTTP/2, raise broadcast timeout to 30s, restore bounded retry for idempotent reads#257
ety001 merged 4 commits into
nextfrom
fix/disable-http2-and-tune-broadcast

ety001 commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ety001 commented May 29, 2026

Why

What this PR does

1. disable HTTP/2 and add connection rotation for upstream clients (existing commit 04b14b8)

2. enforce 15s minimum upstream timeout for broadcast methods (6172593)

3. selective retry, 30s broadcast timeout, upstream observability (31a5ecb)

Diagnostic evidence

Test plan

Out of scope (follow-up)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. `disable HTTP/2 and add connection rotation for upstream clients` (existing commit `04b14b8`)

2. `enforce 15s minimum upstream timeout for broadcast methods` (`6172593`)

3. `selective retry, 30s broadcast timeout, upstream observability` (`31a5ecb`)