Collector hardening: aliases, rollback, PG18, leak #11

Merged
sergeyfast merged 5 commits into master from feat/collector-hardening on May 13, 2026

@sergeyfast

Summary

Four independent fixes batched on one branch; each commit stands on its own and is safe to revert in isolation.

1. b61f64a Refine botlog field alias resolution

  • Rename FieldAliases.Merge to WithFallback so the call site (override.WithFallback(detected).WithFallback(defaults)) reads left-to-right.
  • Add realip_remote_addr to remote_addr candidate list — Angie/nginx realip-style log_formats now resolve correctly instead of falling back to peer IP.
  • Memoise compiled regexes and per-format detection results across tailed paths sharing one log_format.
  • Log per-field alias provenance (override/detected/default) and canonical_path at startup so oncall can see why each field resolved as it did.
  • Add botlog_alias_mismatch to topsrv_collector_config_warnings_total help-string; point botlog_no_ua_field warning at [BotLogs.FieldAliases].UserAgent as the escape hatch.
  • Update README + docs/metrics.md to describe auto-detect, alias candidates, and the new provenance log line.
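
The layering in the call site above can be sketched as follows. This is a minimal illustration, not the code from the commit: the field set comes from the list of semantic fields named in this PR, while the `pick` helper, the struct layout, and the sample alias values are assumptions.

```go
package main

import "fmt"

// FieldAliases maps each semantic field to the name the log parser surfaces.
// The field names follow the PR text; the exact layout is assumed.
type FieldAliases struct {
	UserAgent  string
	Host       string
	ServerName string
	RemoteAddr string
	Referer    string
}

// pick is a hypothetical helper: primary wins unless it is empty.
func pick(primary, fallback string) string {
	if primary != "" {
		return primary
	}
	return fallback
}

// WithFallback returns a copy of a whose empty fields are filled from fb.
// Fields that are empty in both stay empty, so the caller can skip them.
func (a FieldAliases) WithFallback(fb FieldAliases) FieldAliases {
	return FieldAliases{
		UserAgent:  pick(a.UserAgent, fb.UserAgent),
		Host:       pick(a.Host, fb.Host),
		ServerName: pick(a.ServerName, fb.ServerName),
		RemoteAddr: pick(a.RemoteAddr, fb.RemoteAddr),
		Referer:    pick(a.Referer, fb.Referer),
	}
}

func main() {
	// Sample values are illustrative only.
	override := FieldAliases{UserAgent: "ua"}
	detected := FieldAliases{RemoteAddr: "realip_remote_addr", Referer: "ref"}
	defaults := FieldAliases{
		UserAgent:  "http_user_agent",
		Host:       "host",
		RemoteAddr: "remote_addr",
		Referer:    "http_referer",
	}

	// Highest priority on the left, lowest on the right.
	resolved := override.WithFallback(detected).WithFallback(defaults)
	fmt.Printf("%+v\n", resolved)
}
```

Returning a copy from a value receiver keeps each layer immutable, which is what makes the chained call safe to read left-to-right.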

2. 2de76f6 Skip non-crash restarts in update rollback check

Fixes a real false-positive: 7 manual restarts inside the 5-min post-update window had tripped a rollback to v0.0.21 even though nothing had actually crashed.

  • Add Graceful flag in updateState and mark it on ctx.Done so SIGTERM / manual restarts no longer bump RestartCount.
  • Run markStableAfter goroutine that zeros RestartCount once the new binary has been alive for 60s — caps the brittle early-life window.
  • Guard the defer with markGracefulIfCancelled so a panic in Run cannot mask itself as a clean shutdown.
  • Extract attemptRollback so the success path is unit-testable without os.Exit; clear LastUpdate/RestartCount/Graceful after a successful rollback to stop the supervisor from re-tripping rollback against the same binary, while preserving version history for post-mortem.
  • Drop the redundant os.Stat before replace; surface missing backup as os.ErrNotExist for callers.

3. 4594fd5 Probe pg_stat features instead of assuming schema

Fixes 42P01 / 42703 errors visible on PG18 hosts where switchToLargestDB lands in a database without CREATE EXTENSION pg_stat_statements.

  • Replace hardcoded schema assumptions with relHasColumn probes via to_regclass.
  • Skip pg_stat_wal.wal_write_time/wal_sync_time on PG18 (columns moved into pg_stat_io); the rest of pg_stat_wal keeps emitting.
  • Drop the total_time fallback in pg_stat_statements — the column was renamed in PG13/extension 1.8, so falling back triggered 42703 once the extension was actually loaded.
  • collectStatements early-returns when statementsTimeCol is empty so unsupported installs stop spamming errors each scrape.
  • collectStatWAL builds SELECT-list and Scan args dynamically so a single code path serves PG14..PG18.
  • Introduce versionPG18 and relPgStat* name constants so a typo or rename can't silently disable feature detection.

4. fb9bf8d Fix slow leak in postgres app-name cache

Memory grew linearly at ~70 MB/day on a Uteka host — root cause was the appNames map storing (queryid, application_name) pairs forever; each process restart with a new pid/uuid suffix minted a fresh entry.

  • Store last-seen time per pair (map[int64]map[string]time.Time) and add pruneAppNames to evict pairs older than appNamesTTL = 1h plus drop newly-empty queryid sub-maps.
  • Run pruneAppNames on its own 5-minute ticker alongside the existing 1s sample ticker; full walk of a 250k-entry map is ~14 ms, well below the existing 2s sample budget.
  • Memory now bounded by (active queryids × distinct app_names per hour), independent of agent uptime; topsrv.io caches the data server-side anyway.

Test plan

  • make fmt lint test — 0 issues, all packages green
  • make test-integration — full docker-compose stack (PG17 + nginx + angie + VictoriaMetrics) passes, 53 metric families collected
  • Manual repro against orbstack containers on PG15.17, PG17.8, PG18.3:
    • 0 ERROR lines across all three versions
    • topsrv_pg_query_calls_total emits on all three (20 samples each)
    • topsrv_pg_wal_* records/fpi/buffers_full present everywhere; wal_io_time correctly omitted on PG18
  • Verified the original user-reported failure mode (extension created in postgres DB only, app_db larger → switchToLargestDB lands without extension, then CREATE EXTENSION at runtime) no longer produces 42P01/42703
  • Smoke-check on a real Uteka-style host after deploy: RSS should plateau within an hour instead of growing linearly; crash-loop detection should not trigger on manual systemctl restart

- Add FieldAliases struct and DetectAliases helper that walks an
  nginx log_format string and resolves each semantic field (UA,
  Host, ServerName, RemoteAddr, Referer) to the actual name the
  parser will surface — nginx variable for text/gonx formats, the
  wrapping JSON key for JSON formats
- Layer DefaultAliases under auto-detected values and TOML
  override on top via Merge; an empty result skips the field
  rather than colliding on the "" stub
- NewObserver and RequiredFields now take aliases instead of
  hardcoding nginx variable names so custom JSON keys ("ref"
  for $http_referer) and typo variants ($http_referrer) ship
  the right data
- registerLogCollector resolves aliases per tailed path; if
  resolutions diverge across paths we emit a config warning and
  use the first path's mapping (per-path Observer wiring is v2)
- 14 alias unit tests covering combined / key=value / logfmt /
  hybrid / JSON / referrer-typo / XFF-fallback / word-boundary
- Rename FieldAliases.Merge to WithFallback for clearer layering
  intent (override.WithFallback(detected).WithFallback(defaults))
- Memoise compiled regexes and per-format detection results to
  avoid redundant work across tailed log paths sharing a format
- Add realip_remote_addr to remote_addr candidate list so
  nginx/angie realip-style log_formats resolve correctly
- Log per-field alias provenance (override/detected/default) and
  canonical_path at startup so oncall can see why each field
  resolved as it did
- Add botlog_alias_mismatch to configWarnings help-string; point
  botlog_no_ua_field warning at [BotLogs.FieldAliases].UserAgent
  as the escape hatch
- Cover FieldAliases.String, word-boundary regression, realip
  variants, and resolveBotlogAliases override/mismatch scenarios
- Document auto-detect, alias candidates, and provenance log line
  in README; remove stale claim about ExtraLabels being dropped
- Add Graceful flag to updateState and mark it on ctx.Done so
  SIGTERM / manual restarts inside the post-update window no
  longer bump RestartCount and trip a false crash-loop rollback
- Run a markStableAfter goroutine that zeros RestartCount once
  the new binary has been alive for 60s, capping the brittle
  early-life window in which restarts can accumulate
- Guard the defer with markGracefulIfCancelled so a panic in Run
  cannot mask itself as a clean shutdown
- Extract attemptRollback so the success path is unit-testable
  without os.Exit; clear LastUpdate/RestartCount/Graceful after
  a successful rollback to stop the supervisor from re-tripping
  rollback against the same binary, while preserving version
  history for post-mortem
- Drop the os.Stat before replace; surface missing backup as
  os.ErrNotExist for callers
- Introduce backupPrefix const, replacing three "topsrv-"
  literals across backup, attemptRollback, and extractVersion
- Cover graceful skip, increment, threshold, outside-window
  reset, missing state, panic-safety, attemptRollback success
  and missing-backup, plus markStable reset and cancel paths
- Replace hardcoded schema assumptions with relHasColumn probes via
  to_regclass; fixes 42P01 when switchToLargestDB lands in a database
  where pg_stat_statements is not installed
- Skip pg_stat_wal.wal_write_time/wal_sync_time on PG18 (columns
  removed, moved into pg_stat_io); fixes 42703 on every scrape
- Drop the total_time fallback in pg_stat_statements; the column was
  renamed in PG13/extension 1.8, so falling back to it triggered
  42703 once the extension was actually loaded
- collectStatements early-returns when statementsTimeCol is empty so
  unsupported installs stop spamming errors each scrape
- collectStatWAL builds SELECT-list and Scan args dynamically so a
  single code path serves PG14..PG18
- Introduce versionPG18 and relPgStat* name constants so a typo or
  rename can't silently disable feature detection
- Verified on PG15/PG17/PG18 in orbstack: zero ERROR lines; query and
  WAL metrics emit correctly with wal_io_time omitted on PG18
- appNames accumulated (queryid, application_name) pairs forever
  on every sample tick: a process restart with a new pid/uuid in
  application_name minted a fresh entry; RSS grew ~70MB/day on
  busy hosts
- Store last-seen time per pair (map[int64]map[string]time.Time)
  and add pruneAppNames to evict pairs older than appNamesTTL=1h
  plus drop newly-empty queryid sub-maps
- Run pruneAppNames on its own 5-minute ticker alongside the
  existing 1s sample ticker; full walk of a 250k-entry map is
  ~14ms, well below the existing 2s sample budget
- Cover stale eviction, fresh retention, and empty sub-map cleanup
  in TestPruneAppNamesEvictsStale
@sergeyfast merged commit de23509 into master on May 13, 2026
2 checks passed