Collector hardening: aliases, rollback, PG18, leak#11
Merged
Conversation
- Add FieldAliases struct and DetectAliases helper that walks an
nginx log_format string and resolves each semantic field (UA,
Host, ServerName, RemoteAddr, Referer) to the actual name the
parser will surface — nginx variable for text/gonx formats, the
wrapping JSON key for JSON formats
- Layer DefaultAliases under auto-detected values and TOML
override on top via Merge; empty result skips the field rather
than collisions on the "" stub
- NewObserver and RequiredFields now take aliases instead of
hardcoding nginx variable names so custom JSON keys ("ref"
for $http_referer) and typo variants ($http_referrer) ship
the right data
- registerLogCollector resolves aliases per tailed path; if
resolutions diverge across paths we emit a config warning and
use the first path's mapping (per-path Observer wiring is v2)
- 14 alias unit tests covering combined / key=value / logfmt /
hybrid / JSON / referrer-typo / XFF-fallback / word-boundary
- Rename FieldAliases.Merge to WithFallback for clearer layering intent (override.WithFallback(detected).WithFallback(defaults)) - Memoise compiled regexes and per-format detection results to avoid redundant work across tailed log paths sharing a format - Add realip_remote_addr to remote_addr candidate list so nginx/angie realip-style log_formats resolve correctly - Log per-field alias provenance (override/detected/default) and canonical_path at startup so oncall can see why each field resolved as it did - Add botlog_alias_mismatch to configWarnings help-string; point botlog_no_ua_field warning at [BotLogs.FieldAliases].UserAgent as the escape hatch - Cover FieldAliases.String, word-boundary regression, realip variants, and resolveBotlogAliases override/mismatch scenarios - Document auto-detect, alias candidates, and provenance log line in README; remove stale claim about ExtraLabels being dropped
- Add Graceful flag to updateState and mark it on ctx.Done so SIGTERM / manual restarts inside the post-update window no longer bump RestartCount and trip a false crash-loop rollback - Run a markStableAfter goroutine that zeros RestartCount once the new binary has been alive for 60s, capping the brittle early-life window in which restarts can accumulate - Guard the defer with markGracefulIfCancelled so a panic in Run cannot mask itself as a clean shutdown - Extract attemptRollback so the success path is unit-testable without os.Exit; clear LastUpdate/RestartCount/Graceful after a successful rollback to stop the supervisor from re-tripping rollback against the same binary, while preserving version history for post-mortem - Drop the os.Stat before replace; surface missing backup as os.ErrNotExist for callers - Introduce backupPrefix const, replacing three "topsrv-" literals across backup, attemptRollback, and extractVersion - Cover graceful skip, increment, threshold, outside-window reset, missing state, panic-safety, attemptRollback success and missing-backup, plus markStable reset and cancel paths
- Replace hardcoded schema assumptions with relHasColumn probes via to_regclass; fixes 42P01 when switchToLargestDB lands in a database where pg_stat_statements is not installed - Skip pg_stat_wal.wal_write_time/wal_sync_time on PG18 (columns removed, moved into pg_stat_io); fixes 42703 on every scrape - Drop the total_time fallback in pg_stat_statements; the column was renamed in PG13/extension 1.8, so falling back to it triggered 42703 once the extension was actually loaded - collectStatements early-returns when statementsTimeCol is empty so unsupported installs stop spamming errors each scrape - collectStatWAL builds SELECT-list and Scan args dynamically so a single code path serves PG14..PG18 - Introduce versionPG18 and relPgStat* name constants so a typo or rename can't silently disable feature detection - Verified on PG15/PG17/PG18 in orbstack: zero ERROR lines; query and WAL metrics emit correctly with wal_io_time omitted on PG18
- appNames accumulated (queryid, application_name) pairs forever on every sample tick: a process restart with a new pid/uuid in application_name minted a fresh entry; RSS grew ~70MB/day on busy hosts - Store last-seen time per pair (map[int64]map[string]time.Time) and add pruneAppNames to evict pairs older than appNamesTTL=1h plus drop newly-empty queryid sub-maps - Run pruneAppNames on its own 5-minute ticker alongside the existing 1s sample ticker; full walk of a 250k-entry map is ~14ms, well below the existing 2s sample budget - Cover stale eviction, fresh retention, and empty sub-map cleanup in TestPruneAppNamesEvictsStale
7 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Four independent fixes batched on one branch; each commit stands on its own and is safe to revert in isolation.
1.
b61f64aRefine botlog field alias resolutionFieldAliases.Merge→WithFallbackso the call site (override.WithFallback(detected).WithFallback(defaults)) reads left-to-right.realip_remote_addrto remote_addr candidate list — Angie/nginx realip-style log_formats now resolve correctly instead of falling back to peer IP.botlog_alias_mismatchtotopsrv_collector_config_warnings_totalhelp-string; pointbotlog_no_ua_fieldwarning at[BotLogs.FieldAliases].UserAgentas the escape hatch.docs/metrics.mdto describe auto-detect, alias candidates, and the new provenance log line.2.
2de76f6Skip non-crash restarts in update rollback checkFixes a real false-positive: 7 manual restarts inside the 5-min post-update window had tripped a rollback to v0.0.21 even though nothing had actually crashed.
Gracefulflag inupdateStateand mark it onctx.Doneso SIGTERM / manual restarts no longer bumpRestartCount.markStableAftergoroutine that zerosRestartCountonce the new binary has been alive for 60s — caps the brittle early-life window.markGracefulIfCancelledso a panic inRuncannot mask itself as a clean shutdown.attemptRollbackso the success path is unit-testable withoutos.Exit; clearLastUpdate/RestartCount/Gracefulafter a successful rollback to stop the supervisor from re-tripping rollback against the same binary, while preserving version history for post-mortem.os.Statbeforereplace; surface missing backup asos.ErrNotExistfor callers.3.
4594fd5Probe pg_stat features instead of assuming schemaFixes 42P01 / 42703 errors visible on PG18 hosts where
switchToLargestDBlands in a database withoutCREATE EXTENSION pg_stat_statements.relHasColumnprobes viato_regclass.pg_stat_wal.wal_write_time/wal_sync_timeon PG18 (columns moved intopg_stat_io); the rest ofpg_stat_walkeeps emitting.total_timefallback inpg_stat_statements— the column was renamed in PG13/extension 1.8, so falling back triggered 42703 once the extension was actually loaded.collectStatementsearly-returns whenstatementsTimeColis empty so unsupported installs stop spamming errors each scrape.collectStatWALbuilds SELECT-list and Scan args dynamically so a single code path serves PG14..PG18.versionPG18andrelPgStat*name constants so a typo or rename can't silently disable feature detection.4.
fb9bf8dFix slow leak in postgres app-name cacheMemory grew linearly at ~70 MB/day on a Uteka host — root cause was the
appNamesmap storing(queryid, application_name)pairs forever; each process restart with a new pid/uuid suffix minted a fresh entry.map[int64]map[string]time.Time) and addpruneAppNamesto evict pairs older thanappNamesTTL = 1hplus drop newly-empty queryid sub-maps.pruneAppNameson its own 5-minute ticker alongside the existing 1s sample ticker; full walk of a 250k-entry map is ~14 ms, well below the existing 2s sample budget.(active queryids × distinct app_names per hour), independent of agent uptime; topsrv.io caches the data server-side anyway.Test plan
make fmt lint test— 0 issues, all packages greenmake test-integration— full docker-compose stack (PG17 + nginx + angie + VictoriaMetrics) passes, 53 metric families collectedtopsrv_pg_query_calls_totalemits on all three (20 samples each)topsrv_pg_wal_*records/fpi/buffers_full present everywhere;wal_io_timecorrectly omitted on PG18postgresDB only,app_dblarger → switchToLargestDB lands without extension, thenCREATE EXTENSIONat runtime) no longer produces 42P01/42703systemctl restart