refactor(github,gitlab): migrate to shared pipeline + datastore output#224
refactor(github,gitlab): migrate to shared pipeline + datastore output#224n0kovo wants to merge 23 commits intopraetorian-inc:mainfrom
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Hey @n0kovo, thanks for taking the time to submit this PR! Our team has been notified and someone will be picking this up shortly. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR centralizes CLI flag registration into three helpers in cmd/titus/flags.go (rules, output, pipeline), extracts the scan pipeline into cmd/titus/pipeline.go (pipelineOpts, runPipeline, upsertFinding), and updates scan entrypoints (scan.go, github.go, gitlab.go) to register shared flags and delegate scanning to runPipeline. Extraction limit handling and auto datastore naming were consolidated (auto names now use the .ds suffix). Tests were added/updated to verify flag registration, default output names, and pipeline filesystem behavior. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/titus/github.go`:
- Around line 124-126: The code still uses resolveAutoName (which returns a
"*.db" name) to handle scanOutputPath == ":auto:", causing legacy .db suffixes;
change this to use the new resolveAutoOutput behavior by replacing the
resolveAutoName call with resolveAutoOutput(githubOrg, githubUser, repo) (or
update resolveAutoName to return a ".ds" suffix) so that scanOutputPath resolves
to the new datastore default; ensure you reference scanOutputPath,
resolveAutoName, resolveAutoOutput, githubOrg, githubUser, and repo when making
the change.
In `@cmd/titus/pipeline.go`:
- Line 119: The errgroup-derived context returned by errgroup.WithContext is
canceled when g.Wait() returns, so stop reusing that canceled context for
post-Wait work: keep the original parent context (e.g., save the incoming ctx to
parentCtx) and call g, gctx := errgroup.WithContext(parentCtx); use gctx for
spawning goroutines and cancellation inside those goroutines, but after g.Wait()
return use parentCtx when calling drainTimedOutMatches and for the cancellation
check (references: errgroup.WithContext, g, gctx, g.Wait(),
drainTimedOutMatches) so post-Wait operations run under the original
non-canceled parent context.
In `@cmd/titus/scan.go`:
- Around line 126-131: The call to runPipeline is hardcoding TokenEnvVar:
"GITHUB_TOKEN" which prevents the configured accessibility token from being
forwarded; change the TokenEnvVar assignment in the pipelineOpts passed to
runPipeline to use the configured accessibility token environment variable (the
variable used elsewhere for accessibility detection) instead of the literal
"GITHUB_TOKEN" so runPipeline and scoring will receive the proper token; locate
the runPipeline(...) call and modify the pipelineOpts.TokenEnvVar field rather
than altering runPipeline/pipelineOpts types.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0da9087f-2fc5-4587-bb8b-aaa9a87ea744
📒 Files selected for processing (8)
cmd/titus/flags.gocmd/titus/github.gocmd/titus/github_test.gocmd/titus/gitlab.gocmd/titus/gitlab_test.gocmd/titus/pipeline.gocmd/titus/scan.gocmd/titus/scan_test.go
- Pass parent ctx to drainTimedOutMatches; the errgroup-derived ctx is canceled by g.Wait() and would short-circuit engine.Score for retry-pass findings. Pre-existing in runScan; fixed during extraction. - Extract upsertFinding helper shared by runPipeline workers and drainTimedOutMatches so scoring + accessibility semantics cannot drift. - TestRunPipeline_FilesystemEquivalence now snapshots and restores every pipeline global via t.Cleanup so it cannot leak state into sibling tests. - Refresh drainTimedOutMatches doc to reflect single caller (runPipeline).
9e8ff46 to
e727952
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/titus/scan.go (1)
147-159:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winCount retried matches independently from finding dedupe.
AddMatchruns for every timeout replay, butmatchCountonly increments whenupsertFindinginserts a new finding. Retries that map to an existing finding are still stored as matches, so the final totals under-report them.Suggested fix
for _, match := range retryMatches { if err := tx.AddMatch(match); err != nil { return fmt.Errorf("storing retry match: %w", err) } + matchCount.Add(1) added, err := upsertFinding(ctx, tx, match, ruleMap, engine, resolvedAccess) if err != nil { return err } if added { findingCount.Add(1) - matchCount.Add(1) } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/titus/scan.go` around lines 147 - 159, The current loop in ExecBatch over retryMatches only increments matchCount when upsertFinding returned added, which under‑reports stored matches; update the logic in the ExecBatch closure (the loop that calls tx.AddMatch, upsertFinding, and uses matchCount and findingCount) so that matchCount.Add(1) is called immediately after a successful tx.AddMatch (or whenever a match is stored) regardless of whether upsertFinding returned added, while still only incrementing findingCount when added is true and preserving existing error handling for tx.AddMatch and upsertFinding.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/titus/pipeline.go`:
- Around line 94-98: runPipeline currently reloads credentials from
os.Getenv(opts.TokenEnvVar) and calls ResolveAccessibility without the
caller-provided token, causing unauthenticated accessibility checks; update the
pipeline options to carry the resolved token (add Token to pipelineOpts) and
change runPipeline to use that Token when calling ResolveAccessibility (fall
back to env var only if Token is empty), and update the GitHub/GitLab callers
that already resolve a token to set pipelineOpts.Token = token so the resolved
CLI token is forwarded into accessibility detection.
---
Outside diff comments:
In `@cmd/titus/scan.go`:
- Around line 147-159: The current loop in ExecBatch over retryMatches only
increments matchCount when upsertFinding returned added, which under‑reports
stored matches; update the logic in the ExecBatch closure (the loop that calls
tx.AddMatch, upsertFinding, and uses matchCount and findingCount) so that
matchCount.Add(1) is called immediately after a successful tx.AddMatch (or
whenever a match is stored) regardless of whether upsertFinding returned added,
while still only incrementing findingCount when added is true and preserving
existing error handling for tx.AddMatch and upsertFinding.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c7e1e4f0-4355-4c3e-b316-4a60b380435c
📒 Files selected for processing (7)
cmd/titus/github.gocmd/titus/github_test.gocmd/titus/gitlab.gocmd/titus/gitlab_test.gocmd/titus/pipeline.gocmd/titus/scan.gocmd/titus/scan_test.go
✅ Files skipped from review due to trivial changes (1)
- cmd/titus/github_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- cmd/titus/gitlab_test.go
- cmd/titus/scan_test.go
- cmd/titus/github.go
- Replace pipelineOpts.TokenEnvVar with Token (resolved value). Callers that already resolve a token from --flag-or-env now forward it; runPipeline no longer second-guesses by re-reading os.Getenv. Without this, a github/gitlab scan with --token but no env var would fall through to unauthenticated accessibility detection. - On context.Canceled from g.Wait(), return nil instead of running drain with a (possibly canceled) parent ctx, which produced a misleading "retrying timed-out blobs" error on Ctrl-C.
e727952 to
8012267
Compare
Filesystem targets have no remote, so accessibility detection should not silently pick up $GITHUB_TOKEN from the environment. Mirrors runS3Scan's empty-token behavior.
8012267 to
7587d9a
Compare
…atches - runRepoScan now passes ExtractArchives / ExtractLimits to the clone enumerator so --extract-archives and the extract-limit / sqlite-row flags apply to remote-URL scans, matching the filesystem and S3 paths. - Extracted the duplicate ExtractionLimits resolution into resolveExtractionLimits() shared by createEnumerator, runS3Scan, and runRepoScan. - Filesystem target check distinguishes ErrNotExist from other Stat failures (permissions, I/O) and wraps the underlying error. - drainTimedOutMatches increments matchCount per retried match unconditionally; only findingCount stays gated on whether a new finding was inserted (matches that merge into existing findings still count toward the match total).
7587d9a to
bd3d2e3
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/titus/gitlab.go (1)
104-113:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGitLab scan still drops most shared pipeline flags.
Both enumerator configs only forward
MaxFileSizehere (plusIgnoreFileon the clone path), so--extract-archives,--extract-max-*,--sqlite-row-limit, and--readersare no-ops for GitLab scans, and--ignore-fileis skipped entirely in--no-clonemode. SinceregisterGitLabFlagsadvertises those flags on both commands, this is a silent behavior gap.Also applies to: 134-137
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/titus/gitlab.go` around lines 104 - 113, The GitLab enumerator construction (enum.NewGitLabEnumerator with enum.GitLabConfig) only forwards MaxFileSize and omits other shared flags and IgnoreFile, causing flags like --extract-archives, --extract-max-*, --sqlite-row-limit, --readers and --ignore-file (in no-clone mode) to be ignored; update the enum.GitLabConfig and the Config passed into NewGitLabEnumerator (both call sites) to populate the full set of shared fields (ExtractArchives, ExtractMaxSize/ExtractMaxFiles as applicable, SqliteRowLimit, Readers, and IgnoreFile when --no-clone applies) so these flags are forwarded to the GitLab enumerator and respected at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/titus/scan.go`:
- Around line 496-501: The clone enumerator invocation only sets MaxFileSize and
IgnoreFile and therefore drops the shared extraction flags; update the
enum.NewCloneEnumerator call to forward the global extraction and reader
settings used by addPipelineFlags(scanCmd) (e.g. pass the variables backing
--extract-archives, --extract-max-*, --sqlite-row-limit, and --readers) into the
enum.Config passed to NewCloneEnumerator (add the corresponding fields on the
Config literal alongside MaxFileSize and IgnoreFile so CloneEnumerator receives
ExtractArchives, extract-max values, SQLiteRowLimit and Readers).
- Around line 152-159: The code only increments matchCount when
upsertFinding(...) returns added=true, undercounting retried matches that call
tx.AddMatch(match) but don't create a new finding; move or add matchCount.Add(1)
so it increments for every match processed (or whenever tx.AddMatch(match)
succeeds) and leave findingCount.Add(1) only when upsertFinding returns
added=true; update the block around tx.AddMatch(match) / upsertFinding(...) so
matchCount is incremented unconditionally for the match case and findingCount
only on new insertions.
---
Outside diff comments:
In `@cmd/titus/gitlab.go`:
- Around line 104-113: The GitLab enumerator construction
(enum.NewGitLabEnumerator with enum.GitLabConfig) only forwards MaxFileSize and
omits other shared flags and IgnoreFile, causing flags like --extract-archives,
--extract-max-*, --sqlite-row-limit, --readers and --ignore-file (in no-clone
mode) to be ignored; update the enum.GitLabConfig and the Config passed into
NewGitLabEnumerator (both call sites) to populate the full set of shared fields
(ExtractArchives, ExtractMaxSize/ExtractMaxFiles as applicable, SqliteRowLimit,
Readers, and IgnoreFile when --no-clone applies) so these flags are forwarded to
the GitLab enumerator and respected at runtime.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d59bf650-8a6c-4ad6-aedf-4ee1d46b945e
📒 Files selected for processing (5)
cmd/titus/github.gocmd/titus/github_test.gocmd/titus/gitlab.gocmd/titus/gitlab_test.gocmd/titus/scan.go
✅ Files skipped from review due to trivial changes (1)
- cmd/titus/gitlab_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- cmd/titus/github_test.go
- cmd/titus/github.go
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/titus/gitlab.go`:
- Around line 138-143: The CloneEnumerator config is missing the NumReaders
field so the --readers flag (scanReaders) is ignored; update the
enum.NewCloneEnumerator call that constructs cloneEnum to include NumReaders:
scanReaders in the enum.Config passed in (same pattern as in runRepoScan),
referencing enum.NewCloneEnumerator, enum.Config, cloneEnum and the scanReaders
variable.
In `@cmd/titus/scan.go`:
- Around line 512-519: The --readers flag isn't propagated into the clone and S3
enumerators: update the enum.Config passed to CloneEnumerator (in runRepoScan
where CloneEnumerator is created) and to S3Enumerator (in runS3Scan where
S3Enumerator is created) to include NumReaders: scanReaders so the reader count
from createEnumerator/scanReaders is honored; ensure the field name is
NumReaders and the value comes from the existing scanReaders variable used
elsewhere.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f9ae0809-7499-4eee-a78e-b8a481b8def2
📒 Files selected for processing (5)
cmd/titus/github.gocmd/titus/github_test.gocmd/titus/gitlab.gocmd/titus/gitlab_test.gocmd/titus/scan.go
✅ Files skipped from review due to trivial changes (2)
- cmd/titus/github_test.go
- cmd/titus/gitlab_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/titus/github.go
createEnumerator already passes NumReaders for filesystem; the clone and S3 paths were silently ignoring --readers.
bd3d2e3 to
03adeb6
Compare
…xtractArchivesFlag in tests pipelineOpts now carries a resolved Accessibility value instead of a raw token. Each caller decides how to map --accessibility=auto to a concrete value: - runScan filesystem: ResolveAccessibility against the local path (env-fallback for GITHUB_TOKEN preserved). - runRepoScan, runS3Scan: existing semantics preserved. This unblocks per-caller overrides for the GitHub/GitLab subcommands, which have no local checkout to inspect and would otherwise auto-resolve to Private (-25 penalty on every finding). Also extends resetPipelineGlobals to snapshot/restore scanGit and scanExtractArchivesFlag so future tests can't leak state through them.
resolveAutoName returned "<name>.db" even after the github/gitlab default flipped to titus.ds in this PR, so --output :auto: produced legacy .db directories. Update suffix to .ds and refresh TestResolveAutoName.
github/gitlab resolve --token (with env fallback); now pass that resolved value into runPipeline instead of relying on the pipeline to re-read the env var. Without this, --token without a matching env var produced unauthenticated accessibility detection.
Mirrors the runRepoScan fix from praetorian-inc#223: clone-based GitHub and GitLab scans now honor --extract-archives and the extraction / sqlite-row limit flags via resolveExtractionLimits().
… scans Remote-clone subcommands have no local checkout for ResolveAccessibility's git-remote inspection to read, so 'auto' would conservatively fall back to Private and silently apply a -25 penalty to every finding. Default 'auto' to Public for these subcommands; explicit --accessibility=private still takes precedence.
03adeb6 to
b311989
Compare
Summary
Stacked on #223. Migrates
titus githubandtitus gitlabto the sharedrunPipelineintroduced there. Both subcommands now inherit parallel workers, scoring, validation, accessibility detection, retry drain, batched writes,--store-blobs, clone cache, and the full extract/score flag set.Closes the missing
DrainTimedOut()call that previously caused the matcher's retry queue to grow unbounded across an org scan — a likely cause of OOM kills on large organizations (#TODO if there's a tracking issue).Drops the duplicated flag registration on
gitlabCmd/gitlabScanCmdandgithubCmd/githubScanCmd. Each subcommand now wires init() through a singleregisterGitHubFlags/registerGitLabFlagshelper that calls the sharedaddRulesFlags+addOutputFlags+addPipelineFlags. No more flag drift between subcommands.Breaking change
Default
--outputfortitus githubandtitus gitlabflips fromtitus.db(sqlite file) totitus.ds(datastore directory, NoseyParker-style). Existing.dbfiles remain readable viatitus report. Pass--output titus.dbto keep the old name (will create a directory namedtitus.db, not a file).--max-file-sizeand--context-linespreviously hardcoded at 10 MiB / 3 lines for github/gitlab are now flags inherited from the shared pipeline and respect user overrides.Note for reviewers
This PR is stacked on #223. Please review #223 first; the diff here makes most sense once that's in. If reviewing in parallel, the relevant new symbols (
runPipeline,pipelineOpts,addRulesFlags,addOutputFlags,addPipelineFlags) all come from #223.Test plan
go test ./cmd/titus/ -race -count=1passes.titus.dsdefault + pipeline-flag inheritance (workers, validate, accessibility, store-blobs, ignore) on bothgithub scanandgitlab scan.titus github --helpshows the full inherited pipeline flag set;titus gitlab --helplikewise.