Skip to content

refactor(github,gitlab): migrate to shared pipeline + datastore output#224

Open
n0kovo wants to merge 23 commits intopraetorian-inc:mainfrom
n0kovo:refactor/github-gitlab-pipeline
Open

refactor(github,gitlab): migrate to shared pipeline + datastore output#224
n0kovo wants to merge 23 commits intopraetorian-inc:mainfrom
n0kovo:refactor/github-gitlab-pipeline

Conversation

@n0kovo
Copy link
Copy Markdown
Contributor

@n0kovo n0kovo commented May 5, 2026

Summary

Stacked on #223. Migrates titus github and titus gitlab to the shared runPipeline introduced there. Both subcommands now inherit parallel workers, scoring, validation, accessibility detection, retry drain, batched writes, --store-blobs, clone cache, and the full extract/score flag set.

Closes the missing DrainTimedOut() call that previously caused the matcher's retry queue to grow unbounded across an org scan — a likely cause of OOM kills on large organizations (#TODO if there's a tracking issue).

Drops the duplicated flag registration on gitlabCmd/gitlabScanCmd and githubCmd/githubScanCmd. Each subcommand now wires init() through a single registerGitHubFlags / registerGitLabFlags helper that calls the shared addRulesFlags + addOutputFlags + addPipelineFlags. No more flag drift between subcommands.

Breaking change

Default --output for titus github and titus gitlab flips from titus.db (sqlite file) to titus.ds (datastore directory, NoseyParker-style). Existing .db files remain readable via titus report. Pass --output titus.db to keep the old name (will create a directory named titus.db, not a file).

--max-file-size and --context-lines previously hardcoded at 10 MiB / 3 lines for github/gitlab are now flags inherited from the shared pipeline and respect user overrides.

Note for reviewers

This PR is stacked on #223. Please review #223 first; the diff here makes most sense once that's in. If reviewing in parallel, the relevant new symbols (runPipeline, pipelineOpts, addRulesFlags, addOutputFlags, addPipelineFlags) all come from #223.

Test plan

  • go test ./cmd/titus/ -race -count=1 passes.
  • New tests assert titus.ds default + pipeline-flag inheritance (workers, validate, accessibility, store-blobs, ignore) on both github scan and gitlab scan.
  • Manual smoke: titus github --help shows the full inherited pipeline flag set; titus gitlab --help likewise.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

Hey @n0kovo, thanks for taking the time to submit this PR! Our team has been notified and someone will be picking this up shortly.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR centralizes CLI flag registration into three helpers in cmd/titus/flags.go (rules, output, pipeline), extracts the scan pipeline into cmd/titus/pipeline.go (pipelineOpts, runPipeline, upsertFinding), and updates scan entrypoints (scan.go, github.go, gitlab.go) to register shared flags and delegate scanning to runPipeline. Extraction limit handling and auto datastore naming were consolidated (auto names now use the .ds suffix). Tests were added/updated to verify flag registration, default output names, and pipeline filesystem behavior.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/titus/github.go`:
- Around line 124-126: The code still uses resolveAutoName (which returns a
"*.db" name) to handle scanOutputPath == ":auto:", causing legacy .db suffixes;
change this to use the new resolveAutoOutput behavior by replacing the
resolveAutoName call with resolveAutoOutput(githubOrg, githubUser, repo) (or
update resolveAutoName to return a ".ds" suffix) so that scanOutputPath resolves
to the new datastore default; ensure you reference scanOutputPath,
resolveAutoName, resolveAutoOutput, githubOrg, githubUser, and repo when making
the change.

In `@cmd/titus/pipeline.go`:
- Line 119: The errgroup-derived context returned by errgroup.WithContext is
canceled when g.Wait() returns, so stop reusing that canceled context for
post-Wait work: keep the original parent context (e.g., save the incoming ctx to
parentCtx) and call g, gctx := errgroup.WithContext(parentCtx); use gctx for
spawning goroutines and cancellation inside those goroutines, but after g.Wait()
return use parentCtx when calling drainTimedOutMatches and for the cancellation
check (references: errgroup.WithContext, g, gctx, g.Wait(),
drainTimedOutMatches) so post-Wait operations run under the original
non-canceled parent context.

In `@cmd/titus/scan.go`:
- Around line 126-131: The call to runPipeline is hardcoding TokenEnvVar:
"GITHUB_TOKEN" which prevents the configured accessibility token from being
forwarded; change the TokenEnvVar assignment in the pipelineOpts passed to
runPipeline to use the configured accessibility token environment variable (the
variable used elsewhere for accessibility detection) instead of the literal
"GITHUB_TOKEN" so runPipeline and scoring will receive the proper token; locate
the runPipeline(...) call and modify the pipelineOpts.TokenEnvVar field rather
than altering runPipeline/pipelineOpts types.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0da9087f-2fc5-4587-bb8b-aaa9a87ea744

📥 Commits

Reviewing files that changed from the base of the PR and between c1aa36a and 9e8ff46.

📒 Files selected for processing (8)
  • cmd/titus/flags.go
  • cmd/titus/github.go
  • cmd/titus/github_test.go
  • cmd/titus/gitlab.go
  • cmd/titus/gitlab_test.go
  • cmd/titus/pipeline.go
  • cmd/titus/scan.go
  • cmd/titus/scan_test.go

Comment thread cmd/titus/github.go
Comment thread cmd/titus/pipeline.go Outdated
Comment thread cmd/titus/scan.go
- Pass parent ctx to drainTimedOutMatches; the errgroup-derived ctx is
  canceled by g.Wait() and would short-circuit engine.Score for retry-pass
  findings. Pre-existing in runScan; fixed during extraction.
- Extract upsertFinding helper shared by runPipeline workers and
  drainTimedOutMatches so scoring + accessibility semantics cannot drift.
- TestRunPipeline_FilesystemEquivalence now snapshots and restores every
  pipeline global via t.Cleanup so it cannot leak state into sibling tests.
- Refresh drainTimedOutMatches doc to reflect single caller (runPipeline).
@n0kovo n0kovo force-pushed the refactor/github-gitlab-pipeline branch from 9e8ff46 to e727952 Compare May 5, 2026 19:28
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/titus/scan.go (1)

147-159: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Count retried matches independently from finding dedupe.

AddMatch runs for every timeout replay, but matchCount only increments when upsertFinding inserts a new finding. Retries that map to an existing finding are still stored as matches, so the final totals under-report them.

Suggested fix
 	for _, match := range retryMatches {
 		if err := tx.AddMatch(match); err != nil {
 			return fmt.Errorf("storing retry match: %w", err)
 		}
+		matchCount.Add(1)
 		added, err := upsertFinding(ctx, tx, match, ruleMap, engine, resolvedAccess)
 		if err != nil {
 			return err
 		}
 		if added {
 			findingCount.Add(1)
-			matchCount.Add(1)
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/titus/scan.go` around lines 147 - 159, The current loop in ExecBatch over
retryMatches only increments matchCount when upsertFinding returned added, which
under‑reports stored matches; update the logic in the ExecBatch closure (the
loop that calls tx.AddMatch, upsertFinding, and uses matchCount and
findingCount) so that matchCount.Add(1) is called immediately after a successful
tx.AddMatch (or whenever a match is stored) regardless of whether upsertFinding
returned added, while still only incrementing findingCount when added is true
and preserving existing error handling for tx.AddMatch and upsertFinding.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/titus/pipeline.go`:
- Around line 94-98: runPipeline currently reloads credentials from
os.Getenv(opts.TokenEnvVar) and calls ResolveAccessibility without the
caller-provided token, causing unauthenticated accessibility checks; update the
pipeline options to carry the resolved token (add Token to pipelineOpts) and
change runPipeline to use that Token when calling ResolveAccessibility (fall
back to env var only if Token is empty), and update the GitHub/GitLab callers
that already resolve a token to set pipelineOpts.Token = token so the resolved
CLI token is forwarded into accessibility detection.

---

Outside diff comments:
In `@cmd/titus/scan.go`:
- Around line 147-159: The current loop in ExecBatch over retryMatches only
increments matchCount when upsertFinding returned added, which under‑reports
stored matches; update the logic in the ExecBatch closure (the loop that calls
tx.AddMatch, upsertFinding, and uses matchCount and findingCount) so that
matchCount.Add(1) is called immediately after a successful tx.AddMatch (or
whenever a match is stored) regardless of whether upsertFinding returned added,
while still only incrementing findingCount when added is true and preserving
existing error handling for tx.AddMatch and upsertFinding.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c7e1e4f0-4355-4c3e-b316-4a60b380435c

📥 Commits

Reviewing files that changed from the base of the PR and between 9e8ff46 and e727952.

📒 Files selected for processing (7)
  • cmd/titus/github.go
  • cmd/titus/github_test.go
  • cmd/titus/gitlab.go
  • cmd/titus/gitlab_test.go
  • cmd/titus/pipeline.go
  • cmd/titus/scan.go
  • cmd/titus/scan_test.go
✅ Files skipped from review due to trivial changes (1)
  • cmd/titus/github_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • cmd/titus/gitlab_test.go
  • cmd/titus/scan_test.go
  • cmd/titus/github.go

Comment thread cmd/titus/pipeline.go Outdated
- Replace pipelineOpts.TokenEnvVar with Token (resolved value). Callers that
  already resolve a token from --flag-or-env now forward it; runPipeline no
  longer second-guesses by re-reading os.Getenv. Without this, a github/gitlab
  scan with --token but no env var would fall through to unauthenticated
  accessibility detection.
- On context.Canceled from g.Wait(), return nil instead of running drain with
  a (possibly canceled) parent ctx, which produced a misleading
  "retrying timed-out blobs" error on Ctrl-C.
@n0kovo n0kovo force-pushed the refactor/github-gitlab-pipeline branch from e727952 to 8012267 Compare May 5, 2026 19:48
Filesystem targets have no remote, so accessibility detection should
not silently pick up $GITHUB_TOKEN from the environment. Mirrors
runS3Scan's empty-token behavior.
@n0kovo n0kovo force-pushed the refactor/github-gitlab-pipeline branch from 8012267 to 7587d9a Compare May 5, 2026 19:54
…atches

- runRepoScan now passes ExtractArchives / ExtractLimits to the clone
  enumerator so --extract-archives and the extract-limit / sqlite-row
  flags apply to remote-URL scans, matching the filesystem and S3 paths.
- Extracted the duplicate ExtractionLimits resolution into
  resolveExtractionLimits() shared by createEnumerator, runS3Scan, and
  runRepoScan.
- Filesystem target check distinguishes ErrNotExist from other Stat
  failures (permissions, I/O) and wraps the underlying error.
- drainTimedOutMatches increments matchCount per retried match
  unconditionally; only findingCount stays gated on whether a new
  finding was inserted (matches that merge into existing findings still
  count toward the match total).
@n0kovo n0kovo force-pushed the refactor/github-gitlab-pipeline branch from 7587d9a to bd3d2e3 Compare May 5, 2026 19:58
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/titus/gitlab.go (1)

104-113: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

GitLab scan still drops most shared pipeline flags.

Both enumerator configs only forward MaxFileSize here (plus IgnoreFile on the clone path), so --extract-archives, --extract-max-*, --sqlite-row-limit, and --readers are no-ops for GitLab scans, and --ignore-file is skipped entirely in --no-clone mode. Since registerGitLabFlags advertises those flags on both commands, this is a silent behavior gap.

Also applies to: 134-137

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/titus/gitlab.go` around lines 104 - 113, The GitLab enumerator
construction (enum.NewGitLabEnumerator with enum.GitLabConfig) only forwards
MaxFileSize and omits other shared flags and IgnoreFile, causing flags like
--extract-archives, --extract-max-*, --sqlite-row-limit, --readers and
--ignore-file (in no-clone mode) to be ignored; update the enum.GitLabConfig and
the Config passed into NewGitLabEnumerator (both call sites) to populate the
full set of shared fields (ExtractArchives, ExtractMaxSize/ExtractMaxFiles as
applicable, SqliteRowLimit, Readers, and IgnoreFile when --no-clone applies) so
these flags are forwarded to the GitLab enumerator and respected at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/titus/scan.go`:
- Around line 496-501: The clone enumerator invocation only sets MaxFileSize and
IgnoreFile and therefore drops the shared extraction flags; update the
enum.NewCloneEnumerator call to forward the global extraction and reader
settings used by addPipelineFlags(scanCmd) (e.g. pass the variables backing
--extract-archives, --extract-max-*, --sqlite-row-limit, and --readers) into the
enum.Config passed to NewCloneEnumerator (add the corresponding fields on the
Config literal alongside MaxFileSize and IgnoreFile so CloneEnumerator receives
ExtractArchives, extract-max values, SQLiteRowLimit and Readers).
- Around line 152-159: The code only increments matchCount when
upsertFinding(...) returns added=true, undercounting retried matches that call
tx.AddMatch(match) but don't create a new finding; move or add matchCount.Add(1)
so it increments for every match processed (or whenever tx.AddMatch(match)
succeeds) and leave findingCount.Add(1) only when upsertFinding returns
added=true; update the block around tx.AddMatch(match) / upsertFinding(...) so
matchCount is incremented unconditionally for the match case and findingCount
only on new insertions.

---

Outside diff comments:
In `@cmd/titus/gitlab.go`:
- Around line 104-113: The GitLab enumerator construction
(enum.NewGitLabEnumerator with enum.GitLabConfig) only forwards MaxFileSize and
omits other shared flags and IgnoreFile, causing flags like --extract-archives,
--extract-max-*, --sqlite-row-limit, --readers and --ignore-file (in no-clone
mode) to be ignored; update the enum.GitLabConfig and the Config passed into
NewGitLabEnumerator (both call sites) to populate the full set of shared fields
(ExtractArchives, ExtractMaxSize/ExtractMaxFiles as applicable, SqliteRowLimit,
Readers, and IgnoreFile when --no-clone applies) so these flags are forwarded to
the GitLab enumerator and respected at runtime.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d59bf650-8a6c-4ad6-aedf-4ee1d46b945e

📥 Commits

Reviewing files that changed from the base of the PR and between 8012267 and 7587d9a.

📒 Files selected for processing (5)
  • cmd/titus/github.go
  • cmd/titus/github_test.go
  • cmd/titus/gitlab.go
  • cmd/titus/gitlab_test.go
  • cmd/titus/scan.go
✅ Files skipped from review due to trivial changes (1)
  • cmd/titus/gitlab_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/titus/github_test.go
  • cmd/titus/github.go

Comment thread cmd/titus/scan.go
Comment thread cmd/titus/scan.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/titus/gitlab.go`:
- Around line 138-143: The CloneEnumerator config is missing the NumReaders
field so the --readers flag (scanReaders) is ignored; update the
enum.NewCloneEnumerator call that constructs cloneEnum to include NumReaders:
scanReaders in the enum.Config passed in (same pattern as in runRepoScan),
referencing enum.NewCloneEnumerator, enum.Config, cloneEnum and the scanReaders
variable.

In `@cmd/titus/scan.go`:
- Around line 512-519: The --readers flag isn't propagated into the clone and S3
enumerators: update the enum.Config passed to CloneEnumerator (in runRepoScan
where CloneEnumerator is created) and to S3Enumerator (in runS3Scan where
S3Enumerator is created) to include NumReaders: scanReaders so the reader count
from createEnumerator/scanReaders is honored; ensure the field name is
NumReaders and the value comes from the existing scanReaders variable used
elsewhere.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f9ae0809-7499-4eee-a78e-b8a481b8def2

📥 Commits

Reviewing files that changed from the base of the PR and between 7587d9a and bd3d2e3.

📒 Files selected for processing (5)
  • cmd/titus/github.go
  • cmd/titus/github_test.go
  • cmd/titus/gitlab.go
  • cmd/titus/gitlab_test.go
  • cmd/titus/scan.go
✅ Files skipped from review due to trivial changes (2)
  • cmd/titus/github_test.go
  • cmd/titus/gitlab_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/titus/github.go

Comment thread cmd/titus/gitlab.go
Comment thread cmd/titus/scan.go
createEnumerator already passes NumReaders for filesystem; the clone
and S3 paths were silently ignoring --readers.
@n0kovo n0kovo force-pushed the refactor/github-gitlab-pipeline branch from bd3d2e3 to 03adeb6 Compare May 5, 2026 20:06
n0kovo added 7 commits May 5, 2026 22:18
…xtractArchivesFlag in tests

pipelineOpts now carries a resolved Accessibility value instead of a
raw token. Each caller decides how to map --accessibility=auto to a
concrete value:

- runScan filesystem: ResolveAccessibility against the local path
  (env-fallback for GITHUB_TOKEN preserved).
- runRepoScan, runS3Scan: existing semantics preserved.

This unblocks per-caller overrides for the GitHub/GitLab subcommands,
which have no local checkout to inspect and would otherwise auto-resolve
to Private (-25 penalty on every finding).

Also extends resetPipelineGlobals to snapshot/restore scanGit and
scanExtractArchivesFlag so future tests can't leak state through them.
n0kovo added 5 commits May 5, 2026 22:18
resolveAutoName returned "<name>.db" even after the github/gitlab default
flipped to titus.ds in this PR, so --output :auto: produced legacy .db
directories. Update suffix to .ds and refresh TestResolveAutoName.
github/gitlab resolve --token (with env fallback); now pass that resolved
value into runPipeline instead of relying on the pipeline to re-read the
env var. Without this, --token without a matching env var produced
unauthenticated accessibility detection.
Mirrors the runRepoScan fix from praetorian-inc#223: clone-based GitHub and GitLab
scans now honor --extract-archives and the extraction / sqlite-row
limit flags via resolveExtractionLimits().
… scans

Remote-clone subcommands have no local checkout for ResolveAccessibility's
git-remote inspection to read, so 'auto' would conservatively fall back to
Private and silently apply a -25 penalty to every finding. Default 'auto'
to Public for these subcommands; explicit --accessibility=private still
takes precedence.
@n0kovo n0kovo force-pushed the refactor/github-gitlab-pipeline branch from 03adeb6 to b311989 Compare May 5, 2026 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant