perf(index): max_parse_file_size cap + stage profiler + real-project language corpora #11

Merged
andylbrummer merged 2 commits into scip-type-resolution from index-parse-size-cap on Jun 19, 2026
@andylbrummer

Follow-up to #10. Three related pieces, stacked on scip-type-resolution (retarget to main once #9 and #10 land). Reaches release via that merge train.

1. Tree-sitter parse cap (max_parse_file_size, default 2 MB)

Profiled redis (1219 files / 15k symbols / 2.7s): tree-sitter parse is ~58% of index CPU, extraction ~25%, trigram ~17%. Parse is already gated to known-language files, and binaries are filtered with defense in depth (by extension at scan time, plus a magic-number/null-byte content check before any parse). The one residual gap was a multi-MB source file under the 10 MB file cap, which still got a full parse for near-zero symbol value (the generated/minified hazard).

  • Files above index.max_parse_file_size skip the tree-sitter parse + symbol extraction but stay trigram-indexed for text search; the skip is flagged on ProcessedFile, not silently dropped. Setting the value to 0 disables the cap.
  • KDL config field + parsing + validation.
  • Verified: a 4.6 MB generated .c → no symbols, still grep-able. redis re-index symbol count unchanged (no file > 2 MB).
  • MaxParseFileSize.OversizeSourceSkipsParseButStaysSearchable regression test.
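For readers skimming the PR, the gating rule described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this change: only `max_parse_file_size` and `ProcessedFile` are names from the PR. The struct fields, `should_parse`, and `process_file` are hypothetical, and treating 2 MB as 2 × 1024 × 1024 bytes is an assumption.

```cpp
#include <cstdint>
#include <string>

// Hypothetical config holder; only the field name comes from the PR.
struct IndexConfig {
    // Byte threshold above which parsing is skipped; 0 disables the cap.
    // Assumption: "2 MB" is interpreted here as 2 MiB.
    std::uint64_t max_parse_file_size = 2ull * 1024 * 1024;
};

// Hypothetical shape of the per-file result; the real ProcessedFile may differ.
struct ProcessedFile {
    std::string path;
    bool trigram_indexed = false;         // contents remain searchable as text
    bool symbols_extracted = false;       // parse + symbol extraction ran
    bool parse_skipped_oversize = false;  // skip is recorded, not silently dropped
};

// True when the file is at or below the cap, or when the cap is disabled.
inline bool should_parse(std::uint64_t file_size, const IndexConfig& cfg) {
    if (cfg.max_parse_file_size == 0) return true;
    return file_size <= cfg.max_parse_file_size;
}

// Stand-in for the indexing step: trigram indexing always happens,
// parse + extraction only when the size gate allows it.
inline ProcessedFile process_file(const std::string& path,
                                  std::uint64_t file_size,
                                  const IndexConfig& cfg) {
    ProcessedFile out;
    out.path = path;
    out.trigram_indexed = true;
    if (should_parse(file_size, cfg)) {
        out.symbols_extracted = true;
    } else {
        out.parse_skipped_oversize = true;
    }
    return out;
}
```

Under these assumptions, a 4.6 MB generated file comes back with `trigram_indexed` and `parse_skipped_oversize` set and no symbols, which matches the verified behaviour listed above.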

2. Stage profiler

IndexProfile.StageBreakdown is a manual harness (runs when LCI_PROFILE_DIR=<path> is set, skipped otherwise). It reports the parse/extract/trigram split and throughput, and names the slowest files so a pathological input is surfaced. (perf/valgrind are unavailable on this WSL2 kernel; this is the portable substitute.)
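A rough sketch of what such a stage-timing harness could look like is below. It is illustrative only: `FileTiming`, `StageShare`, `profile_dir_from_env`, `time_ms`, `stage_share`, and `slowest_files` are hypothetical names. The PR itself only names `IndexProfile.StageBreakdown` and the `LCI_PROFILE_DIR` environment variable.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical per-file record of time spent in each indexing stage.
struct FileTiming {
    std::string path;
    double parse_ms = 0.0;
    double extract_ms = 0.0;
    double trigram_ms = 0.0;
    double total_ms() const { return parse_ms + extract_ms + trigram_ms; }
};

// Percentage of total time spent in each stage.
struct StageShare {
    double parse_pct;
    double extract_pct;
    double trigram_pct;
};

// Returns the directory to profile, or an empty string when the
// environment variable is unset and the harness should skip.
inline std::string profile_dir_from_env() {
    const char* dir = std::getenv("LCI_PROFILE_DIR");
    return dir ? std::string(dir) : std::string();
}

// Runs one stage (any callable) and returns its wall time in milliseconds.
template <typename Fn>
double time_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Aggregates per-file timings into a parse/extract/trigram split.
inline StageShare stage_share(const std::vector<FileTiming>& files) {
    double parse = 0.0, extract = 0.0, trigram = 0.0;
    for (const FileTiming& f : files) {
        parse += f.parse_ms;
        extract += f.extract_ms;
        trigram += f.trigram_ms;
    }
    const double total = parse + extract + trigram;
    if (total <= 0.0) return StageShare{0.0, 0.0, 0.0};
    return StageShare{100.0 * parse / total, 100.0 * extract / total,
                      100.0 * trigram / total};
}

// Returns the n slowest files by total time, slowest first.
inline std::vector<FileTiming> slowest_files(std::vector<FileTiming> files,
                                             std::size_t n) {
    std::sort(files.begin(), files.end(),
              [](const FileTiming& a, const FileTiming& b) {
                  return a.total_ms() > b.total_ms();
              });
    if (files.size() > n) files.resize(n);
    return files;
}
```

Sorting by total time and printing the top few paths is what makes a single pathological file stand out instead of being averaged into the aggregate split.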

3. Real-project corpora for the 7 new languages

Each scope-type-resolution language now has a real upstream repo + a test asserting receiver-type resolution fires on real code:

| lang | repo | sentinel → qualified callee |
| --- | --- | --- |
| java | gson | toJson → Gson.toJson |
| csharp | serilog | Write → Logger.IsEnabled |
| rust | ripgrep | build → GlobSetBuilder.add |
| php | guzzle | send → Client.sendAsync |
| kotlin | okhttp | intercept → Chain.request |
| ruby | sinatra | call → ExtendedRack.setup_close |
| zig | zls | resolveTypeOfNode → Analyser.resolveBindingOfNode |

Corpora are gitignored (fetched via add-real-projects.sh --full, now covering 13 languages), so repo size is unchanged. Tests skip when absent.
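As a rough illustration of the per-language assertion, the sketch below models the check as a lookup over call edges. The language, repo, sentinel, and callee values are the ones listed in this PR. `CallEdge`, `Expectation`, `resolves_to`, and `missing_languages` are hypothetical names, and treating the sentinel as the caller side of an edge is one reading of "sentinel → qualified callee", not a statement about the real test code.

```cpp
#include <string>
#include <vector>

// Hypothetical call-graph edge: a method and the qualified callee it resolves to.
struct CallEdge {
    std::string caller;
    std::string callee;
};

// One row of the expectations listed in the PR description.
struct Expectation {
    std::string lang;
    std::string repo;
    std::string sentinel;
    std::string callee;
};

// The seven sentinel expectations, copied from the PR description.
inline std::vector<Expectation> real_project_expectations() {
    return {
        {"java", "gson", "toJson", "Gson.toJson"},
        {"csharp", "serilog", "Write", "Logger.IsEnabled"},
        {"rust", "ripgrep", "build", "GlobSetBuilder.add"},
        {"php", "guzzle", "send", "Client.sendAsync"},
        {"kotlin", "okhttp", "intercept", "Chain.request"},
        {"ruby", "sinatra", "call", "ExtendedRack.setup_close"},
        {"zig", "zls", "resolveTypeOfNode", "Analyser.resolveBindingOfNode"},
    };
}

// True when some edge links the sentinel to the receiver-type-qualified callee.
inline bool resolves_to(const std::vector<CallEdge>& edges,
                        const std::string& sentinel,
                        const std::string& qualified_callee) {
    for (const CallEdge& e : edges) {
        if (e.caller == sentinel && e.callee == qualified_callee) return true;
    }
    return false;
}

// Languages whose sentinel did not resolve; empty means all seven passed.
inline std::vector<std::string> missing_languages(
        const std::vector<CallEdge>& edges) {
    std::vector<std::string> missing;
    for (const Expectation& x : real_project_expectations()) {
        if (!resolves_to(edges, x.sentinel, x.callee)) missing.push_back(x.lang);
    }
    return missing;
}
```

An unqualified edge such as `toJson → toJson` would not satisfy the check, which is the point of asserting on the receiver-qualified name.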

Verification

  • Unit 1702/1702 (+1 skipped profiler); RealProjectLanguages.* 7/7; MaxParseFileSize.* green.

🤖 Generated with Claude Code

andylbrummer and others added 2 commits June 18, 2026 01:45
…profiler

Profiling a beefy C corpus (redis: 1219 files / 15k symbols / 2.7s) showed
tree-sitter parse is ~58% of index CPU, extraction ~25%, trigram ~17%. Parse is
correctly gated to known-language files and binaries are filtered (extension +
magic-number/null-byte content check before any parse), but a multi-MB
*source* file under the 10 MB file cap would still get a full parse for
near-zero symbol value — the classic generated/minified hazard.

- New index.max_parse_file_size (default 2 MB, 0 disables). Files above it skip
  the tree-sitter parse + symbol extraction but stay trigram-indexed for text
  search; the skip is flagged on ProcessedFile rather than silently dropped.
- KDL config field + parsing.
- IndexProfile.StageBreakdown: manual stage-timing harness (set LCI_PROFILE_DIR
  to run; skipped otherwise) — reports parse/extract/trigram split, throughput,
  and names the slowest files so a pathological input is surfaced, not hidden.
- MaxParseFileSize.OversizeSourceSkipsParseButStaysSearchable regression test.

redis re-index symbol count unchanged (no file > 2 MB there). Full suite
1702/1702 (+1 skipped profiler).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds a real upstream repo per scope-type-resolution language and a test that
asserts receiver-type resolution fires on real code (not just controlled
corpora). Each repo had no call graph at all before that work.

  java/gson  csharp/serilog  rust/ripgrep  php/guzzle
  kotlin/okhttp  ruby/sinatra  zig/zls

- real_project_languages_test.cpp: per language, index the repo and assert a
  sentinel method resolves to a receiver-type-qualified callee
  (e.g. gson toJson -> Gson.toJson, serilog Write -> Logger.IsEnabled,
  ripgrep build -> GlobSetBuilder.add, okhttp intercept -> Chain.request).
  Skips when the corpus is absent (same pattern as the existing suite).
- scripts/add-real-projects.sh --full now fetches all 16 repos (13 languages);
  .gitignore + mkdir extended for the new language dirs.

Corpora are gitignored (fetched, not committed) so repo size is unchanged.
7/7 language tests pass locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>