perf(index): max_parse_file_size cap + stage profiler + real-project language corpora by andylbrummer · Pull Request #11 · standardbeagle/lci-cpp

andylbrummer · 2026-06-18T22:51:39Z

Follow-up to #10. Three related pieces, stacked on scip-type-resolution (retarget to main once #9 and #10 land). Reaches release via that merge train.

1. Tree-sitter parse cap (`max_parse_file_size`, default 2 MB)

Profiled redis (1219 files / 15k symbols / 2.7s): tree-sitter parse is ~58% of index CPU, extraction ~25%, trigram ~17%. Parse is already gated to known-language files, and binaries are filtered twice as defense in depth (by extension at scan time, then by a magic-number/null-byte content check before any parse). The one residual: a multi-MB source file under the 10 MB file cap still got a full parse for near-zero symbol value. This is the generated/minified-file hazard.

- Files above index.max_parse_file_size skip the tree-sitter parse + symbol extraction but stay trigram-indexed for text search; the skip is flagged on ProcessedFile, not silently dropped. 0 disables.
- KDL config field + parsing + validation.
- Verified: a 4.6 MB generated .c → no symbols, still grep-able. redis re-index symbol count unchanged (no file > 2 MB).
- MaxParseFileSize.OversizeSourceSkipsParseButStaysSearchable regression test.
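The gating rule described in this section can be sketched as below. The struct layouts, the flag name `parse_skipped_oversize`, and the reading of "2 MB" as 2 × 1024 × 1024 bytes are assumptions; only the config name `max_parse_file_size`, the `ProcessedFile` type name, and the behavior (skip parse and extraction, keep trigram indexing, 0 disables) come from the description.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the index config; field layout is assumed.
struct IndexConfig {
    // Mirrors index.max_parse_file_size in bytes. 0 disables the cap.
    // Treating "2 MB" as 2 * 1024 * 1024 is an assumption.
    std::uint64_t max_parse_file_size = 2ull * 1024 * 1024;
};

// Hypothetical stand-in for ProcessedFile; the flag names are assumed.
struct ProcessedFile {
    std::string path;
    bool trigram_indexed = false;
    bool symbols_extracted = false;
    bool parse_skipped_oversize = false;
};

// True when the cap is enabled and the file is strictly above it.
inline bool exceeds_parse_cap(std::uint64_t size_bytes, const IndexConfig& cfg) {
    return cfg.max_parse_file_size != 0 && size_bytes > cfg.max_parse_file_size;
}

inline ProcessedFile process_file(const std::string& path,
                                  std::uint64_t size_bytes,
                                  const IndexConfig& cfg) {
    ProcessedFile out;
    out.path = path;
    out.trigram_indexed = true;  // text search covers the file either way

    if (exceeds_parse_cap(size_bytes, cfg)) {
        out.parse_skipped_oversize = true;  // flagged, not silently dropped
        return out;
    }

    // Placeholder for the real tree-sitter parse + symbol extraction.
    out.symbols_extracted = true;
    return out;
}
```

With the default config, a 4.6 MB generated file comes back trigram-indexed with the skip flag set and no symbols, while a small file takes the normal path. Setting the cap to 0 restores the full parse for every file.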

2. Stage profiler

IndexProfile.StageBreakdown — manual harness (LCI_PROFILE_DIR=<path>, skipped otherwise). Reports parse/extract/trigram split, throughput, and names the slowest files so a pathological input is surfaced. (perf/valgrind unavailable on this WSL2 kernel; this is the portable substitute.)
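A stage breakdown of this kind can be approximated as follows. This is a minimal sketch under assumptions: the `FileTiming` and `StageBreakdown` structs and the `summarize` function are hypothetical, and only the `LCI_PROFILE_DIR` environment variable and the parse/extract/trigram split come from the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical per-file record of time spent in each stage, in milliseconds.
struct FileTiming {
    std::string path;
    double parse_ms;
    double extract_ms;
    double trigram_ms;
    double total() const { return parse_ms + extract_ms + trigram_ms; }
};

// Hypothetical summary: percentage per stage plus the slowest file paths.
struct StageBreakdown {
    double parse_pct = 0;
    double extract_pct = 0;
    double trigram_pct = 0;
    std::vector<std::string> slowest;
};

// The harness runs only when LCI_PROFILE_DIR is set to a non-empty value.
inline bool profiling_enabled() {
    const char* dir = std::getenv("LCI_PROFILE_DIR");
    return dir != nullptr && *dir != '\0';
}

inline StageBreakdown summarize(std::vector<FileTiming> timings, std::size_t top_n) {
    StageBreakdown out;
    double parse = 0, extract = 0, trigram = 0;
    for (const FileTiming& t : timings) {
        parse += t.parse_ms;
        extract += t.extract_ms;
        trigram += t.trigram_ms;
    }
    const double total = parse + extract + trigram;
    if (total > 0) {
        out.parse_pct = 100.0 * parse / total;
        out.extract_pct = 100.0 * extract / total;
        out.trigram_pct = 100.0 * trigram / total;
    }

    // Sort slowest first so a pathological input shows up at the top.
    std::sort(timings.begin(), timings.end(),
              [](const FileTiming& a, const FileTiming& b) {
                  return a.total() > b.total();
              });
    const std::size_t n = std::min(top_n, timings.size());
    for (std::size_t i = 0; i < n; ++i) out.slowest.push_back(timings[i].path);
    return out;
}
```

A test harness would check `profiling_enabled()` first and skip when it returns false, then collect one `FileTiming` per indexed file (for example with `std::chrono::steady_clock`) and print the result of `summarize`.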

3. Real-project corpora for the 7 new languages

Each scope-type-resolution language now has a real upstream repo + a test asserting receiver-type resolution fires on real code:

| lang | repo | sentinel → qualified callee |
|---|---|---|
| java | gson | `toJson` → `Gson.toJson` |
| csharp | serilog | `Write` → `Logger.IsEnabled` |
| rust | ripgrep | `build` → `GlobSetBuilder.add` |
| php | guzzle | `send` → `Client.sendAsync` |
| kotlin | okhttp | `intercept` → `Chain.request` |
| ruby | sinatra | `call` → `ExtendedRack.setup_close` |
| zig | zls | `resolveTypeOfNode` → `Analyser.resolveBindingOfNode` |

Corpora are gitignored (fetched via add-real-projects.sh --full, now covering 13 languages), so repo size is unchanged. Tests skip when absent.

Verification

Unit 1702/1702 (+1 skipped profiler); RealProjectLanguages.* 7/7; MaxParseFileSize.* green.

🤖 Generated with Claude Code

…profiler

Profiling a beefy C corpus (redis: 1219 files / 15k symbols / 2.7s) showed tree-sitter parse is ~58% of index CPU, extraction ~25%, trigram ~17%. Parse is correctly gated to known-language files and binaries are filtered (extension + magic-number/null-byte content check before any parse), but a multi-MB *source* file under the 10 MB file cap would still get a full parse for near-zero symbol value: the classic generated/minified hazard.

- New index.max_parse_file_size (default 2 MB, 0 disables). Files above it skip the tree-sitter parse + symbol extraction but stay trigram-indexed for text search; the skip is flagged on ProcessedFile rather than silently dropped.
- KDL config field + parsing.
- IndexProfile.StageBreakdown: manual stage-timing harness (set LCI_PROFILE_DIR to run; skipped otherwise). Reports parse/extract/trigram split, throughput, and names the slowest files so a pathological input is surfaced, not hidden.
- MaxParseFileSize.OversizeSourceSkipsParseButStaysSearchable regression test.

redis re-index symbol count unchanged (no file > 2 MB there). Full suite 1702/1702 (+1 skipped profiler).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds a real upstream repo per scope-type-resolution language and a test that asserts receiver-type resolution fires on real code (not just controlled corpora). Each repo had no call graph at all before that work.

java/gson csharp/serilog rust/ripgrep php/guzzle kotlin/okhttp ruby/sinatra zig/zls

- real_project_languages_test.cpp: per language, index the repo and assert a sentinel method resolves to a receiver-type-qualified callee (e.g. gson toJson -> Gson.toJson, serilog Write -> Logger.IsEnabled, ripgrep build -> GlobSetBuilder.add, okhttp intercept -> Chain.request). Skips when the corpus is absent (same pattern as the existing suite).
- scripts/add-real-projects.sh --full now fetches all 16 repos (13 languages); .gitignore + mkdir extended for the new language dirs.

Corpora are gitignored (fetched, not committed) so repo size is unchanged. 7/7 language tests pass locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

andylbrummer and others added 2 commits June 18, 2026 01:45

andylbrummer merged commit 245c2b7 into scip-type-resolution Jun 19, 2026

andylbrummer mentioned this pull request Jun 19, 2026

feat: type resolution (all 13 langs) + parse cap + real-project corpora [train] #12

Merged

andylbrummer merged 2 commits into scip-type-resolution from index-parse-size-cap
