perf(index): max_parse_file_size cap + stage profiler + real-project language corpora #11
Merged
…profiler

Profiling a beefy C corpus (redis: 1219 files / 15k symbols / 2.7s) showed tree-sitter parse is ~58% of index CPU, extraction ~25%, trigram ~17%. Parse is correctly gated to known-language files and binaries are filtered (extension + magic-number/null-byte content check before any parse), but a multi-MB *source* file under the 10 MB file cap would still get a full parse for near-zero symbol value: the classic generated/minified hazard.

- New index.max_parse_file_size (default 2 MB, 0 disables). Files above it skip the tree-sitter parse + symbol extraction but stay trigram-indexed for text search; the skip is flagged on ProcessedFile rather than silently dropped.
- KDL config field + parsing.
- IndexProfile.StageBreakdown: manual stage-timing harness (set LCI_PROFILE_DIR to run; skipped otherwise). Reports parse/extract/trigram split, throughput, and names the slowest files so a pathological input is surfaced, not hidden.
- MaxParseFileSize.OversizeSourceSkipsParseButStaysSearchable regression test.

redis re-index symbol count unchanged (no file > 2 MB there). Full suite 1702/1702 (+1 skipped profiler).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
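As a rough illustration of the size gate this commit describes, the sketch below decides per file whether to run the parse or only the trigram index. All names (`IndexLimits`, `FilePlan`, `plan_file`) are hypothetical and not taken from the codebase. It also assumes that files over the 10 MB cap are not indexed at all and that unknown-language files are still trigram-indexed; the commit message states neither.

```cpp
#include <cstdint>

// Hypothetical limits struct; only the two numeric defaults come from the commit message.
struct IndexLimits {
    std::uint64_t max_file_size = 10ull * 1024 * 1024;       // overall file cap (10 MB)
    std::uint64_t max_parse_file_size = 2ull * 1024 * 1024;  // parse cap (2 MB); 0 disables it
};

// Hypothetical per-file plan; the real ProcessedFile type is not shown in this PR.
struct FilePlan {
    bool index_trigrams = false;          // file is searchable as text
    bool parse_symbols = false;           // tree-sitter parse + symbol extraction will run
    bool parse_skipped_oversize = false;  // skip is recorded, not silently dropped
};

FilePlan plan_file(std::uint64_t size_bytes, bool known_language, const IndexLimits& limits) {
    FilePlan plan;
    // Assumption: files above the overall cap are excluded entirely.
    if (size_bytes > limits.max_file_size) {
        return plan;
    }
    plan.index_trigrams = true;

    // Parsing is gated to known-language files.
    if (!known_language) {
        return plan;
    }

    // A cap of 0 disables the size check; otherwise only files strictly above it are skipped.
    const bool cap_enabled = limits.max_parse_file_size != 0;
    if (cap_enabled && size_bytes > limits.max_parse_file_size) {
        plan.parse_skipped_oversize = true;
        return plan;
    }

    plan.parse_symbols = true;
    return plan;
}
```

With the default limits, `plan_file(3ull * 1024 * 1024, true, IndexLimits{})` yields a plan that is trigram-indexed, not parsed, and flagged as an oversize skip. Setting `max_parse_file_size` to 0 restores the parse for the same file.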
Adds a real upstream repo per scope-type-resolution language and a test that asserts receiver-type resolution fires on real code (not just controlled corpora). Each repo had no call graph at all before that work.

  java/gson  csharp/serilog  rust/ripgrep  php/guzzle  kotlin/okhttp  ruby/sinatra  zig/zls

- real_project_languages_test.cpp: per language, index the repo and assert a sentinel method resolves to a receiver-type-qualified callee (e.g. gson toJson -> Gson.toJson, serilog Write -> Logger.IsEnabled, ripgrep build -> GlobSetBuilder.add, okhttp intercept -> Chain.request). Skips when the corpus is absent (same pattern as the existing suite).
- scripts/add-real-projects.sh --full now fetches all 16 repos (13 languages); .gitignore + mkdir extended for the new language dirs.

Corpora are gitignored (fetched, not committed) so repo size is unchanged. 7/7 language tests pass locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
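The shape of the per-language check described in this commit (skip when the corpus is absent, otherwise index and look for a receiver-type-qualified callee) might be sketched as follows. This is not the code in real_project_languages_test.cpp: `CallEdge`, `CorpusCheck`, `Indexer`, and `check_corpus` are hypothetical names, and the indexer is passed in as a callback because the project's real indexing API is not shown here.

```cpp
#include <filesystem>
#include <functional>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical call-graph edge: caller method name and resolved callee name.
struct CallEdge {
    std::string caller;
    std::string callee;
};

enum class CorpusCheck { Skipped, Resolved, Unresolved };

// Hypothetical indexer hook standing in for the project's real indexing entry point.
using Indexer = std::function<std::vector<CallEdge>(const std::filesystem::path&)>;

CorpusCheck check_corpus(const std::filesystem::path& corpus_dir,
                         const Indexer& index_repo,
                         const std::string& sentinel,
                         const std::string& expected_callee) {
    // Corpora are fetched, not committed, so a missing directory means "skip", not "fail".
    std::error_code ec;
    if (!std::filesystem::is_directory(corpus_dir, ec)) {
        return CorpusCheck::Skipped;
    }

    // Index the repo, then look for an edge from the sentinel to the qualified callee.
    for (const CallEdge& edge : index_repo(corpus_dir)) {
        if (edge.caller == sentinel && edge.callee == expected_callee) {
            return CorpusCheck::Resolved;
        }
    }
    return CorpusCheck::Unresolved;
}
```

A test for the gson corpus would call something like `check_corpus(gson_dir, indexer, "toJson", "Gson.toJson")`, report a skip on `Skipped`, and fail only on `Unresolved`.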
Follow-up to #10. Three related pieces; stacked on `scip-type-resolution` (retarget to `main` once #9 and #10 land). Reaches release via that merge train.

## 1. Tree-sitter parse cap (`max_parse_file_size`, default 2 MB)

Profiled redis (1219 files / 15k symbols / 2.7s): tree-sitter parse is ~58% of index CPU, extraction ~25%, trigram ~17%. Parse is already gated to known-language files, and binaries are filtered defense-in-depth (extension at scan + magic-number/null-byte content check before any parse). The one residual: a multi-MB source file under the 10 MB file cap still got a full parse for near-zero symbol value, the generated/minified hazard.

- Files above `index.max_parse_file_size` skip the tree-sitter parse + symbol extraction but stay trigram-indexed for text search; the skip is flagged on `ProcessedFile`, not silently dropped. `0` disables.
- An oversize `.c` file → no symbols, still grep-able. redis re-index symbol count unchanged (no file > 2 MB). `MaxParseFileSize.OversizeSourceSkipsParseButStaysSearchable` regression test.

## 2. Stage profiler

`IndexProfile.StageBreakdown`: manual harness (`LCI_PROFILE_DIR=<path>`, skipped otherwise). Reports parse/extract/trigram split, throughput, and names the slowest files so a pathological input is surfaced. (perf/valgrind unavailable on this WSL2 kernel; this is the portable substitute.)

## 3. Real-project corpora for the 7 new languages

Each scope-type-resolution language now has a real upstream repo + a test asserting receiver-type resolution fires on real code:

| Corpus | Sentinel → resolved callee |
|---|---|
| java/gson | `toJson` → `Gson.toJson` |
| csharp/serilog | `Write` → `Logger.IsEnabled` |
| rust/ripgrep | `build` → `GlobSetBuilder.add` |
| php/guzzle | `send` → `Client.sendAsync` |
| kotlin/okhttp | `intercept` → `Chain.request` |
| ruby/sinatra | `call` → `ExtendedRack.setup_close` |
| zig/zls | `resolveTypeOfNode` → `Analyser.resolveBindingOfNode` |

Corpora are gitignored (fetched via `add-real-projects.sh --full`, now covering 13 languages), so repo size is unchanged. Tests skip when absent.

## Verification

`RealProjectLanguages.*` 7/7; `MaxParseFileSize.*` green.

🤖 Generated with Claude Code
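For readers unfamiliar with this kind of harness, a stage breakdown along the lines of the stage profiler described above can be sketched with `std::chrono`. This is only an illustration under stated assumptions, not the `IndexProfile.StageBreakdown` code: `FileTiming`, `StageBreakdown`, `time_stage`, `summarize`, and `profiling_requested` are hypothetical names. Only the `LCI_PROFILE_DIR` variable and the three stage names come from the description.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

using Millis = std::chrono::duration<double, std::milli>;

// Hypothetical per-file record of time spent in each indexing stage.
struct FileTiming {
    std::string path;
    Millis parse{0.0};
    Millis extract{0.0};
    Millis trigram{0.0};
    Millis total() const { return parse + extract + trigram; }
};

// Hypothetical summary: percentage split per stage plus the slowest file paths.
struct StageBreakdown {
    double parse_pct = 0.0;
    double extract_pct = 0.0;
    double trigram_pct = 0.0;
    std::vector<std::string> slowest;
};

// The harness is opt-in: it runs only when LCI_PROFILE_DIR is set and non-empty.
bool profiling_requested() {
    const char* dir = std::getenv("LCI_PROFILE_DIR");
    return dir != nullptr && *dir != '\0';
}

// Time a single stage with a monotonic clock.
template <typename Fn>
Millis time_stage(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    return Millis(std::chrono::steady_clock::now() - start);
}

// Aggregate per-file timings into a stage split and name the top_n slowest files.
StageBreakdown summarize(std::vector<FileTiming> timings, std::size_t top_n) {
    StageBreakdown out;
    Millis parse{0.0}, extract{0.0}, trigram{0.0};
    for (const FileTiming& t : timings) {
        parse += t.parse;
        extract += t.extract;
        trigram += t.trigram;
    }
    const double total = (parse + extract + trigram).count();
    if (total > 0.0) {
        out.parse_pct = 100.0 * parse.count() / total;
        out.extract_pct = 100.0 * extract.count() / total;
        out.trigram_pct = 100.0 * trigram.count() / total;
    }
    // Sort slowest-first so a pathological input shows up at the top.
    std::sort(timings.begin(), timings.end(),
              [](const FileTiming& a, const FileTiming& b) { return a.total() > b.total(); });
    const std::size_t n = std::min(top_n, timings.size());
    for (std::size_t i = 0; i < n; ++i) {
        out.slowest.push_back(timings[i].path);
    }
    return out;
}
```

A caller would wrap each stage in `time_stage`, collect one `FileTiming` per file, and print the result of `summarize` only when `profiling_requested()` returns true. Throughput reporting is omitted here to keep the sketch short.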