Add Classifications (for example test / non-test code) #1304
Open
marhel wants to merge 14 commits into XAMPPRocky:master from
Conversation
```
thread 'main' (739870) panicked at src/cli.rs:323:22:
Mismatch between definition and access of `file_input`. Could not downcast to &str, need to downcast to alloc::string::String
thread 'main' (740008) panicked at src/cli.rs:292:14:
Mismatch between definition and access of `streaming`. Could not downcast to &str, need to downcast to alloc::string::String
```
Change children map from `BTreeMap<LanguageType, Vec<Report>>` to `BTreeMap<String, Vec<Report>>` to allow adding child reports under arbitrary names. This is a pure refactoring with no behavior changes.
Adds `--classify` (`-k`) flag to specify file classification patterns. Supports both language-specific patterns using the format `Language:CategoryName:pattern` and generic patterns using the format `CategoryName:pattern`.

Also refactors cli.rs to simplify testing of CLI parsing:
- Extract `build_command()` and `from_matches()` helper methods
- Add `from_args_with()` test helper
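The two spelled-out pattern formats above can be sketched as a small parser. This is an illustrative stand-in, not tokei's actual implementation; `parse_pattern` and its return shape are hypothetical names chosen for the example.

```rust
// Hypothetical sketch: split a classification spec into
// (optional language, category, glob pattern). Not tokei's real code.
fn parse_pattern(spec: &str) -> (Option<String>, String, String) {
    let parts: Vec<&str> = spec.splitn(3, ':').collect();
    match parts.as_slice() {
        // Language:CategoryName:pattern (language-specific)
        [lang, category, pattern] => (
            Some(lang.to_string()),
            category.to_string(),
            pattern.to_string(),
        ),
        // CategoryName:pattern (applies to every language)
        [category, pattern] => (None, category.to_string(), pattern.to_string()),
        // A bare name; the folder shorthand described later handles this case
        _ => (None, spec.to_string(), String::new()),
    }
}

fn main() {
    assert_eq!(
        parse_pattern("Rust:Tests:**/*_test.rs"),
        (
            Some("Rust".to_string()),
            "Tests".to_string(),
            "**/*_test.rs".to_string()
        )
    );
    assert_eq!(
        parse_pattern("Tests:**/*.spec.ts"),
        (None, "Tests".to_string(), "**/*.spec.ts".to_string())
    );
}
```

Using `splitn(3, ':')` keeps any further colons inside the glob pattern itself, so only the first one or two colons act as separators.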
- Add `classifications` field to `Config`
- Support reading classifications from config files (tokei.toml/.tokeirc)
- Add CLI override for classifications (CLI takes precedence over config files)
- Update `Config::from_config_files` to merge classifications from all sources
This field will be used to tag files with an optional classification during processing, so that they can be separated in the output statistics.
Also adds `classify_file`, which provides the classification, if any, of a path given a language type and classification patterns.
Classification patterns are matched relative to the base paths.
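Matching relative to the base paths can be sketched with the standard library's `Path::strip_prefix`; the glob matching itself is omitted here, and `relative_for_matching` is an illustrative helper name, not part of tokei.

```rust
use std::path::Path;

// Sketch: make a file path relative to the scanned base path before
// handing it to the glob matcher, falling back to the full path when
// the file is not under the base.
fn relative_for_matching<'a>(file: &'a Path, base: &Path) -> &'a Path {
    file.strip_prefix(base).unwrap_or(file)
}

fn main() {
    let rel = relative_for_matching(Path::new("/repo/fuzz/README.md"), Path::new("/repo"));
    // A pattern like fuzz/**/* can now match regardless of where the repo lives.
    assert_eq!(rel, Path::new("fuzz/README.md"));
}
```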
When listing files, include the classification, if any, in the "# of files" column, which is otherwise unused for individual files. Example (a subset of the output of running tokei on itself):

```
$ tokei --classify FUZZING:fuzz/**/* --files --compact
────────────────────────────────────────────────────────────────────────────────────────
 Markdown                            5         1735            0         1403        332
────────────────────────────────────────────────────────────────────────────────────────
 ./CHANGELOG.md                                 898            0          710        188
 ./CODE_OF_CONDUCT.md                            46            0           28         18
 ./CONTRIBUTING.md                              164            0          124         40
 ./README.md                                    597            0          521         76
 ./fuzz/README.md              FUZZING           30            0           20         10
────────────────────────────────────────────────────────────────────────────────────────
 Rust                               25         5132         4286          182        664
────────────────────────────────────────────────────────────────────────────────────────
 ./build.rs                                     166          137            1         28
 |gets/parse_from_slice.rs     FUZZING           52           40            6          6
 |arse_from_slice_panic.rs     FUZZING            9            7            0          2
 |arse_from_slice_total.rs     FUZZING            9            7            0          2
 ./src/classification.rs                        180          143           10         27
 ./src/cli.rs                                   643          567           20         56
```
Adds `--no-classify` (`-K`) flag to ignore classifications from config files. This way, the user can have sensible default patterns in a home-folder config file, but ignore them if those patterns aren't working well with the code in a particular folder.
This prevents the file name (the last line) from overflowing into the line count column. Example:

```
-- ./src/language/language_type.tera.rs ------------------------------------------------
 |- Rust                                        358          314            5         39
 |- Markdown                                    112            0           95         17
 |/language/language_type.tera.rs               358          314            5         39
```
When classifications are not used, output remains unchanged. When classifications are enabled (via `--classify` or read from config), classified files are separated and displayed with their classification names in the terminal output right after embedded languages, using the prefix `|>` for classifications, to differentiate them from embedded languages, which still use the prefix `|-`.

The summary statistics will show:
- Unclassified file stats (from regular reports)
- Embedded language stats (from all files, both classified and unclassified)
- Classified file stats
- The (Total) summary, which will include a file count of unclassified + classified files.

Example:

```
$ tokei --classify 'FUZZING:fuzz/**/*' --types Rust,Toml
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language            Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Rust                   22         5682         4673          256          753
 |- Markdown            15          427            5          366           56
 |> FUZZING              3           70           54            6           10
 (Total)                25         6179         4732          628          819
─────────────────────────────────────────────────────────────────────────────────
 TOML                    2          101           87            4           10
 |> FUZZING              1           33           25            1            7
 (Total)                 3          134          112            5           17
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                  28         6313         4844          633          836
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
Adds `--classify-unmatched` (`-u`) flag to specify a classification fallback. Creates a (low-priority) catch-all pattern to match all files not matching other classifications. Supports both global and language-specific fallbacks: `PROD` or `C#:PROD`. Note that you should only specify a classification name without a pattern. Examples:

```
tokei --classify-unmatched PROD
tokei --classify "Tests:**/*_test.js" --classify-unmatched "C#:PROD" --classify-unmatched UTILS
```
Makes `--classify` patterns add to patterns from config files instead of replacing them. Patterns are now matched in the following order (first match wins):

1. CLI `--classify` patterns (highest priority)
2. Config file patterns (middle priority)
3. CLI `--classify-unmatched` patterns (lowest priority / fallback)
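The first-match-wins lookup over the three pattern sources can be sketched as follows. This is a minimal sketch with hypothetical names (`classify`, and a toy `matches` that only understands a trailing `/**/*` folder prefix), not tokei's real matcher.

```rust
// Illustrative first-match-wins classification across pattern sources,
// checked in priority order: CLI patterns, then config patterns, then
// the fallback names from --classify-unmatched.
fn classify<'a>(
    path: &str,
    cli: &'a [(&'a str, &'a str)],    // (category, pattern) from --classify
    config: &'a [(&'a str, &'a str)], // (category, pattern) from config files
    fallback: &'a [&'a str],          // names from --classify-unmatched
) -> Option<&'a str> {
    // Toy matcher: only handles patterns of the form "folder/**/*".
    fn matches(pattern: &str, path: &str) -> bool {
        pattern
            .strip_suffix("/**/*")
            .map_or(false, |prefix| path.starts_with(&format!("{prefix}/")))
    }
    cli.iter()
        .chain(config.iter())
        .find(|&&(_, pat)| matches(pat, path))
        .map(|&(cat, _)| cat)
        .or_else(|| fallback.first().copied())
}

fn main() {
    let cli = [("FUZZING", "fuzz/**/*")];
    let config = [("Tests", "tests/**/*")];
    // CLI pattern matches first; unmatched files fall through to the fallback.
    assert_eq!(
        classify("fuzz/fuzz_targets/a.rs", &cli, &config, &["PROD"]),
        Some("FUZZING")
    );
    assert_eq!(classify("src/lib.rs", &cli, &config, &["PROD"]), Some("PROD"));
    assert_eq!(classify("src/lib.rs", &cli, &config, &[]), None);
}
```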
Implement support for folder shorthand syntax where a single word like `tests` (with an optional trailing slash) expands to `tests:tests/**/*`, making it easier to classify entire folders. Examples:

- `--classify tests/` is equivalent to `tests:tests/**/*`
- `--classify benchmarks` is equivalent to `benchmarks:benchmarks/**/*`

The expansion happens in `ClassificationPattern::parse`, ensuring patterns are expanded consistently regardless of how they're created.
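The shorthand expansion above is simple enough to sketch in a few lines. `expand_shorthand` is an illustrative name for this example, not the actual function in the PR.

```rust
// Sketch of folder-shorthand expansion: a bare word (optionally with a
// trailing slash) becomes "<name>:<name>/**/*"; specs that already
// contain a ':' are passed through unchanged.
fn expand_shorthand(spec: &str) -> String {
    if spec.contains(':') {
        // Already a full Category:pattern (or Language:Category:pattern) spec.
        spec.to_string()
    } else {
        let name = spec.trim_end_matches('/');
        format!("{name}:{name}/**/*")
    }
}

fn main() {
    assert_eq!(expand_shorthand("tests/"), "tests:tests/**/*");
    assert_eq!(expand_shorthand("benchmarks"), "benchmarks:benchmarks/**/*");
    assert_eq!(expand_shorthand("Tests:**/*.spec.ts"), "Tests:**/*.spec.ts");
}
```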
Author
This should be reviewable per commit, if this is preferred. I've tried to keep each commit focused on one thing.
This PR adds a flexible user-defined glob-based file classification system to tokei, which puts some (or all) files into one of potentially many named classes; these classes are shown separately in the output as a subcategorization of language, much like the embedded languages.
It can be used to distinguish between test and non-test code (on the file level), or to include generated or vendored code in the total counts, but reported on separate lines. The user can add whatever classification they like as long as files belonging to the class can be identified by a glob pattern.
Also note that while this feature can be used to count test code separately from non-test code (and indeed started out as an effort to achieve just that), it depends on matching the file path/name as belonging to a test (or non-test) class, and thus operates on the file level. Languages like Rust, which idiomatically mix test code with non-test code in the same file, cannot be classified by this feature in a way that reports the test code counts separately from non-test code.
Care has been taken to keep the existing output, and to avoid any performance cost when classifications are not used.
Pattern syntax
The classification patterns given to the new `--classify` (`-k`) parameter support three different variants:

- `CategoryName:pattern` — a generic pattern, e.g. `Tests:*.spec.ts` classifies all `*.spec.ts` files (in any folder) as `Tests`, and `tests:tests/**/*` classifies all files in the `tests` folder (and subfolders) as `tests`
- `Language:CategoryName:pattern` — a language-specific pattern, which only applies to files of the given language
- a folder shorthand, where a single word like `tests` (with an optional trailing slash) expands to `tests:tests/**/*`

Example
Running tokei on the NgRx platform repo with `--types JavaScript,TypeScript` (and no classifications) gives us the normal tokei output. A quick peek at the NgRx code base shows they have lots of *.spec.ts and *.spec.js files, and also some code separated into spec folders. Trying to identify other big parts, we can see that the projects folder contains the code for a few web sites (including ngrx.io), and also that there are a few hundred files in schematics and schematics-core folders. And perhaps we are also content to classify all other TypeScript files as 'Prod'. We can add the following parameters to tokei.
Running tokei with these params gives us:
We can see we have 16 unclassified JavaScript files, and no unclassified TypeScript files (since we provided a fallback classification specifically for TypeScript).
Interplay with embedded languages
If we now just look at Markdown files, keeping the same classifications as above, we get a mix of embedded languages and classifications:
Embedded languages use a `|-` prefix, while the classifications are distinguishable by a `|>` prefix. Since classifications are separate files, but embeddings are not, the (Total) line adds a summary of Markdown (28 unclassified files) + Schema (2) + Sites (319) = 349 files, matching the total shown next to Markdown if classifications are not used.

Note that the `--compact` output mode removes embedded languages but keeps classifications.

Details
Note that this feature can put zero or one classification on a particular file. If a file matches several classifications, the first one wins. The priority order is:

1. `--classify` (`-k`) patterns, in the order given
2. Patterns from configuration files
3. `--classify-unmatched` (`-u`) patterns, in the order given

To temporarily ignore patterns from configuration, one can add `--no-classify` (`-K`), which removes the configuration files as a source of patterns, but still allows the user to provide their own CLI patterns and fallbacks.

When specifying `--files`, the classification name (if any) is shown in the "# of files" column, which is otherwise used for the file name.

Performance
I ran hyperfine several times with different numbers and kinds of patterns over a folder with 133K files and 29M lines of code, comparing to the tokei version from the master branch. Performance scales roughly linearly with the number of patterns, and also depends on the complexity of the patterns and on how many patterns, on average, we need to scan to find a match.
If I add 600 patterns (via a config file) of the form `**/*.suffix` (not matching any actual files), it takes twice as long as baseline tokei (on my machine, on that repo, with those patterns). Specifying 300 patterns results in 50% overhead; 60 patterns, 10% overhead.

With 600 non-matching patterns added via the configuration, and a catch-all pattern added via CLI (meaning all files match on the first pattern), we get no difference in runtime compared to baseline tokei, showing that it is the scanning through glob patterns that takes time.
I also used hyperfine to determine that globset was consistently 1.1–3 times faster than fast_glob, wax, or the ignore crate (which is already a dependency).