
Add Classifications (for example test / non-test code)#1304

Open
marhel wants to merge 14 commits into XAMPPRocky:master from marhel:classifications

Conversation

@marhel marhel commented Dec 12, 2025

This PR adds a flexible, user-defined, glob-based file classification system to tokei. It puts some (or all) files into one of potentially many named classes, which are shown separately in the output as a subcategorization of the language, much like embedded languages.

It can be used to distinguish between test and non-test code (on the file level), or to include generated or vendored code in the total counts, but reported on separate lines. The user can add whatever classification they like as long as files belonging to the class can be identified by a glob pattern.

Also note that while this feature can be used to count test code separately from non-test code (and indeed it started out as an effort to achieve just that), it depends on matching the file path/name against a test (or non-test) class, and thus operates at the file level. Languages like Rust, which idiomatically mix test code with non-test code in the same file, cannot be classified by this feature in a way that reports test code counts separately from non-test code.

Care has been taken to keep the existing output, and to avoid any performance cost when classifications are not used.

Pattern syntax

The classification patterns given to the new --classify (-k) parameter support three different variants:

| Syntax | Example | Meaning |
| --- | --- | --- |
| `Class:Glob` | `Tests:**/*.spec.ts` | Classify all `*.spec.ts` files (in any folder) as Tests |
| `Foldername` | `tests/` | Shorthand for `tests:tests/**/*`; classify all files in the `tests` folder (and subfolders) as tests |
| `Language:Class:Glob` | `C#:Api:**/*.API/**` | Classify C# language files in folders having an `.API` suffix as Api |
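The three variants can be told apart by the number of colon-separated segments. A minimal sketch of such a parser (names like `Pattern` and `parse_pattern` are illustrative, not the PR's actual implementation):

```rust
/// Illustrative stand-in for a parsed --classify argument.
#[derive(Debug, PartialEq)]
struct Pattern {
    language: Option<String>, // only set for Language:Class:Glob
    class: String,
    glob: String,
}

fn parse_pattern(input: &str) -> Pattern {
    // splitn(3, ':') yields 1, 2, or 3 segments; a glob in the last
    // segment is never split further.
    let parts: Vec<&str> = input.splitn(3, ':').collect();
    match parts.as_slice() {
        // Foldername shorthand: "tests/" expands to tests:tests/**/*
        [folder] => {
            let name = folder.trim_end_matches('/');
            Pattern {
                language: None,
                class: name.to_string(),
                glob: format!("{name}/**/*"),
            }
        }
        // Class:Glob
        [class, glob] => Pattern {
            language: None,
            class: class.to_string(),
            glob: glob.to_string(),
        },
        // Language:Class:Glob
        [lang, class, glob] => Pattern {
            language: Some(lang.to_string()),
            class: class.to_string(),
            glob: glob.to_string(),
        },
        _ => unreachable!("splitn(3) yields at most 3 parts"),
    }
}

fn main() {
    assert_eq!(parse_pattern("tests/").glob, "tests/**/*");
    assert_eq!(parse_pattern("Tests:**/*.spec.ts").class, "Tests");
    let p = parse_pattern("C#:Api:**/*.API/**");
    assert_eq!(p.language.as_deref(), Some("C#"));
}
```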

Example

Running tokei on the NgRx platform repo with --types JavaScript,TypeScript (and no classifications) gives us the normal tokei output:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 TypeScript             1415       163862       131627        13769        18466
 JavaScript              262        14540        11792         1197         1551
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                  1677       178402       143419        14966        20017
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

A quick peek at the NgRx code base shows they have lots of *.spec.ts and *.spec.js files, and also some code separated into spec folders. Looking for other big parts, we can see that the projects folder contains the code for a few web sites (including ngrx.io), and that there are a few hundred files in schematics and schematics-core folders. Perhaps we are also content to classify all remaining TypeScript files as 'Prod'. We can add the following parameters to tokei:

| Param | Pattern | Matches |
| --- | --- | --- |
| `--classify` | `Tests:**/*.spec.*` | the spec files |
| `--classify` | `Tests:**/spec/**` | anything in `spec` folders |
| `--classify` | `Sites:projects/**` | the web site folders |
| `--classify` | `Schema:**/schematics*/**` | anything in a folder with a `schematics` prefix |
| `--classify-unmatched` | `TypeScript:Prod` | all TypeScript files not matched by any other classification |

Running tokei with these params gives us:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 JavaScript               16          781          748            5           28
 |> Schema                20          731          705            0           26
 |> Sites                159         7511         5588         1166          757
 |> Tests                 67         5517         4751           26          740
 (Total)                 262        14540        11792         1197         1551
─────────────────────────────────────────────────────────────────────────────────
 TypeScript
 |> Prod                 368        33143        24794         5411         2938
 |> Schema               268        40078        30018         5688         4372
 |> Sites                414        19708        16029         1737         1942
 |> Tests                365        70933        60786          933         9214
 (Total)                1415       163862       131627        13769        18466
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                  1677       178402       143419        14966        20017
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

We can see we have 16 unclassified JavaScript files, and no unclassified TypeScript files (since we provided a fallback classification specifically for TypeScript).

Interplay with embedded languages

If we now just look at Markdown files, keeping the same classifications as above, we get a mix of embedded languages and classifications:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Markdown                 28         4177            0         2752         1425
 |- BASH                   4           19           12            4            3
 |- HTML                  13          157          133            0           24
 |- JSON                  22          165          165            0            0
 |- JavaScript             4           18           16            2            0
 |- Shell                  1            6            6            0            0
 |- TypeScript            22          523          469           26           28
 |> Schema                 2            8            0            5            3
 |> Sites                319        42141          779        29622        11740
 (Total)                 349        46348          801        32379        13168
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                   349        46348          801        32379        13168
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Embedded languages use a |- prefix, while classifications are distinguishable by a |> prefix. Since classified files are separate files, but embedded languages are counted within their host files, the (Total) line sums Markdown (28 unclassified files) + Schema (2) + Sites (319) = 349 files, matching the total shown next to Markdown when classifications are not used.

Note that the --compact output mode removes embedded languages but keeps classifications.

Details

Note that this feature can put zero or one classification on a particular file. If a file matches several classifications, the first one wins. The priority order is:

  1. CLI patterns --classify (-k), in the order given
  2. Config patterns (from classifications = [...] in tokei.toml/.tokeirc), in the order given
  3. CLI fallback patterns --classify-unmatched (-u), in the order given
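With the patterns concatenated in that priority order, the lookup itself reduces to a linear scan where the first match wins. A minimal sketch (real matching uses compiled globs; a boolean closure stands in for them here, and all names are illustrative):

```rust
/// Return the class name of the first pattern matching `path`, if any.
/// `patterns` is assumed to already be sorted in priority order
/// (CLI patterns, then config patterns, then fallbacks).
fn classify<'a>(
    path: &str,
    patterns: &'a [(String, Box<dyn Fn(&str) -> bool>)],
) -> Option<&'a str> {
    patterns
        .iter()
        .find(|(_, matches)| matches(path))
        .map(|(class, _)| class.as_str())
}

fn main() {
    let patterns: Vec<(String, Box<dyn Fn(&str) -> bool>)> = vec![
        // Stand-in for "Tests:**/*.spec.ts"
        ("Tests".into(), Box::new(|p: &str| p.ends_with(".spec.ts"))),
        // Stand-in for a --classify-unmatched fallback: matches everything
        ("Prod".into(), Box::new(|_: &str| true)),
    ];
    assert_eq!(classify("app/foo.spec.ts", &patterns), Some("Tests"));
    assert_eq!(classify("app/foo.ts", &patterns), Some("Prod"));
}
```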

To temporarily ignore patterns from configuration, one can add --no-classify (-K), which removes the configuration files as a source of patterns, but still allows the user to provide their own CLI patterns and fallbacks.
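For reference, a config-file entry could look like this. The `classifications` key is named in this PR; the exact value format is an assumption, mirrored from the CLI pattern syntax:

```toml
# tokei.toml — hypothetical example; the value format is assumed
# to mirror the --classify pattern syntax shown above.
classifications = [
    "Tests:**/*.spec.*",
    "Sites:projects/**",
]
```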

When specifying --files, the classification name (if any) is shown in the # of files column, which is otherwise used for the file name.

Performance

I ran hyperfine several times with different numbers and kinds of patterns over a folder with 133K files and 29M lines of code, comparing against the tokei version from the master branch. Performance is roughly linear in the number of patterns, and also depends on the complexity of the patterns and on how many patterns we need to scan, on average, to find a match.

If I add 600 patterns (via a config file) of the form **/*.suffix (matching no actual files), it takes twice as long as baseline tokei (on my machine, on that repo, with those patterns). 300 patterns result in about 50% overhead; 60 patterns, about 10%.

With 600 non-matching patterns added via the configuration, and a catch-all pattern added via the CLI (meaning all files match on the first pattern), we get no difference in runtime compared to baseline tokei, showing that it is the scanning through glob patterns that takes time.

I also used hyperfine to determine that globset was consistently 1.1 to 3 times faster than fast_glob, wax, or the ignore crate (which is already a dependency).

```
thread 'main' (739870) panicked at src/cli.rs:323:22:
Mismatch between definition and access of `file_input`. Could not downcast to &str, need to downcast to alloc::string::String

thread 'main' (740008) panicked at src/cli.rs:292:14:
Mismatch between definition and access of `streaming`. Could not downcast to &str, need to downcast to alloc::string::String
```
Change children map from BTreeMap<LanguageType, Vec<Report>> to
BTreeMap<String, Vec<Report>> to allow adding child reports under
arbitrary names.

This is a pure refactoring with no behavior changes.
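The effect of the type change can be illustrated with a minimal stand-in (`Report` simplified to a unit struct; the real one carries line statistics):

```rust
use std::collections::BTreeMap;

// Simplified stand-in for tokei's Report.
#[derive(Debug, Default)]
struct Report;

fn main() {
    // Before: keys were restricted to known language types (an enum).
    // After: String keys can hold either a language name ("Markdown")
    // or an arbitrary classification name ("Tests").
    let mut children: BTreeMap<String, Vec<Report>> = BTreeMap::new();
    children.entry("Markdown".to_string()).or_default().push(Report);
    children.entry("Tests".to_string()).or_default().push(Report);
    assert_eq!(children.len(), 2);
}
```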
Adds --classify (-k) flag to specify file classification patterns.
Supports both language-specific patterns using the format
Language:CategoryName:pattern and generic patterns using the format
CategoryName:pattern

Also refactors cli.rs to simplify tests of CLI parsing
- Extract build_command() and from_matches() helper methods
- Add from_args_with() test helper
- Add classifications field to Config
- Support reading classifications from config files
  (tokei.toml/.tokeirc)
- Add CLI override for classifications (CLI takes precedence over config
  files)
- Update Config::from_config_files to merge classifications from all
  sources
This field will be used to tag files with an optional classification
during processing in order to be able to separate them in the output
statistics.
Also adds classify_file that provides the classification, if any, of a
path given a language type and classification patterns.
Classification patterns are matched relative to the base paths.
When listing files, include the classification, if any, in the # of
files column, which is otherwise unused for individual files.

Example (a subset of the output of running tokei on itself):
$ tokei --classify FUZZING:fuzz/**/* --files --compact

────────────────────────────────────────────────────────────────────────────────────────
 Markdown                         5         1735            0         1403          332
────────────────────────────────────────────────────────────────────────────────────────
 ./CHANGELOG.md                              898            0          710          188
 ./CODE_OF_CONDUCT.md                         46            0           28           18
 ./CONTRIBUTING.md                           164            0          124           40
 ./README.md                                 597            0          521           76
 ./fuzz/README.md          FUZZING            30            0           20           10
────────────────────────────────────────────────────────────────────────────────────────
 Rust                            25         5132         4286          182          664
────────────────────────────────────────────────────────────────────────────────────────
 ./build.rs                                  166          137            1           28
 |gets/parse_from_slice.rs FUZZING            52           40            6            6
 |arse_from_slice_panic.rs FUZZING             9            7            0            2
 |arse_from_slice_total.rs FUZZING             9            7            0            2
 ./src/classification.rs                     180          143           10           27
 ./src/cli.rs                                643          567           20           56
Adds --no-classify (-K) flag to ignore classifications from config
files.

This way, the user can have sensible default patterns in a home folder
config file, but ignore them if those patterns aren't working well with
the code in a particular folder.
This prevents the file name (the last line) from overflowing into the
line count column.

Example:

-- ./src/language/language_type.tera.rs ------------------------------------------------
 |- Rust                                     358          314            5           39
 |- Markdown                                 112            0           95           17
 |/language/language_type.tera.rs            358          314            5           39
When classifications are not used, output remains unchanged.

When classifications are enabled (via --classify or read from config),
classified files are separated and displayed with their classification
names in the terminal output right after embedded languages, using the
prefix |> for classifications to differentiate them from embedded
languages, which still use the prefix |-

The summary statistics will show:
- Unclassified file stats (from regular reports)
- Embedded language stats (from all files, both classified and
  unclassified)
- Classified file stats
- The (Total) summary, which will include a file count of unclassified +
  classified files.

Example:
$ tokei --classify 'FUZZING:fuzz/**/*' --types Rust,Toml
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Rust                     22         5682         4673          256          753
 |- Markdown              15          427            5          366           56
 |> FUZZING                3           70           54            6           10
 (Total)                  25         6179         4732          628          819
─────────────────────────────────────────────────────────────────────────────────
 TOML                      2          101           87            4           10
 |> FUZZING                1           33           25            1            7
 (Total)                   3          134          112            5           17
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                    28         6313         4844          633          836
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Adds --classify-unmatched (-u) flag to specify a classification
fallback. Creates a (low-priority) catch-all pattern to match all files
not matched by other classifications.

Supports both global and language-specific fallbacks: PROD or C#:PROD.
Note that only a classification name is specified, without a pattern.

Examples:
  tokei --classify-unmatched PROD
  tokei --classify "Tests:**/*_test.js" --classify-unmatched "C#:PROD" --classify-unmatched UTILS
Makes --classify patterns add to patterns from config file instead of
replacing them.

Patterns are now matched in the following order (first match wins):
1. CLI --classify patterns (highest priority)
2. Config file patterns (middle priority)
3. CLI --classify-unmatched patterns (lowest priority / fallback)
Implement support for folder shorthand syntax where a single word like
"tests" (with an optional trailing slash) expands to "tests:tests/**/*",
making it easier to classify entire folders.

Examples:
  --classify tests/      is equivalent to  tests:tests/**/*
  --classify benchmarks  is equivalent to  benchmarks:benchmarks/**/*

The expansion happens in ClassificationPattern::parse, ensuring
patterns are expanded consistently regardless of how they're created.
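The expansion described above can be sketched in a few lines (the function name is illustrative; the PR performs this inside ClassificationPattern::parse):

```rust
/// Expand the folder shorthand: a bare word, with an optional trailing
/// slash, becomes "name:name/**/*". Anything already containing a ':'
/// or a '*' wildcard is assumed to be a full pattern and left untouched.
fn expand_shorthand(arg: &str) -> String {
    if arg.contains(':') || arg.contains('*') {
        return arg.to_string();
    }
    let name = arg.trim_end_matches('/');
    format!("{name}:{name}/**/*")
}

fn main() {
    assert_eq!(expand_shorthand("tests/"), "tests:tests/**/*");
    assert_eq!(expand_shorthand("benchmarks"), "benchmarks:benchmarks/**/*");
    // Full patterns pass through unchanged.
    assert_eq!(expand_shorthand("Tests:**/*.spec.ts"), "Tests:**/*.spec.ts");
}
```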

marhel commented Dec 12, 2025

This should be reviewable per commit, if preferred; I've tried to keep each commit focused on one thing.
