
Add Classifications (for example test / non-test code)#1304

Open
marhel wants to merge 14 commits into XAMPPRocky:master from marhel:classifications

Conversation

@marhel marhel commented Dec 12, 2025

This PR adds a flexible, user-defined, glob-based file classification system to tokei. It puts some (or all) files into one of potentially many named classes, which are shown separately in the output as a subcategorization of the language, much like embedded languages.

It can be used to distinguish between test and non-test code (on the file level), or to include generated or vendored code in the total counts, but reported on separate lines. The user can add whatever classification they like as long as files belonging to the class can be identified by a glob pattern.

Also note that while this feature can be used to count test code separately from non-test code (and indeed it started out as an effort to achieve just that), it depends on matching the file path/name against a test (or non-test) class, and thus operates at the file level. Languages like Rust, which idiomatically mix test code with non-test code in the same file, cannot be classified by this feature in a way that reports test code counts separately from non-test code.

Care has been taken to keep the existing output, and to avoid any performance cost when classifications are not used.

Pattern syntax

The classification patterns given to the new --classify (-k) parameter support three different variants:

| Syntax | Example | Meaning |
| --- | --- | --- |
| `Class:Glob` | `Tests:**/*.spec.ts` | Classify all `*.spec.ts` files (in any folder) as Tests |
| `Foldername` | `tests/` | Shorthand for `tests:tests/**/*`; classify all files in the `tests` folder (and subfolders) as tests |
| `Language:Class:Glob` | `C#:Api:**/*.API/**` | Classify C# language files in folders having an `.API` suffix as Api |
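The three variants can be told apart by the number of colon-separated segments. A minimal sketch of such a parser (names like `Pattern` and `parse_pattern` are illustrative, not the PR's actual implementation):

```rust
/// Illustrative stand-in for a parsed --classify argument.
#[derive(Debug, PartialEq)]
struct Pattern {
    language: Option<String>, // only set for Language:Class:Glob
    class: String,
    glob: String,
}

fn parse_pattern(input: &str) -> Pattern {
    // splitn(3, ':') yields 1, 2, or 3 segments; a glob in the last
    // segment is never split further.
    let parts: Vec<&str> = input.splitn(3, ':').collect();
    match parts.as_slice() {
        // Foldername shorthand: "tests/" expands to tests:tests/**/*
        [folder] => {
            let name = folder.trim_end_matches('/');
            Pattern {
                language: None,
                class: name.to_string(),
                glob: format!("{name}/**/*"),
            }
        }
        // Class:Glob
        [class, glob] => Pattern {
            language: None,
            class: class.to_string(),
            glob: glob.to_string(),
        },
        // Language:Class:Glob
        [lang, class, glob] => Pattern {
            language: Some(lang.to_string()),
            class: class.to_string(),
            glob: glob.to_string(),
        },
        _ => unreachable!("splitn(3) yields at most 3 parts"),
    }
}

fn main() {
    assert_eq!(parse_pattern("tests/").glob, "tests/**/*");
    assert_eq!(parse_pattern("Tests:**/*.spec.ts").class, "Tests");
    let p = parse_pattern("C#:Api:**/*.API/**");
    assert_eq!(p.language.as_deref(), Some("C#"));
}
```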

Example

Running tokei on the NgRx platform repo with --types JavaScript,TypeScript (and no classifications) gives us the normal tokei output:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 TypeScript             1415       163862       131627        13769        18466
 JavaScript              262        14540        11792         1197         1551
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                  1677       178402       143419        14966        20017
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

A quick peek at the NgRx code base shows they have lots of *.spec.ts and *.spec.js files, and also some code separated into spec folders. Looking for other big parts, we can see that the projects folder contains the code for a few web sites (including ngrx.io), and that there are a few hundred files in schematics and schematics-core folders. Perhaps we are also content to classify all remaining TypeScript files as 'Prod'. We can add the following parameters to tokei:

| Param | Pattern | Matches |
| --- | --- | --- |
| `--classify` | `Tests:**/*.spec.*` | the spec files |
| `--classify` | `Tests:**/spec/**` | anything in `spec` folders |
| `--classify` | `Sites:projects/**` | the web site folders |
| `--classify` | `Schema:**/schematics*/**` | anything in a folder with a `schematics` prefix |
| `--classify-unmatched` | `TypeScript:Prod` | all TypeScript files not matched by any other classification |

Running tokei with these params gives us:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 JavaScript               16          781          748            5           28
 |> Schema                20          731          705            0           26
 |> Sites                159         7511         5588         1166          757
 |> Tests                 67         5517         4751           26          740
 (Total)                 262        14540        11792         1197         1551
─────────────────────────────────────────────────────────────────────────────────
 TypeScript
 |> Prod                 368        33143        24794         5411         2938
 |> Schema               268        40078        30018         5688         4372
 |> Sites                414        19708        16029         1737         1942
 |> Tests                365        70933        60786          933         9214
 (Total)                1415       163862       131627        13769        18466
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                  1677       178402       143419        14966        20017
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

We can see we have 16 unclassified JavaScript files, and no unclassified TypeScript files (since we provided a fallback classification specifically for TypeScript).

Interplay with embedded languages

If we now just look at Markdown files, keeping the same classifications as above, we get a mix of embedded languages and classifications:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Markdown                 28         4177            0         2752         1425
 |- BASH                   4           19           12            4            3
 |- HTML                  13          157          133            0           24
 |- JSON                  22          165          165            0            0
 |- JavaScript             4           18           16            2            0
 |- Shell                  1            6            6            0            0
 |- TypeScript            22          523          469           26           28
 |> Schema                 2            8            0            5            3
 |> Sites                319        42141          779        29622        11740
 (Total)                 349        46348          801        32379        13168
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                   349        46348          801        32379        13168
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Embedded languages use a |- prefix, while classifications are distinguishable by a |> prefix. Since classified files are separate files, but embedded languages are counted within their host files, the (Total) line sums Markdown (28 unclassified files) + Schema (2) + Sites (319) = 349 files, matching the total shown next to Markdown when classifications are not used.

Note that the --compact output mode removes embedded languages but keeps classifications.

Details

Note that this feature can put zero or one classification on a particular file. If a file matches several classifications, the first one wins. The priority order is:

  1. CLI patterns --classify (-k), in the order given
  2. Config patterns (from classifications = [...] in tokei.toml/.tokeirc), in the order given
  3. CLI fallback patterns --classify-unmatched (-u), in the order given
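With the patterns concatenated in that priority order, the lookup itself reduces to a linear scan where the first match wins. A minimal sketch (real matching uses compiled globs; a boolean closure stands in for them here, and all names are illustrative):

```rust
/// Return the class name of the first pattern matching `path`, if any.
/// `patterns` is assumed to already be sorted in priority order
/// (CLI patterns, then config patterns, then fallbacks).
fn classify<'a>(
    path: &str,
    patterns: &'a [(String, Box<dyn Fn(&str) -> bool>)],
) -> Option<&'a str> {
    patterns
        .iter()
        .find(|(_, matches)| matches(path))
        .map(|(class, _)| class.as_str())
}

fn main() {
    let patterns: Vec<(String, Box<dyn Fn(&str) -> bool>)> = vec![
        // Stand-in for "Tests:**/*.spec.ts"
        ("Tests".into(), Box::new(|p: &str| p.ends_with(".spec.ts"))),
        // Stand-in for a --classify-unmatched fallback: matches everything
        ("Prod".into(), Box::new(|_: &str| true)),
    ];
    assert_eq!(classify("app/foo.spec.ts", &patterns), Some("Tests"));
    assert_eq!(classify("app/foo.ts", &patterns), Some("Prod"));
}
```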

To temporarily ignore patterns from configuration, one can add --no-classify (-K), which removes the configuration files as a source of patterns, but still allows the user to provide their own CLI patterns and fallbacks.
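For reference, a config-file entry could look like this. The `classifications` key is named in this PR; the exact value format is an assumption, mirrored from the CLI pattern syntax:

```toml
# tokei.toml — hypothetical example; the value format is assumed
# to mirror the --classify pattern syntax shown above.
classifications = [
    "Tests:**/*.spec.*",
    "Sites:projects/**",
]
```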

When specifying --files, the classification name (if any) is shown in the # of files column, which is otherwise used for the file name.

Performance

I ran hyperfine several times with different numbers and kinds of patterns over a folder with 133K files and 29M lines of code, comparing against the tokei version from the master branch. Performance is roughly linear in the number of patterns, and also depends on the complexity of the patterns and on how many patterns we need to scan, on average, to find a match.

If I add 600 patterns (via a config file) of the form **/*.suffix (matching no actual files), it takes twice as long as baseline tokei (on my machine, on that repo, with those patterns). 300 patterns result in about 50% overhead; 60 patterns, about 10%.

With 600 non-matching patterns added via the configuration, and a catch-all pattern added via the CLI (meaning all files match on the first pattern), we get no difference in runtime compared to baseline tokei, showing that it is the scanning through glob patterns that takes time.

I also used hyperfine to determine that globset was consistently 1.1 to 3 times faster than fast_glob, wax, or the ignore crate (which is already a dependency).

```
thread 'main' (739870) panicked at src/cli.rs:323:22:
Mismatch between definition and access of `file_input`. Could not downcast to &str, need to downcast to alloc::string::String

thread 'main' (740008) panicked at src/cli.rs:292:14:
Mismatch between definition and access of `streaming`. Could not downcast to &str, need to downcast to alloc::string::String
```
Change children map from BTreeMap<LanguageType, Vec<Report>> to
BTreeMap<String, Vec<Report>> to allow adding child reports under
arbitrary names.

This is a pure refactoring with no behavior changes.
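The effect of the type change can be illustrated with a minimal stand-in (`Report` simplified to a unit struct; the real one carries line statistics):

```rust
use std::collections::BTreeMap;

// Simplified stand-in for tokei's Report.
#[derive(Debug, Default)]
struct Report;

fn main() {
    // Before: keys were restricted to known language types (an enum).
    // After: String keys can hold either a language name ("Markdown")
    // or an arbitrary classification name ("Tests").
    let mut children: BTreeMap<String, Vec<Report>> = BTreeMap::new();
    children.entry("Markdown".to_string()).or_default().push(Report);
    children.entry("Tests".to_string()).or_default().push(Report);
    assert_eq!(children.len(), 2);
}
```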
Adds --classify (-k) flag to specify file classification patterns.
Supports both language-specific patterns using the format
Language:CategoryName:pattern and generic patterns using the format
CategoryName:pattern

Also refactors cli.rs to simplify tests of CLI parsing
- Extract build_command() and from_matches() helper methods
- Add from_args_with() test helper
- Add classifications field to Config
- Support reading classifications from config files
  (tokei.toml/.tokeirc)
- Add CLI override for classifications (CLI takes precedence over config
  files)
- Update Config::from_config_files to merge classifications from all
  sources
This field will be used to tag files with an optional classification
during processing in order to be able to separate them in the output
statistics.
Also adds classify_file that provides the classification, if any, of a
path given a language type and classification patterns.
Classification patterns are matched relative to the base paths.
When listing files, include the classification, if any, in the # of
files column, which is otherwise unused for individual files.

Example (a subset of the output of running tokei on itself):
$ tokei --classify FUZZING:fuzz/**/* --files --compact

────────────────────────────────────────────────────────────────────────────────────────
 Markdown                         5         1735            0         1403          332
────────────────────────────────────────────────────────────────────────────────────────
 ./CHANGELOG.md                              898            0          710          188
 ./CODE_OF_CONDUCT.md                         46            0           28           18
 ./CONTRIBUTING.md                           164            0          124           40
 ./README.md                                 597            0          521           76
 ./fuzz/README.md          FUZZING            30            0           20           10
────────────────────────────────────────────────────────────────────────────────────────
 Rust                            25         5132         4286          182          664
────────────────────────────────────────────────────────────────────────────────────────
 ./build.rs                                  166          137            1           28
 |gets/parse_from_slice.rs FUZZING            52           40            6            6
 |arse_from_slice_panic.rs FUZZING             9            7            0            2
 |arse_from_slice_total.rs FUZZING             9            7            0            2
 ./src/classification.rs                     180          143           10           27
 ./src/cli.rs                                643          567           20           56
Adds --no-classify (-K) flag to ignore classifications from config
files.

This way, the user can have sensible default patterns in a home folder
config file, but ignore them if those patterns aren't working well with
the code in a particular folder.
This prevents the file name (the last line) from overflowing into the
line count column.

Example:

-- ./src/language/language_type.tera.rs ------------------------------------------------
 |- Rust                                     358          314            5           39
 |- Markdown                                 112            0           95           17
 |/language/language_type.tera.rs            358          314            5           39
When classifications are not used, output remains unchanged.

When classifications are enabled (via --classify or read from config),
classified files are separated and displayed with their classification
names in the terminal output right after embedded languages, using the
prefix |> for classifications to differentiate them from embedded
languages, which still use the prefix |-

The summary statistics will show:
- Unclassified file stats (from regular reports)
- Embedded language stats (from all files, both classified and
  unclassified)
- Classified file stats
- The (Total) summary, which will include a file count of unclassified +
  classified files.

Example:
$ tokei --classify 'FUZZING:fuzz/**/*' --types Rust,Toml
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Rust                     22         5682         4673          256          753
 |- Markdown              15          427            5          366           56
 |> FUZZING                3           70           54            6           10
 (Total)                  25         6179         4732          628          819
─────────────────────────────────────────────────────────────────────────────────
 TOML                      2          101           87            4           10
 |> FUZZING                1           33           25            1            7
 (Total)                   3          134          112            5           17
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                    28         6313         4844          633          836
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Adds --classify-unmatched (-u) flag to specify a classification
fallback. Creates a (low-priority) catch-all pattern to match all files
not matched by other classifications.

Supports both global and language-specific fallbacks: PROD or C#:PROD.
Note that only a classification name is specified, without a pattern.

Examples:
  tokei --classify-unmatched PROD
  tokei --classify "Tests:**/*_test.js" --classify-unmatched "C#:PROD" --classify-unmatched UTILS
Makes --classify patterns add to patterns from config file instead of
replacing them.

Patterns are now matched in the following order (first match wins):
1. CLI --classify patterns (highest priority)
2. Config file patterns (middle priority)
3. CLI --classify-unmatched patterns (lowest priority / fallback)
Implement support for folder shorthand syntax where a single word like
"tests" (with an optional trailing slash) expands to "tests:tests/**/*",
making it easier to classify entire folders.

Examples:
  --classify tests/      is equivalent to  tests:tests/**/*
  --classify benchmarks  is equivalent to  benchmarks:benchmarks/**/*

The expansion happens in ClassificationPattern::parse, ensuring
patterns are expanded consistently regardless of how they're created.
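The expansion described above can be sketched in a few lines (the function name is illustrative; the PR performs this inside ClassificationPattern::parse):

```rust
/// Expand the folder shorthand: a bare word, with an optional trailing
/// slash, becomes "name:name/**/*". Anything already containing a ':'
/// or a '*' wildcard is assumed to be a full pattern and left untouched.
fn expand_shorthand(arg: &str) -> String {
    if arg.contains(':') || arg.contains('*') {
        return arg.to_string();
    }
    let name = arg.trim_end_matches('/');
    format!("{name}:{name}/**/*")
}

fn main() {
    assert_eq!(expand_shorthand("tests/"), "tests:tests/**/*");
    assert_eq!(expand_shorthand("benchmarks"), "benchmarks:benchmarks/**/*");
    // Full patterns pass through unchanged.
    assert_eq!(expand_shorthand("Tests:**/*.spec.ts"), "Tests:**/*.spec.ts");
}
```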

marhel commented Dec 12, 2025

This should be reviewable per commit, if preferred; I've tried to keep each commit focused on one thing.
