@masatake masatake commented Jun 8, 2025

main: using regex for choosing a parser for the given file name

This change extends the --map-<LANG> option to support regular
expression matching against the full file name.

The original --map-<LANG> option supports glob-based matching
and extension comparison against the file basename.
However, these two methods are not enough when the file names are
too generic. See #3287.

The regular expression passed to --map-<LANG> must be surrounded
by % characters, like

--map-RpmMacros='%(.*/)?macros\.d/macros\.([^/]+)$%'

To match in a case-insensitive way, append `i' after the second %, like

--map-RpmMacros='%(.*/)?macros\.d/macros\.([^/]+)$%i'

To use % as part of an expression, put \ before the % to escape it.
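
As a rough illustration of the %EXPR%[i] syntax described above, the following Python sketch splits such a spec into the expression and a case-insensitivity flag, treating \% as a literal %. The names `parse_map_regex` and `map_regex_matches` are hypothetical, and Python's `re` module stands in for whatever regex engine ctags uses, so this shows the intended behavior under those assumptions rather than the actual implementation.

```python
import re


def parse_map_regex(spec):
    """Split a '%EXPR%[i]' spec into (expr, ignore_case).

    Hypothetical helper: a backslash-escaped '%' becomes a literal '%',
    every other escape sequence is passed through unchanged.
    """
    if not spec.startswith("%"):
        raise ValueError("spec must start with '%'")
    expr = []
    i = 1
    while i < len(spec):
        ch = spec[i]
        if ch == "\\" and i + 1 < len(spec):
            nxt = spec[i + 1]
            expr.append("%" if nxt == "%" else ch + nxt)
            i += 2
        elif ch == "%":
            break  # unescaped '%' closes the expression
        else:
            expr.append(ch)
            i += 1
    else:
        raise ValueError("missing closing '%'")
    if not expr:
        raise ValueError("empty expression")
    flags = spec[i + 1:]
    if flags not in ("", "i"):
        raise ValueError("unknown flag: " + flags)
    return "".join(expr), flags == "i"


def map_regex_matches(spec, path):
    """Return True if the full path matches the regex in spec."""
    expr, icase = parse_map_regex(spec)
    return re.search(expr, path, re.IGNORECASE if icase else 0) is not None


spec = r"%(.*/)?macros\.d/macros\.([^/]+)$%"
print(map_regex_matches(spec, "/usr/lib/rpm/macros.d/macros.python"))
print(map_regex_matches(spec + "i", "/usr/lib/rpm/MACROS.D/macros.python"))
```

Matching is done against the whole path rather than the basename, which is what lets the directory component (macros.d/) take part in the decision.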

TODO:

  • reconsider name regex, rxpr, or something
  • update ctags.1
  • add Tmain test cases
  • add description to --help
  • extend optlib2c
  • add --list-map-regex
  • add --list-maps
  • add pcre backend

@masatake masatake marked this pull request as draft June 8, 2025 23:34
@masatake masatake changed the title main: using regex for choosing a paser for given file name [WIP] main: using regex for choosing a paser for given file name Jun 8, 2025
@masatake masatake force-pushed the main--rexpr branch 2 times, most recently from a5a3a28 to c50f467 Compare September 27, 2025 18:32
codecov bot commented Sep 27, 2025

Codecov Report

❌ Patch coverage is 93.08176% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.02%. Comparing base (6bce4ae) to head (ef09439).
⚠️ Report is 1 commits behind head on master.

Files with missing lines   Patch %   Lines
main/parse.c               96.01%    8 Missing ⚠️
main/rexprcode.c           78.94%    8 Missing ⚠️
main/options.c             93.22%    4 Missing ⚠️
optlib/rpmMacros.c         0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4270      +/-   ##
==========================================
+ Coverage   86.01%   86.02%   +0.01%     
==========================================
  Files         250      251       +1     
  Lines       64159    64352     +193     
==========================================
+ Hits        55187    55361     +174     
- Misses       8972     8991      +19     

@masatake masatake changed the title [WIP] main: using regex for choosing a paser for given file name [WIP] main: using regex for choosing a parser for given file name Oct 17, 2025
@masatake masatake requested a review from Copilot October 19, 2025 18:00
Copilot AI left a comment

Pull Request Overview

This PR extends the --map-<LANG> option to support regular expression matching for file names, addressing limitations where glob patterns and extension matching are insufficient for generic file names. The implementation adds a new regex-based mapping type alongside existing extension and pattern mappings.

Key Changes:

  • Introduced regex pattern support using %regex%[i] syntax for language file mappings
  • Added new rexprcode module to handle regex compilation and matching
  • Extended optlib2c to generate C code from regex mapping definitions

Reviewed Changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 7 comments.

File                                     Description
main/rexprcode.c                         New module implementing regex pattern compilation, matching, and encoding
main/rexprcode_p.h                       Public interface for regex code operations
main/parse.c                             Core integration of regex matching into language detection logic
main/parse.h                             Added rExprSrc structure definition and REXPR_LAST_ENTRY macro
main/parse_p.h                           Extended langmapType enum with LMAP_REXPR flag
main/options.c                           Command-line option parsing for regex patterns with icase flag support
optlib/rpmMacros.ctags                   Example usage replacing commented-out patterns with regex
optlib/rpmMacros.c                       Generated C code with regex mapping definitions
misc/optlib2c                            Extended Perl script to parse and generate regex mapping code
source.mak                               Build system updates for new source files
win32/ctags_vs2013.vcxproj               Visual Studio project file updates
win32/ctags_vs2013.vcxproj.filters       Visual Studio filter file updates
Tmain/list-map-rexprs.d/*                Test cases for new --list-map-rexprs option
Tmain/versioning.d/stdout-expected.txt   Updated test output expectations

misc/optlib2c Outdated
unless ($_[0]->{'langdef'} eq $1);
my $spec = $2;
if ($spec =~ /\((.*)\)/) {
if ($spec =~ /%(.+)%(i)?/) {
Copilot AI Oct 19, 2025

The regex pattern %(.+)% is greedy and will match incorrectly if the expression contains '%' characters, even with escaping. For example, %a%b%c% would capture a%b%c instead of a. The pattern should be non-greedy, /%(.+?)%(i)?/, or better yet, should properly handle escaped '%' characters in the capture group.

Suggested change
- if ($spec =~ /%(.+)%(i)?/) {
+ if ($spec =~ /%(.+?)%(i)?/) {
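
The difference between the greedy and non-greedy forms discussed in this comment can be checked with a small Python sketch. It uses Python's `re` module rather than Perl, on the assumption that the quantifier semantics are the same for these patterns. The third pattern, `ESCAPE_AWARE`, is a hypothetical variant (not taken from the PR) that treats a backslash-escaped % as part of the expression.

```python
import re

spec = "%a%b%c%"

# Greedy: .+ runs to the last '%' in the string.
greedy = re.search(r"%(.+)%(i)?", spec)
# Non-greedy: .+? stops at the first '%' it can.
lazy = re.search(r"%(.+?)%(i)?", spec)

print(greedy.group(1))  # a%b%c
print(lazy.group(1))    # a

# Hypothetical escape-aware pattern: either a backslash plus any
# character, or any character that is neither '%' nor a backslash.
ESCAPE_AWARE = re.compile(r"%((?:\\.|[^%\\])+)%(i)?")

escaped = r"%100\%%i"
print(re.search(r"%(.+?)%(i)?", escaped).groups())  # ('100\\', None)
print(ESCAPE_AWARE.search(escaped).groups())        # ('100\\%', 'i')
```

In this sketch the non-greedy form ends the expression at the escaped %, which also hides the trailing i flag, while the escape-aware form keeps \% inside the captured expression.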

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 1 comment.


misc/optlib2c Outdated
unless ($_[0]->{'langdef'} eq $1);
my $spec = $2;
if ($spec =~ /\((.*)\)/) {
if ($spec =~ /%(.+)%(i)?/) {
Copilot AI Oct 19, 2025

The regex should use a non-greedy quantifier .+? instead of .+ to prevent matching across multiple patterns when there are multiple % characters in the input. This could cause incorrect parsing of escaped % characters.

Suggested change
- if ($spec =~ /%(.+)%(i)?/) {
+ if ($spec =~ /%(.+?)%(i)?/) {

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.


Copilot AI left a comment

Pull Request Overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 3 comments.


misc/optlib2c Outdated
Comment on lines 296 to 298
if ($spec =~ /%(.+?)%(i|\{icase\})?/) {
my $rexpr = { expr => $1,
iCase => (defined $2 && ($2 eq 'i' || $2 eq 'icase'))? 1: 0 };
Copilot AI Oct 19, 2025

The regex allows {icase} as an alternative to i, but this syntax is not documented in the PR description or help text. Either document this alternative syntax or remove it to avoid confusion.

Suggested change
- if ($spec =~ /%(.+?)%(i|\{icase\})?/) {
- my $rexpr = { expr => $1,
- iCase => (defined $2 && ($2 eq 'i' || $2 eq 'icase'))? 1: 0 };
+ if ($spec =~ /%(.+?)%(i|\{icase\})?/) {
+ my $rexpr = { expr => $1,
+ iCase => (defined $2 && $2 eq 'i')? 1: 0 };
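
One detail of the quoted snippet can be checked in isolation: the second capture group matches the braces as well, so for the long form it would hold {icase} rather than icase. The Python sketch below reproduces only that grouping behavior, assuming Python's `re` groups the same way as Perl for this pattern. The function names are hypothetical and nothing here is taken from the final optlib2c code.

```python
import re

# Same shape as the pattern in the quoted snippet.
FLAG_RE = re.compile(r"%(.+?)%(i|\{icase\})?")


def icase_as_written(spec):
    """Mirror the quoted comparison: flag equals 'i' or 'icase'."""
    flag = FLAG_RE.search(spec).group(2)
    return flag is not None and flag in ("i", "icase")


def icase_with_braces(spec):
    """Compare against the text the group actually captures."""
    flag = FLAG_RE.search(spec).group(2)
    return flag in ("i", "{icase}")


print(FLAG_RE.search("%foo%{icase}").group(2))  # {icase}
print(icase_as_written("%foo%{icase}"))         # False
print(icase_with_braces("%foo%{icase}"))        # True
```

Under these assumptions, the comparison as written would only ever recognize the short i form.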


static flagDefinition langmapRexprFlagDef[] = {
{ 'i', "icase", langmap_rexpr_icase_short, langmap_rexpr_icase_long,
NULL, "applied in a case-insensitive manner"},
Copilot AI Oct 19, 2025

The long flag name is 'icase', but the optlib2c script also accepts '{icase}' syntax (line 296 in misc/optlib2c). These should be consistent, or the alternative syntax should be documented.

Suggested change
- NULL, "applied in a case-insensitive manner"},
+ NULL, "applied in a case-insensitive manner (accepts both 'icase' and '{icase}' syntax)"},

@masatake masatake force-pushed the main--rexpr branch 2 times, most recently from 231a606 to f865e33 Compare October 20, 2025 02:17
@masatake masatake requested a review from Copilot October 20, 2025 02:19
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 2 comments.


Copilot AI left a comment

Pull Request Overview

Copilot reviewed 21 out of 23 changed files in this pull request and generated no new comments.


Copilot AI left a comment

Pull Request Overview

Copilot reviewed 22 out of 24 changed files in this pull request and generated no new comments.


@masatake masatake requested a review from Copilot October 20, 2025 10:07
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 22 out of 24 changed files in this pull request and generated no new comments.


baseFilenameSansExtensionNew("a.in.in", ".in.in") could not return
"a" with the original code.

Signed-off-by: Masatake YAMATO <[email protected]>
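
The expected behavior named in the commit message above can be illustrated with a short Python sketch. The page shows neither the original C code nor the fix, so the function below is a hypothetical stand-in: it assumes the extension argument includes its leading dot, as in the ".in.in" example, and that the whole suffix is removed in one step.

```python
import posixpath


def base_filename_sans_extension(path, ext):
    """Return the basename of path with the suffix ext removed.

    Hypothetical stand-in for the C helper named in the commit message.
    The suffix is compared as a whole, so a multi-dot extension such as
    ".in.in" is stripped in one step.
    """
    base = posixpath.basename(path)
    if ext and base.endswith(ext) and len(base) > len(ext):
        return base[: -len(ext)]
    return base


print(base_filename_sans_extension("a.in.in", ".in.in"))   # a
print(base_filename_sans_extension("dir/a.in.in", ".in"))  # a.in
print(base_filename_sans_extension("a.c", ".h"))           # a.c
```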
…nsExtensionNew

Delete baseFilenameSansExtensionNew() from the source tree.

Signed-off-by: Masatake YAMATO <[email protected]>
The original code used a boolean value to choose whether a file name
was mapped to a parser by glob-like pattern or by extension.

To support a third way of mapping a file name to a parser, by regular
expression pattern, we will use an enum value instead of a boolean.

Signed-off-by: Masatake YAMATO <[email protected]>
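
The move from a boolean to an enum described in the commit message above can be sketched as follows. This is a Python illustration under stated assumptions, not the ctags code: `MapKind` and `map_matches` are hypothetical names (the review summary on this page mentions a `langmapType` enum with an `LMAP_REXPR` flag in main/parse_p.h), and `fnmatch` and `re` stand in for the matching routines ctags uses.

```python
import fnmatch
import posixpath
import re
from enum import Enum


class MapKind(Enum):
    """Hypothetical three-way replacement for a pattern/extension boolean."""
    EXTENSION = "extension"
    PATTERN = "pattern"
    REXPR = "rexpr"


def map_matches(kind, spec, path):
    """Return True if path is selected by spec under the given kind."""
    base = posixpath.basename(path)
    if kind is MapKind.EXTENSION:
        # Extension comparison uses the basename only.
        return posixpath.splitext(base)[1] == "." + spec
    if kind is MapKind.PATTERN:
        # Glob matching also uses the basename only.
        return fnmatch.fnmatchcase(base, spec)
    if kind is MapKind.REXPR:
        # Regex matching sees the full file name, directories included.
        return re.search(spec, path) is not None
    raise ValueError(kind)


rexpr = r"(.*/)?macros\.d/macros\.([^/]+)$"
print(map_matches(MapKind.PATTERN, "macros.*", "/home/u/macros.txt"))
print(map_matches(MapKind.REXPR, rexpr, "/home/u/macros.txt"))
print(map_matches(MapKind.REXPR, rexpr, "/etc/rpm/macros.d/macros.dist"))
```

With a boolean there is no room for the third case. With an enum, each kind of mapping gets its own branch, and only the regex branch looks at the directory part of the name.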
This change extends the --map-<LANG> option to support regular
expression matching against the full file name.

The original --map-<LANG> option supports glob-based matching
and extension comparison against the file basename.
However, these two methods are not enough when the file names are
too generic. See universal-ctags#3287.

The regular expression passed to --map-<LANG> must be surrounded
by % characters, like

   --map-RpmMacros='%(.*/)?macros\.d/macros\.([^/]+)$%'

If you want to match in a case-insensitive way, append `i' after
the second % like

   --map-RpmMacros='%(.*/)?macros\.d/macros\.([^/]+)$%i'

If you want to use % as part of an expression, put \ before %
for escaping.

Signed-off-by: Masatake YAMATO <[email protected]>
Signed-off-by: Masatake YAMATO <[email protected]>
@masatake
Member Author

Updating the man page is the last task.

@masatake masatake marked this pull request as ready for review October 21, 2025 08:59
@masatake masatake changed the title [WIP] main: using regex for choosing a parser for given file name main: using regex for choosing a parser for given file name Oct 21, 2025