fix(lexer): compute span ends by character count, not byte length#72
Merged
Conversation
The lexer scans the source as a `Vec<char>`, so token positions and AST spans are character indices. Span ends, however, were computed as `pos + value.len()` — adding the token value's *byte* length to a *character* position. For any token containing multibyte UTF-8 the end overshot by the byte/char delta, so `Node::source_text` returned the wrong slice for every non-final token (the final token's overshoot clamped harmlessly to the source end, which is why the existing multibyte test never caught it). Use `value.chars().count()` at the four sites that derive an end offset from a start position (`last_token_end`, the two direct `Span::new` calls for word nodes, and `Token::adjacent_to`), and correct the `Span` doc comment to state that offsets are character indices. Surfaced downstream as tokf #383: compound shell commands mixing an emoji/checkmark `echo` with a rewritten pipe were spliced into invalid shell. Adds regression tests for multibyte in non-final words, compound segments, pipelines, and astral-plane emoji. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mpecan
pushed a commit
that referenced
this pull request
Jun 20, 2026
🤖 I have created a release *beep* *boop* --- ## [0.2.1](rable-v0.2.0...rable-v0.2.1) (2026-06-20) ### Bug Fixes * **lexer:** compute span ends by character count, not byte length ([#72](#72)) ([58b2e50](58b2e50)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>
mpecan
added a commit
to mpecan/tokf
that referenced
this pull request
Jun 20, 2026
Closes #383 ## Summary Compound shell commands that mixed a multibyte UTF-8 character (`✓`, `✗`, emoji, CJK) in one segment with a chain operator (`&&`/`||`/`;`/newline) were rewritten into **invalid shell** — spurious `|`/`;` tokens were spliced at the wrong byte offsets, so the command failed to parse. This broke every `tokf hook`-driven rewrite that mixed an emoji/checkmark `echo` with a filterable pipe. ### Before / after ``` in: echo "✓ valid" || echo "✗ bad"; ls ~/.claude/skills/ | grep tokf was: echo "✓ valid" | |echo "✗ bad";tokf run --baseline-pipe 'grep tokf' ls ~/.claude/skills/ ← invalid now: echo "✓ valid" || echo "✗ bad"; tokf run --baseline-pipe 'grep tokf' ls ~/.claude/skills/ ← valid ``` ## Root cause Two layers: 1. **Upstream (rable):** AST spans are character indices (the lexer scans a `Vec<char>`), but span *ends* were computed as `pos + value.len()` — a char position plus a **byte** length. For multibyte input the end overshot, so even `Node::source_text` returned the wrong slice for non-final tokens. Fixed in **mpecan/rable#72**, released as **rable 0.2.1**. 2. **tokf:** `compound_segments`/`flatten_segments` sliced the byte-indexed source string by raw (character) span values. Now routed through a `char_to_byte` helper that mirrors rable's own conversion. The issue's suspected culprit (`rules.rs:75`, the `{rest}` slice) was a red herring — regex offsets are byte offsets by design. ## Changes - `char_to_byte` helper + the two separator-slice sites in `rewrite/bash_ast.rs`. - Bump `rable` 0.1.15 → 0.2.1 (absorbs the 0.2.0 API surface changes; tokf only uses `rable::parse` + `rable::ast`). ## Tests Three layers of regression coverage — all previously failing, now green: - **AST round-trip + exact separators** — new `bash_ast_multibyte_tests.rs` (split out to stay under the 700-line file limit). - **In-process rewrite** — `tests_compound.rs`. - **Real binary an LLM hook invokes** — `cli_rewrite.rs`. `cargo test -p tokf` (989 lib + 59 CLI), clippy, and fmt all clean. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The lexer scans the source as a
Vec<char>, so token positions and ASTSpans are character indices. But span ends were computed aspos + value.len()— adding the token value's byte length to a character position. For any token containing multibyte UTF-8, the end overshoots by the byte/char delta, soNode::source_textreturns the wrong slice for every non-final token.The existing
source_text_multibyte_utf8test never caught this because its multibyte word was the last token, where the overshoot clamps harmlessly to the source end.Example (before)
After
The fix
Use
value.chars().count()at the four sites that derive an end offset from a start position:lexer/mod.rs—last_token_end(the source of most node ends viaparser/mod.rs:134)parser/simple_command.rs— word-node spanparser/case_select.rs— pattern word spantoken.rs—Token::adjacent_to…and correct the
Spandoc comment to state offsets are character indices.Downstream impact
Surfaced as tokf #383: compound shell commands mixing an emoji/
✓/✗echowith a rewritten pipe were spliced into invalid shell (| ||, dropped;). tokf consumes these spans to split compound commands.Tests
Adds regression tests for multibyte in non-final words, compound segments, pipelines, and astral-plane emoji (
🎉). Full suite: 262 passing, clippy + fmt clean. No ASCII spans change (char count == byte len for ASCII), so no existing test moved.🤖 Generated with Claude Code