fix(lexer): compute span ends by character count, not byte length by mpecan · Pull Request #72 · mpecan/rable

mpecan · 2026-06-20T08:06:45Z

Summary

The lexer scans the source as a Vec<char>, so token positions and AST Spans are character indices. But span ends were computed as pos + value.len() — adding the token value's byte length to a character position. For any token containing multibyte UTF-8, the end overshoots by the byte/char delta, so Node::source_text returns the wrong slice for every non-final token.

The existing source_text_multibyte_utf8 test never caught this because its multibyte word was the last token, where the overshoot clamps harmlessly to the source end.

Example (before)

echo "✓ valid" || echo "✗ bad"
  Command 0..16  source_text = `echo "✓ valid" |`   ← wrong (bleeds into the `||`)

After

  Command 0..14  source_text = `echo "✓ valid"`     ← correct

The fix

Use value.chars().count() at the four sites that derive an end offset from a start position:

lexer/mod.rs — last_token_end (the source of most node ends via parser/mod.rs:134)
parser/simple_command.rs — word-node span
parser/case_select.rs — pattern word span
token.rs — Token::adjacent_to

…and correct the Span doc comment to state offsets are character indices.

Downstream impact

Surfaced as tokf #383: compound shell commands mixing an emoji/✓/✗ echo with a rewritten pipe were spliced into invalid shell (| ||, dropped ; ). tokf consumes these spans to split compound commands.

Tests

Adds regression tests for multibyte in non-final words, compound segments, pipelines, and astral-plane emoji (🎉). Full suite: 262 passing, clippy + fmt clean. No ASCII spans change (char count == byte len for ASCII), so no existing test moved.

🤖 Generated with Claude Code

The lexer scans the source as a `Vec<char>`, so token positions and AST spans are character indices. Span ends, however, were computed as `pos + value.len()` — adding the token value's *byte* length to a *character* position. For any token containing multibyte UTF-8 the end overshot by the byte/char delta, so `Node::source_text` returned the wrong slice for every non-final token (the final token's overshoot clamped harmlessly to the source end, which is why the existing multibyte test never caught it). Use `value.chars().count()` at the four sites that derive an end offset from a start position (`last_token_end`, the two direct `Span::new` calls for word nodes, and `Token::adjacent_to`), and correct the `Span` doc comment to state that offsets are character indices. Surfaced downstream as tokf #383: compound shell commands mixing an emoji/checkmark `echo` with a rewritten pipe were spliced into invalid shell. Adds regression tests for multibyte in non-final words, compound segments, pipelines, and astral-plane emoji. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

🤖 I have created a release *beep* *boop* --- ## [0.2.1](rable-v0.2.0...rable-v0.2.1) (2026-06-20) ### Bug Fixes * **lexer:** compute span ends by character count, not byte length ([#72](#72)) ([58b2e50](58b2e50)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>

Closes #383 ## Summary Compound shell commands that mixed a multibyte UTF-8 character (`✓`, `✗`, emoji, CJK) in one segment with a chain operator (`&&`/`||`/`;`/newline) were rewritten into **invalid shell** — spurious `|`/`;` tokens were spliced at the wrong byte offsets, so the command failed to parse. This broke every `tokf hook`-driven rewrite that mixed an emoji/checkmark `echo` with a filterable pipe. ### Before / after ``` in: echo "✓ valid" || echo "✗ bad"; ls ~/.claude/skills/ | grep tokf was: echo "✓ valid" | |echo "✗ bad";tokf run --baseline-pipe 'grep tokf' ls ~/.claude/skills/ ← invalid now: echo "✓ valid" || echo "✗ bad"; tokf run --baseline-pipe 'grep tokf' ls ~/.claude/skills/ ← valid ``` ## Root cause Two layers: 1. **Upstream (rable):** AST spans are character indices (the lexer scans a `Vec<char>`), but span *ends* were computed as `pos + value.len()` — a char position plus a **byte** length. For multibyte input the end overshot, so even `Node::source_text` returned the wrong slice for non-final tokens. Fixed in **mpecan/rable#72**, released as **rable 0.2.1**. 2. **tokf:** `compound_segments`/`flatten_segments` sliced the byte-indexed source string by raw (character) span values. Now routed through a `char_to_byte` helper that mirrors rable's own conversion. The issue's suspected culprit (`rules.rs:75`, the `{rest}` slice) was a red herring — regex offsets are byte offsets by design. ## Changes - `char_to_byte` helper + the two separator-slice sites in `rewrite/bash_ast.rs`. - Bump `rable` 0.1.15 → 0.2.1 (absorbs the 0.2.0 API surface changes; tokf only uses `rable::parse` + `rable::ast`). ## Tests Three layers of regression coverage — all previously failing, now green: - **AST round-trip + exact separators** — new `bash_ast_multibyte_tests.rs` (split out to stay under the 700-line file limit). - **In-process rewrite** — `tests_compound.rs`. - **Real binary an LLM hook invokes** — `cli_rewrite.rs`. `cargo test -p tokf` (989 lib + 59 CLI), clippy, and fmt all clean. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mpecan merged commit 58b2e50 into main Jun 20, 2026
5 checks passed

mpecan deleted the fix/multibyte-span-offsets branch June 20, 2026 08:08

repository-butler Bot mentioned this pull request Jun 20, 2026

chore(main): release rable 0.2.1 #73

Merged

mpecan mentioned this pull request Jun 20, 2026

fix(rewrite): correct multibyte UTF-8 compound rewrites (#383) mpecan/tokf#395

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lexer): compute span ends by character count, not byte length#72

fix(lexer): compute span ends by character count, not byte length#72
mpecan merged 1 commit into
mainfrom
fix/multibyte-span-offsets

mpecan commented Jun 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mpecan commented Jun 20, 2026

Summary

Example (before)

After

The fix

Downstream impact

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant