Skip to content

fix(lexer): compute span ends by character count, not byte length#72

Merged
mpecan merged 1 commit into
mainfrom
fix/multibyte-span-offsets
Jun 20, 2026
Merged

fix(lexer): compute span ends by character count, not byte length#72
mpecan merged 1 commit into
mainfrom
fix/multibyte-span-offsets

Conversation

@mpecan

@mpecan mpecan commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Summary

The lexer scans the source as a Vec<char>, so token positions and AST Spans are character indices. But span ends were computed as pos + value.len() — adding the token value's byte length to a character position. For any token containing multibyte UTF-8, the end overshoots by the byte/char delta, so Node::source_text returns the wrong slice for every non-final token.

The existing source_text_multibyte_utf8 test never caught this because its multibyte word was the last token, where the overshoot clamps harmlessly to the source end.

Example (before)

echo "✓ valid" || echo "✗ bad"
  Command 0..16  source_text = `echo "✓ valid" |`   ← wrong (bleeds into the `||`)

After

  Command 0..14  source_text = `echo "✓ valid"`     ← correct

The fix

Use value.chars().count() at the four sites that derive an end offset from a start position:

  • lexer/mod.rslast_token_end (the source of most node ends via parser/mod.rs:134)
  • parser/simple_command.rs — word-node span
  • parser/case_select.rs — pattern word span
  • token.rsToken::adjacent_to

…and correct the Span doc comment to state offsets are character indices.

Downstream impact

Surfaced as tokf #383: compound shell commands mixing an emoji// echo with a rewritten pipe were spliced into invalid shell (| ||, dropped ; ). tokf consumes these spans to split compound commands.

Tests

Adds regression tests for multibyte in non-final words, compound segments, pipelines, and astral-plane emoji (🎉). Full suite: 262 passing, clippy + fmt clean. No ASCII spans change (char count == byte len for ASCII), so no existing test moved.

🤖 Generated with Claude Code

The lexer scans the source as a `Vec<char>`, so token positions and AST
spans are character indices. Span ends, however, were computed as
`pos + value.len()` — adding the token value's *byte* length to a
*character* position. For any token containing multibyte UTF-8 the end
overshot by the byte/char delta, so `Node::source_text` returned the
wrong slice for every non-final token (the final token's overshoot
clamped harmlessly to the source end, which is why the existing
multibyte test never caught it).

Use `value.chars().count()` at the four sites that derive an end offset
from a start position (`last_token_end`, the two direct `Span::new`
calls for word nodes, and `Token::adjacent_to`), and correct the `Span`
doc comment to state that offsets are character indices.

Surfaced downstream as tokf #383: compound shell commands mixing an
emoji/checkmark `echo` with a rewritten pipe were spliced into invalid
shell. Adds regression tests for multibyte in non-final words, compound
segments, pipelines, and astral-plane emoji.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mpecan mpecan merged commit 58b2e50 into main Jun 20, 2026
5 checks passed
@mpecan mpecan deleted the fix/multibyte-span-offsets branch June 20, 2026 08:08
mpecan pushed a commit that referenced this pull request Jun 20, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.2.1](rable-v0.2.0...rable-v0.2.1)
(2026-06-20)


### Bug Fixes

* **lexer:** compute span ends by character count, not byte length
([#72](#72))
([58b2e50](58b2e50))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>
mpecan added a commit to mpecan/tokf that referenced this pull request Jun 20, 2026
Closes #383

## Summary

Compound shell commands that mixed a multibyte UTF-8 character (`✓`,
`✗`, emoji, CJK) in one segment with a chain operator
(`&&`/`||`/`;`/newline) were rewritten into **invalid shell** — spurious
`|`/`;` tokens were spliced at the wrong byte offsets, so the command
failed to parse. This broke every `tokf hook`-driven rewrite that mixed
an emoji/checkmark `echo` with a filterable pipe.

### Before / after

```
in:   echo "✓ valid" || echo "✗ bad"; ls ~/.claude/skills/ | grep tokf
was:  echo "✓ valid" | |echo "✗ bad";tokf run --baseline-pipe 'grep tokf' ls ~/.claude/skills/   ← invalid
now:  echo "✓ valid" || echo "✗ bad"; tokf run --baseline-pipe 'grep tokf' ls ~/.claude/skills/   ← valid
```

## Root cause

Two layers:

1. **Upstream (rable):** AST spans are character indices (the lexer
scans a `Vec<char>`), but span *ends* were computed as `pos +
value.len()` — a char position plus a **byte** length. For multibyte
input the end overshot, so even `Node::source_text` returned the wrong
slice for non-final tokens. Fixed in **mpecan/rable#72**, released as
**rable 0.2.1**.
2. **tokf:** `compound_segments`/`flatten_segments` sliced the
byte-indexed source string by raw (character) span values. Now routed
through a `char_to_byte` helper that mirrors rable's own conversion.

The issue's suspected culprit (`rules.rs:75`, the `{rest}` slice) was a
red herring — regex offsets are byte offsets by design.

## Changes

- `char_to_byte` helper + the two separator-slice sites in
`rewrite/bash_ast.rs`.
- Bump `rable` 0.1.15 → 0.2.1 (absorbs the 0.2.0 API surface changes;
tokf only uses `rable::parse` + `rable::ast`).

## Tests

Three layers of regression coverage — all previously failing, now green:
- **AST round-trip + exact separators** — new
`bash_ast_multibyte_tests.rs` (split out to stay under the 700-line file
limit).
- **In-process rewrite** — `tests_compound.rs`.
- **Real binary an LLM hook invokes** — `cli_rewrite.rs`.

`cargo test -p tokf` (989 lib + 59 CLI), clippy, and fmt all clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant