Conversation

@LucaCappelletti94 (Contributor)

This PR implements a significant architectural refactoring by moving whitespace filtering from the parser to the tokenizer. Instead of emitting whitespace tokens (spaces, tabs, newlines, comments) and filtering them throughout the parser logic, the tokenizer now consumes whitespace during tokenization and never emits these tokens.

While some duplicated logic still remains in the parser (to be addressed in future PRs), this change eliminates a substantial amount of looping overhead. It also lays the groundwork for a cleaner streaming version, in which tokens are consumed as statements are parsed, with no parser-side token buffer and only local context passed between parser function calls.
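
For a sense of the mechanics, here is a minimal sketch of the idea: whitespace is consumed inside the tokenizer and never surfaces as a token. All names here (`Token`, `Tokenizer`, `next_token`) are illustrative, not the actual sqlparser internals:

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, PartialEq)]
enum Token {
    Word(String),
    Eof,
}

struct Tokenizer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Tokenizer<'a> {
    fn new(sql: &'a str) -> Self {
        Self { chars: sql.chars().peekable() }
    }

    fn next_token(&mut self) -> Token {
        // Whitespace is skipped here, once, instead of being emitted
        // and re-filtered at every call site in the parser.
        while matches!(self.chars.peek(), Some(c) if c.is_whitespace()) {
            self.chars.next();
        }
        let mut word = String::new();
        while matches!(self.chars.peek(), Some(c) if !c.is_whitespace()) {
            word.push(self.chars.next().unwrap());
        }
        if word.is_empty() { Token::Eof } else { Token::Word(word) }
    }
}

fn main() {
    let mut t = Tokenizer::new("SELECT  \t 1\n");
    assert_eq!(t.next_token(), Token::Word("SELECT".into()));
    assert_eq!(t.next_token(), Token::Word("1".into()));
    assert_eq!(t.next_token(), Token::Eof);
}
```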

Fixes #2076

Motivation

As discussed in #2076, whitespace tokens were being filtered at numerous points throughout the parser. This approach had several drawbacks:

  • Poor separation of concerns: Whitespace handling was scattered across both tokenizer and parser
  • Memory overhead: Whitespace tokens were stored in memory unnecessarily
  • Code duplication: Multiple loops throughout the parser to skip whitespace tokens, looking ahead or backwards for non-whitespace tokens
  • Performance: Each token access required checking and skipping whitespace tokens

The parser had extensive whitespace-handling logic scattered throughout:

  • Functions with whitespace-skipping loops
  • Special variant functions that are now obsolete

Since SQL is not a whitespace-sensitive language (unlike Python), it should be safe to remove whitespace tokens entirely after tokenization.

Handling Edge Cases

While SQL is generally not whitespace-sensitive, there are specific edge cases that require careful consideration:

1. PostgreSQL COPY FROM STDIN

The COPY FROM STDIN statement requires preserving the actual data content, which may include meaningful whitespace and newlines. The data section is treated as raw input that should be parsed according to the specified format (tab-delimited, CSV, etc.).

Solution: The tokenizer now handles this properly by consuming the data section as a single token. The parser then parses the body of the CSV-like payload itself, which was not done correctly before this refactoring. I have extended the associated tests accordingly.
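
For illustration, a rough sketch of consuming that raw data section, assuming a tab-delimited format and PostgreSQL's `\.` end-of-data marker (`read_copy_data` is an invented helper, not the code in this PR):

```rust
// Sketch: treat the COPY FROM STDIN payload as one raw unit, then split
// it into tab-delimited rows. PostgreSQL terminates the data section
// with `\.` on a line of its own. Illustrative only.
fn read_copy_data(raw: &str) -> Vec<Vec<String>> {
    raw.lines()
        .take_while(|line| *line != "\\.")
        .map(|line| line.split('\t').map(str::to_string).collect())
        .collect()
}

fn main() {
    let raw = "1\talice\n2\tbob\n\\.\n";
    assert_eq!(
        read_copy_data(raw),
        vec![
            vec!["1".to_string(), "alice".to_string()],
            vec!["2".to_string(), "bob".to_string()],
        ]
    );
}
```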

2. Hyphenated and path identifiers

The tokenizer now includes enhanced logic for hyphenated identifier parsing with proper validation (a sketch of these checks follows the list):

  • Detects when hyphens/paths/tildes are part of identifiers vs. operators
  • Validates that identifiers don't start with digits after hyphens
  • Ensures identifiers don't end with trailing hyphens
  • Handles the whitespace-dependent context correctly
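
A minimal sketch of those checks, with an invented helper name (`is_valid_hyphenated_ident`) and restricted to hyphens for brevity:

```rust
// Sketch of the validation rules listed above. Illustrative only:
// the real tokenizer also handles paths and tildes, and is dialect-aware.
fn is_valid_hyphenated_ident(s: &str) -> bool {
    // Reject trailing hyphens (`foo-` is `foo` followed by an operator).
    if s.ends_with('-') {
        return false;
    }
    // No part may be empty or begin with a digit after a hyphen,
    // so `a-1` stays a subtraction rather than an identifier.
    s.split('-')
        .all(|part| part.chars().next().map_or(false, |c| !c.is_ascii_digit()))
}

fn main() {
    assert!(is_valid_hyphenated_ident("my-table"));
    assert!(!is_valid_hyphenated_ident("my-1table")); // digit after hyphen
    assert!(!is_valid_hyphenated_ident("my-table-")); // trailing hyphen
}
```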

@LucaCappelletti94 (Contributor, Author)

The newly added dependency csv does not have an analogous no-std version, so the alloc-only test is currently failing. I will see whether a PR adding alloc-only support to csv is feasible. I have used csv because I do not want to re-implement CSV-like document parsing in sqlparser.

@LucaCappelletti94 (Contributor, Author)

I have decided to replace the csv crate with a custom solution: the crate currently does not provide an alloc-only feature, and a PR to add one looks rather involved.
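
For context, a dependency-free, alloc-only record splitter of that flavor could look roughly like this (an illustrative sketch, not the code that landed in this PR):

```rust
// Sketch: splitting one CSV-like record using only `alloc`, honoring
// double quotes around delimiters. Illustrative only; no handling of
// escaped quotes or multi-line records.
extern crate alloc;
use alloc::{string::String, vec::Vec};

fn split_record(line: &str, delimiter: char) -> Vec<String> {
    let mut fields = Vec::new();
    let mut field = String::new();
    let mut in_quotes = false;
    for c in line.chars() {
        match c {
            '"' => in_quotes = !in_quotes,
            c if c == delimiter && !in_quotes => fields.push(core::mem::take(&mut field)),
            c => field.push(c),
        }
    }
    fields.push(field);
    fields
}

fn main() {
    assert_eq!(split_record("a,\"b,c\",d", ','), ["a", "b,c", "d"]);
}
```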

Comment on lines -455 to -461
```rust
pub enum Whitespace {
    Space,
    Newline,
    Tab,
    SingleLineComment { comment: String, prefix: String },
    MultiLineComment(String),
}
```
@Viicos

While it would make sense for the parser to not have to deal with whitespace, including comments, it would still be useful to preserve comment tokens. This is particularly valuable when providing IDE completions, where you don't want anything to be suggested inside a comment.

@LucaCappelletti94 (Contributor, Author)

Hi @Viicos, I have nothing against comments - in fact, a PhD student in my team is working on the #2069 PR for handling them in a more structured manner in the AST. They just should not exist in the form of whitespace. Could you kindly expand on your IDE completions note? I am not sure I understood it.

@Viicos

> They just should not exist in the form of whitespace.

Agree, maybe they could be made their own token.

> Could you kindly expand on your IDE completions note? I am not sure I understood it.

I'm currently experimenting with the tokenizer and parser, to use them in a SQL query input we have in our project at work. From a (line, col) position, I'm writing some logic to provide useful completions at this position (by completions, I mean this).

If you are currently writing a comment, the token immediately before the position is going to be a comment token, and in this case I can short-circuit and return an empty list of completions.

As an example, this is how the Ruff project (they are working on a Python language server) is doing it.

@Viicos

Actually having a comment token kind would defeat the purpose of this PR, because the logic in the parser to skip those comment tokens would be the same.

The Ruff parser solves this by having a TokenSource struct, acting as a bridge between the lexer/tokenizer and parser. It has a couple methods to bump the tokens, ignoring the trivia tokens (in our case, that would only be the comment tokens). Maybe we could take inspiration from this pattern?
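
A rough sketch of such a bridge, with names borrowed loosely from Ruff's design rather than any actual sqlparser API:

```rust
// Sketch: the parser pulls tokens through a TokenSource that filters
// trivia (here, comments) while keeping it around for IDE features.
// All names are illustrative.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    Comment(String),
    Eof,
}

struct TokenSource {
    tokens: Vec<Token>,
    pos: usize,
    /// Trivia set aside for completions/formatting, never seen by the parser.
    comments: Vec<Token>,
}

impl TokenSource {
    fn new(tokens: Vec<Token>) -> Self {
        Self { tokens, pos: 0, comments: Vec::new() }
    }

    /// Return the next significant token, stashing comment tokens aside.
    fn bump(&mut self) -> Token {
        loop {
            let tok = self.tokens.get(self.pos).cloned().unwrap_or(Token::Eof);
            self.pos += 1;
            match tok {
                Token::Comment(_) => self.comments.push(tok),
                other => return other,
            }
        }
    }
}

fn main() {
    let mut src = TokenSource::new(vec![
        Token::Comment("-- hi".into()),
        Token::Word("SELECT".into()),
    ]);
    assert_eq!(src.bump(), Token::Word("SELECT".into()));
    assert_eq!(src.comments.len(), 1);
}
```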

@LucaCappelletti94 (Contributor, Author)

Somewhat - the idea in the aforementioned PR was to have the concept of leading and interstitial comments, with the former getting into the AST and the latter being dropped as whitespace.

@Viicos

I see; I still think it would be important for any kind of comment to be preserved as tokens (my use case is related to completions, but it can also be useful if you want to implement a formatting mechanism that preserves comments, or if a comment is used to e.g. specify a linter directive -- presumably this would be relevant for interstitial comments as well).

I commented some thoughts on this PR ;)
