Contributor

@simonwuelker simonwuelker commented Jul 23, 2025

Currently, named character references are implemented using a phf map that is queried repeatedly, once per input character. This works, but performance is suboptimal and the map has a significant impact on binary size.

Traversing a DAFSA that is generated at compile time makes tokenizing named character references 30% faster. This technique is described in https://www.ryanliptak.com/blog/better-named-character-reference-tokenization/. For illustration, a reduced version of the DAFSA can be viewed here.
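To make the idea concrete, here is a minimal sketch of such a traversal. It is not the code from this PR: the `Node` layout, the hand-written `NODES` table (covering only `amp`, `amp;`, `gt`, `gt;`, `lt`, `lt;`) and the `longest_match` helper are all hypothetical and only loosely follow the scheme in the linked blog post. Each node stores how many words can be completed through it, so skipping a sibling adds that count to a running index, and the index of a matched word doubles as a position in a plain result array.

```rust
/// One node of a hand-written toy DAFSA (hypothetical layout, not the PR's).
#[derive(Clone, Copy)]
struct Node {
    ch: u8,             // byte consumed when entering this node
    number: u8,         // how many words can be completed through this node
    end_of_word: bool,  // a word ends exactly here
    last_sibling: bool, // last entry of its sibling list
    first_child: u8,    // index of the first child, 0 = no children
}

const fn node(ch: u8, number: u8, end_of_word: bool, last_sibling: bool, first_child: u8) -> Node {
    Node { ch, number, end_of_word, last_sibling, first_child }
}

// Words, in sorted order: amp, amp;, gt, gt;, lt, lt;
const NODES: [Node; 8] = [
    node(0, 0, false, true, 1),     // 0: root placeholder
    node(b'a', 2, false, false, 4), // 1
    node(b'g', 2, false, false, 6), // 2
    node(b'l', 2, false, true, 6),  // 3
    node(b'm', 2, false, true, 5),  // 4
    node(b'p', 2, true, true, 7),   // 5
    node(b't', 2, true, true, 7),   // 6: shared by "gt" and "lt"
    node(b';', 1, true, true, 0),   // 7: shared by all three prefixes
];

// Result characters, indexed by the (1-based) word index minus one.
const RESULTS: [char; 6] = ['&', '&', '>', '>', '<', '<'];

/// Returns the number of bytes consumed and the character for the longest
/// entity name that prefixes `input` (the leading '&' is assumed stripped).
fn longest_match(input: &str) -> Option<(usize, char)> {
    let mut list = NODES[0].first_child as usize;
    let mut index = 0usize;
    let mut best = None;
    for (consumed, &byte) in input.as_bytes().iter().enumerate() {
        if list == 0 {
            break; // no outgoing edges left
        }
        let mut i = list;
        let found = loop {
            let n = NODES[i];
            if n.ch == byte {
                break Some(n);
            }
            if n.last_sibling {
                break None;
            }
            // Every word reachable through a skipped sibling sorts earlier.
            index += n.number as usize;
            i += 1;
        };
        let Some(n) = found else { break };
        if n.end_of_word {
            index += 1;
            best = Some((consumed + 1, RESULTS[index - 1]));
        }
        list = n.first_child as usize;
    }
    best
}
```

With this table, `longest_match("lt;")` consumes three bytes, while `longest_match("ltx")` falls back to the two-byte legacy match `lt`. Remembering the last end-of-word position is what gives the tokenizer its longest-match behaviour without querying a map once per character.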

Apologies for the big change. If it's too hard to review we could also merge the DAFSA incrementally. Most of the diff is the list of named entities being moved around and a benchmark file being added.

I have not looked into how this affects the binary size. Some more savings are possible by packing the array of result characters.

Do not merge yet - needs servo companion PR.

This is a breaking change for markup5ever and web_atoms.

@nicoburns
Contributor

Would it be possible to separate out the DAFSA so that it doesn't depend on markup5ever traits? A small/fast HTML entity parser (possibly following a "sans-io" design?) seems like a really useful standalone crate.

btw, is a "DAFSA" the same thing as what burntsushi calls a "DFA" in his blogposts on the regex crate?

@simonwuelker
Contributor Author

> Would it be possible to separate out the DAFSA so that it doesn't depend on markup5ever traits? A small/fast HTML entity parser (possibly following a "sans-io" design?) seems like a really useful standalone crate.

Yeah sure, I wasn't (and am still not) sure if adding another crate is worth it. Though it would likely only require very little maintenance, because the list of named entities will never change...

@simonwuelker
Contributor Author

> btw, is a "DAFSA" the same thing as what burntsushi calls a "DFA" in his blogposts on the regex crate?

Pretty much, yes - a DFA is a deterministic finite state automaton. DAFSAs are the subset of DFAs that are acyclic (DAFSA = deterministic acyclic finite state automaton), which means they can only accept a finite set of strings.
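As a small illustration of the difference, the sketch below checks whether a transition table contains a cycle. The `Transitions` type and the `is_acyclic` helper are made up for this comment and are not part of the PR or of the regex crate.

```rust
/// transitions[s] lists the (byte, target state) edges leaving state `s`.
type Transitions = Vec<Vec<(u8, usize)>>;

/// Returns true if no state can reach itself, i.e. the automaton is a DAFSA
/// rather than a general DFA.
fn is_acyclic(transitions: &Transitions) -> bool {
    // Colors: 0 = unvisited, 1 = on the current DFS path, 2 = fully explored.
    fn visit(state: usize, t: &Transitions, color: &mut [u8]) -> bool {
        match color[state] {
            1 => return false, // reached a state already on the path: cycle
            2 => return true,  // already known to be cycle-free
            _ => {}
        }
        color[state] = 1;
        for &(_, next) in &t[state] {
            if !visit(next, t, color) {
                return false;
            }
        }
        color[state] = 2;
        true
    }

    let mut color = vec![0u8; transitions.len()];
    (0..transitions.len()).all(|s| visit(s, transitions, &mut color))
}
```

An automaton for the words `a` and `ab` (state 0 to 1 on `a`, state 1 to 2 on `b`) passes this check, whereas a one-state DFA for `a*` that loops on `a` does not.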

@simonwuelker simonwuelker marked this pull request as ready for review July 30, 2025 18:00
Comment on lines +258 to +259
// Parse the list of named entities from https://html.spec.whatwg.org/entities.json
let input_file = BufReader::new(File::open("build/entities.json").unwrap());
Contributor

Consider converting the JSON file into a Rust file. Then you wouldn't need to depend on serde in the build.rs.
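A rough sketch of what that could look like, assuming a checked-in file such as `build/entities.rs` (hypothetical name) that the build script pulls in with `include!` or a `#[path]` module. The `ENTITIES` constant and the `lookup` helper are illustrative names, and only a handful of entries are shown.

```rust
/// Hypothetical checked-in table: entity name and its code points,
/// mirroring the keys and `codepoints` arrays of entities.json.
pub const ENTITIES: &[(&str, &[u32])] = &[
    ("&amp", &[0x26]),
    ("&amp;", &[0x26]),
    ("&gt;", &[0x3E]),
    ("&lt;", &[0x3C]),
    ("&NotEqualTilde;", &[0x2242, 0x0338]),
];

/// Linear lookup, good enough for a build script that runs once per build.
pub fn lookup(name: &str) -> Option<&'static [u32]> {
    ENTITIES
        .iter()
        .find(|(entity, _)| *entity == name)
        .map(|(_, code_points)| *code_points)
}
```

Since the data is plain Rust constants, the build script needs no JSON parser, and the table only has to be regenerated if the upstream list ever changes.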

Comment on lines +16 to +18
/// For memory efficiency reasons, this is packed in 32 bits. The memory representation is as follows:
/// * 8 bits: code point
/// * 8 bits: hash value
Contributor

This is missing part of the memory layout
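For reference, this is the kind of packing the doc comment describes, as a sketch only. The two 8-bit fields are the ones listed in the comment; their bit positions are an assumption, and the remaining 16 bits are kept as an opaque `rest` value because the quoted comment does not say what they hold. `PackedNode` is a hypothetical name.

```rust
/// A 32-bit packed node. Bit positions are assumed for illustration:
/// bits 0..8 = code point, bits 8..16 = hash value, bits 16..32 = undocumented.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PackedNode(u32);

impl PackedNode {
    fn new(code_point: u8, hash_value: u8, rest: u16) -> Self {
        PackedNode((code_point as u32) | ((hash_value as u32) << 8) | ((rest as u32) << 16))
    }

    /// Lowest 8 bits.
    fn code_point(self) -> u8 {
        (self.0 & 0xFF) as u8
    }

    /// Next 8 bits.
    fn hash_value(self) -> u8 {
        ((self.0 >> 8) & 0xFF) as u8
    }

    /// Upper 16 bits, whose meaning is not documented in the quoted comment.
    fn rest(self) -> u16 {
        (self.0 >> 16) as u16
    }
}
```

Spelling out every field and its bit range in the doc comment would let readers check accessors like these against the documented layout.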

Labels
V-breaking Breaking change