rustdoc-search: search backend with partitioned suffix tree #144476
Draft: notriddle wants to merge 2 commits into rust-lang:master from notriddle:notriddle/stringdex
+8,002
−5,033
Labels
A-CI
Area: Our Github Actions CI
A-rustdoc-search
Area: Rustdoc's search feature
A-testsuite
Area: The testsuite used to check the correctness of rustc
S-waiting-on-author
Status: This is awaiting some action (such as code changes or more information) from the author.
T-infra
Relevant to the infrastructure team, which will review and decide on the PR/issue.
T-rustdoc
Relevant to the rustdoc team, which will review and decide on the PR/issue.
T-rustdoc-frontend
Relevant to the rustdoc-frontend team, which will review and decide on the web UI/UX output.
Before: https://notriddle.com/windows-docs-rs/doc-old/windows/
After: https://notriddle.com/windows-docs-rs/doc/windows/
Summary
Rewrites the rustdoc search engine to use an indexed data structure, factored out as a crate called stringdex. The index lets it perform modified Levenshtein distance calculations, substring matches, and prefix matches with an algorithm that is reasonably efficient and, more importantly, incremental.
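To make the idea of an index that answers both prefix and substring queries concrete, here is a minimal sketch. It is not stringdex's data structure: it swaps the partitioned suffix tree for a naive sorted list of suffixes, and the names SuffixIndex and search are hypothetical. The point is only that once every suffix is sorted, a substring query turns into a binary search for a prefix.

```rust
use std::collections::BTreeMap;

/// Naive stand-in for a suffix tree: every suffix of every name, sorted.
/// (Hypothetical type; stringdex's real on-disk layout is different.)
pub struct SuffixIndex {
    names: Vec<String>,
    // (suffix text, index into `names`, byte offset of the suffix in that name)
    suffixes: Vec<(String, usize, usize)>,
}

impl SuffixIndex {
    pub fn new(names: &[&str]) -> Self {
        let names: Vec<String> = names.iter().map(|n| n.to_lowercase()).collect();
        let mut suffixes = Vec::new();
        for (id, name) in names.iter().enumerate() {
            for (offset, _) in name.char_indices() {
                suffixes.push((name[offset..].to_string(), id, offset));
            }
        }
        suffixes.sort();
        SuffixIndex { names, suffixes }
    }

    /// Returns (lowercased name, is_prefix_match) for every name containing
    /// `query`. Prefix matches are listed first, then names alphabetically.
    pub fn search(&self, query: &str) -> Vec<(&str, bool)> {
        let query = query.to_lowercase();
        // All suffixes that start with `query` are contiguous in sorted order,
        // beginning at the first suffix that is >= `query`.
        let start = self
            .suffixes
            .partition_point(|(s, _, _)| s.as_str() < query.as_str());
        let mut hits: BTreeMap<&str, bool> = BTreeMap::new();
        for (suffix, id, offset) in &self.suffixes[start..] {
            if !suffix.starts_with(query.as_str()) {
                break;
            }
            // A suffix at offset 0 is the whole name, so this is a prefix match.
            *hits.entry(self.names[*id].as_str()).or_insert(false) |= *offset == 0;
        }
        let mut out: Vec<(&str, bool)> = hits.into_iter().collect();
        out.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
        out
    }
}
```

Searching an index built from open, reopen, OpenOptions, and pen for "pen" would report pen as a prefix match and the other three as substring matches. A real index would avoid storing each suffix as its own string, which is where most of the complexity comes from.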
Motivation
Fixes #131156
While the windows-rs crate is definitely the worst offender, I've noticed performance problems with the compiler crates as well. It makes no sense for rustdoc-search to have poor performance: it's basically a spell checker, and those have been usable since the '90s.
Stringdex is particularly designed to quickly return exact matches, to always report those matches first, and to never, ever place new matches on top of old ones. It also tries to yield to the event loop occasionally as it runs. This way, you can click the exactly-matched result before the rest of the search finishes running.
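The ordering guarantee described above can be sketched as a staged, append-only search. This is an illustration under assumptions, not stringdex's API: StagedSearch, MatchKind, and step are hypothetical names, the fuzzy (edit-distance) stage is left out, and each call to step stands in for one slice of work between yields to the event loop.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MatchKind {
    Exact,
    Prefix,
    Substring,
}

/// Hypothetical staged search: each `step` runs one stage and only ever
/// appends to the result list, so earlier results never move.
pub struct StagedSearch<'a> {
    names: &'a [&'a str],
    query: String,
    stage: usize,
    shown: Vec<(&'a str, MatchKind)>,
}

impl<'a> StagedSearch<'a> {
    pub fn new(names: &'a [&'a str], query: &str) -> Self {
        StagedSearch { names, query: query.to_string(), stage: 0, shown: Vec::new() }
    }

    /// Runs the next stage. Returns `true` while more stages remain; a caller
    /// in a browser would yield to the event loop between calls.
    pub fn step(&mut self) -> bool {
        let kind = match self.stage {
            0 => MatchKind::Exact,
            1 => MatchKind::Prefix,
            2 => MatchKind::Substring,
            _ => return false,
        };
        for &name in self.names {
            let matches = match kind {
                MatchKind::Exact => name == self.query,
                MatchKind::Prefix => name.starts_with(self.query.as_str()),
                MatchKind::Substring => name.contains(self.query.as_str()),
            };
            let already_shown = self.shown.iter().any(|(n, _)| *n == name);
            if matches && !already_shown {
                // Append only: results from earlier stages keep their position.
                self.shown.push((name, kind));
            }
        }
        self.stage += 1;
        self.stage < 3
    }

    pub fn results(&self) -> &[(&'a str, MatchKind)] {
        &self.shown
    }
}
```

With the names open_options, open, reopen, and close and the query "open", the first step shows only the exact match open. Later steps append open_options and then reopen beneath it, so the exact match can be clicked before the search finishes.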
Explanation
A longer description of how name search works can be found in stringdex's HACKING.md file.
Type search is done by performing a name search on each element, then performing bitmap operations to narrow down a list of potential matches, then performing type unification on each match.
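The bitmap narrowing step can be sketched roughly as follows. The assumptions are mine: each element of the query has already been resolved by name search to a set of function IDs, plain Vec<u64> bitsets stand in for whatever bitmap representation the real index uses, and the names Bitmap and candidate_functions are hypothetical. Type unification, which runs only on the surviving candidates, is not shown.

```rust
/// Hypothetical fixed-size bitset; bit `i` set means function `i` mentions
/// a type that matched one element of the query.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Bitmap {
    words: Vec<u64>,
}

impl Bitmap {
    /// Builds a bitmap over IDs `0..universe`. Panics if an ID is out of range.
    pub fn from_ids(ids: &[usize], universe: usize) -> Self {
        let mut words = vec![0u64; (universe + 63) / 64];
        for &id in ids {
            words[id / 64] |= 1u64 << (id % 64);
        }
        Bitmap { words }
    }

    /// Intersection: functions present in both bitmaps.
    pub fn and(&self, other: &Bitmap) -> Bitmap {
        let words = self
            .words
            .iter()
            .zip(&other.words)
            .map(|(a, b)| a & b)
            .collect();
        Bitmap { words }
    }

    /// Lists the set bits in ascending order.
    pub fn ids(&self) -> Vec<usize> {
        let mut out = Vec::new();
        for (i, &word) in self.words.iter().enumerate() {
            for bit in 0..64 {
                if word & (1u64 << bit) != 0 {
                    out.push(i * 64 + bit);
                }
            }
        }
        out
    }
}

/// Intersects one bitmap per query element. Only the returned IDs would be
/// handed to the (much more expensive) type unification step.
pub fn candidate_functions(per_element: &[Bitmap]) -> Vec<usize> {
    let mut iter = per_element.iter();
    let first = match iter.next() {
        Some(first) => first.clone(),
        None => return Vec::new(),
    };
    iter.fold(first, |acc, next| acc.and(next)).ids()
}
```

For example, if one query element matches functions 1, 5, 70, and 99 and another matches 5, 70, and 80, only functions 5 and 70 remain as candidates for unification.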
Drawbacks
It's rather complex, and takes up more disk space than the current flat list of strings.
Rationale and alternatives
Instead of a suffix tree, I could've used a different approximate matching data structure. I didn't do that because I wanted to keep the current behavior (for example, a straightforward trigram index won't match oepn like the current system does).
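The oepn example can be checked with a small sketch. It assumes the modified distance counts a swap of two adjacent characters as a single edit (the optimal string alignment variant); whether that matches rustdoc's exact distance function is an assumption, and the function names are hypothetical.

```rust
use std::collections::HashSet;

/// All three-character windows of `s`.
pub fn trigrams(s: &str) -> HashSet<String> {
    let chars: Vec<char> = s.chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Levenshtein distance extended with adjacent transpositions
/// (optimal string alignment). Assumed here, not taken from rustdoc.
pub fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut d = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 0..=a.len() {
        d[i][0] = i;
    }
    for j in 0..=b.len() {
        d[0][j] = j;
    }
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            let mut best = (d[i - 1][j] + 1) // deletion
                .min(d[i][j - 1] + 1) // insertion
                .min(d[i - 1][j - 1] + cost); // substitution or match
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                best = best.min(d[i - 2][j - 2] + 1); // adjacent transposition
            }
            d[i][j] = best;
        }
    }
    d[a.len()][b.len()]
}
```

The trigrams of open are ope and pen, while those of oepn are oep and epn, so a plain trigram index sees no overlap at all. Under the distance above, the two strings are a single transposition apart.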
Prior art
Sherlodoc is based on a similar concept, but they:
Future possibilities
Low-level optimization in stringdex
There are half a dozen low-level optimizations that I still need to explore. I haven't done them yet, because I've been working on bug fixes and rebasing on rustdoc's side, and a more solid and diverse test suite for stringdex itself.
Improved recall in type-driven search
Right now, type-driven search performs very strict matching. It's very precise, but misses a lot of things people would want.
What I'm not sure about is whether to focus more on edit-distance-based approaches or on type-theoretical approaches. Both offer avenues for improvement, but edit distance is going to be faster, while type checking is going to be more precise.
For example, a type-theoretical improvement would fix Iterator<T>, (T -> U) -> Iterator<U> to give Iterator::map, because it would recognize that the Map struct implements the Iterator trait. I don't know of any clean way to get this result to work without implementing significant type checking logic in search.js, and an edit-distance-based "dirty" approach would likely give a bunch of other results on top of this one.
Full-text search
Once you've got this fuzzy dictionary matching to work, the logical next step is to implement some kind of information retrieval-based approach to phrase matching.
Like applying edit distance to types, phrase search gets you significantly better recall, but with a few major drawbacks:
Neither of these problems is a deal-breaker, but they're worth keeping in mind.