Skip to content

rustdoc-search: search backend with partitioned suffix tree #144476

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

notriddle
Copy link
Contributor

@notriddle notriddle commented Jul 26, 2025

Before: https://notriddle.com/windows-docs-rs/doc-old/windows/

After: https://notriddle.com/windows-docs-rs/doc/windows/

Summary

Rewrites the rustdoc search engine to use an indexed data structure, factored out as a crate called stringdex, that allows it to perform modified-levenshtein distance calculations, substring matches, and prefix matches in a reasonably efficient, and, more importantly, incremental algorithm.

Motivation

Fixes #131156

While the windows-rs crate is definitely the worst offender, I've noticed performance problems with the compiler crates as well. It makes no sense for rustdoc-search to have poor performance: it's basically a spell checker, and those have been usable since the 90's.

Stringdex is particularly designed to quickly return exact matches, to always report those matches first, and to never, ever place new matches on top of old ones. It also tries to yield to the event loop occasionally as it runs. This way, you can click the exactly-matched result before the rest of the search finishes running.

Explanation

A longer description of how name search works can be found in stringdex's HACKING.md file.

Type search is done by performing a name search on each element, then performing bitmap operations to narrow down a list of potential matches, then performing type unification on each match.

Drawbacks

It's rather complex, and takes up more disk space than the current flat list of strings.

Rationale and alternatives

Instead of a suffix tree, I could've used a different approximate matching data structure. I didn't do that because I wanted to keep the current behavior (for example, a straightforward trigram index won't match oepn like the current system does).

Prior art

Sherlodoc is based on a similar concept, but they:

  • use edit distance over a suffix tree for type-based search, instead of the binary matching that's implemented here
  • use substring matching for name-based search, but not fuzzy name matching
  • actually implement body text search, which is a potential-future feature, but not implemented in this PR

Future possibilities

Low-level optimization in stringdex

There are half a dozen low-level optimizations that I still need to explore. I haven't done them yet, because I've been working on bug fixes and rebasing on rustdoc's side, and a more solid and diverse test suite for stringdex itself.

  • Right now, a search tree node's children are parallel lists of bytes and some kind of pointer. The pointer is reasonably optimized, but I don't currently do anything smart for the edge label itself, and I know that a bitmap would be able to more compactly represent most nodes with more than three child branches if I selected the 24 most common letters of the alphabet (for arbitrary bytes, the cutoff is ten child branches).
  • The encoding uses length-delimited fields with one byte. This pessimises the very common case where a node has zero of whatever it is I'm counting. I'd be better off with a bit flag to pick a representation that skips the field entirely.
  • Stringdex decides whether to bundle two nodes into the same file based on size. To figure out a node's size, I have to run compression on it. This is probably slower than it needs to be.
  • Stack compression is limited to the same 256-slot sliding windows as backref compression, and it doesn't have to be. (stack and backref compression are used to optimize the representation of the edge pointer from a parent node to its child; backref uses one byte, while stack is entirely implicit)
  • The JS-side decoder is pretty naive. It performs unnecessary hash table lookups when decoding compressed nodes, and retains a list of hashes that it doesn't need. It needs to calculate the hashes in order to construct the merkle tree correctly, but it doesn't need to keep them.
  • Data compression happens at the end, while emitting the node. This means it's not being counted when deciding on how to bundle, which is pretty dumb.

Improved recall in type-driven search

Right now, type-driven search performs very strict matching. It's very precise, but misses a lot of things people would want.

What I'm not sure about is whether to focus more on edit-distance-based approaches, or to focus on type-theoretical approaches. Both gives avenues to improve, but edit distance is going to be faster while type checking is going to be more precise.

For example, a type theoretical improvement would fix Iterator<T>, (T -> U) -> Iterator<U> to give Iterator::map, because it would recognize that the Map struct implements the Iterator trait. I don't know of any clean way to get this result to work without implementing significant type checking logic in search.js, and an edit-distance-based "dirty" approach would likely give a bunch of other results on top of this one.

Full-text search

Once you've got this fuzzy dictionary matching to work, the logical next step is to implement some kind of information retrieval-based approach to phrase matching.

Like applying edit distance to types, phrase search gets you significantly better recall, but with a few major drawbacks:

  • You have to pick between index bloat and the use of stopwords. Stopwords are bad because they might actually be important (try searching "if let" in mdBook if you're feeling brave), but without them, you spend a lot of space on text that doesn't matter.
  • Example code also tends to have a lot of irrelevant stuff in it. Like stop words, we'd have to pick potentially-confusing or bloat.

Neither of these problems are deal-breakers, but they're worth keeping in mind.

@rustbot rustbot added A-CI Area: Our Github Actions CI A-rustdoc-search Area: Rustdoc's search feature A-testsuite Area: The testsuite used to check the correctness of rustc S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. T-rustdoc-frontend Relevant to the rustdoc-frontend team, which will review and decide on the web UI/UX output. labels Jul 26, 2025
@notriddle notriddle removed A-testsuite Area: The testsuite used to check the correctness of rustc T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. A-CI Area: Our Github Actions CI labels Jul 26, 2025
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/stringdex branch from 3f21c90 to a3603c7 Compare July 28, 2025 16:44
@rustbot rustbot added A-CI Area: Our Github Actions CI A-testsuite Area: The testsuite used to check the correctness of rustc T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. labels Jul 28, 2025
@notriddle notriddle removed A-testsuite Area: The testsuite used to check the correctness of rustc T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. A-CI Area: Our Github Actions CI labels Jul 28, 2025
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/stringdex branch from a3603c7 to 278838f Compare July 28, 2025 17:09
@rustbot rustbot added A-CI Area: Our Github Actions CI A-testsuite Area: The testsuite used to check the correctness of rustc T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. labels Jul 28, 2025
@notriddle notriddle force-pushed the notriddle/stringdex branch from 278838f to 96b5862 Compare July 28, 2025 17:13
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/stringdex branch from 96b5862 to 799c605 Compare July 28, 2025 17:31
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/stringdex branch from 799c605 to 43cb8d0 Compare July 28, 2025 17:44
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/stringdex branch from 43cb8d0 to 73790db Compare July 28, 2025 18:43
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/stringdex branch from 73790db to 3abf745 Compare July 28, 2025 18:55
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/stringdex branch from 3abf745 to 1bff6c0 Compare July 28, 2025 19:42
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/stringdex branch from 1bff6c0 to 29a0c60 Compare July 28, 2025 21:06
@rust-log-analyzer

This comment has been minimized.

@notriddle notriddle force-pushed the notriddle/stringdex branch from 29a0c60 to db040ed Compare July 28, 2025 22:13
@rust-log-analyzer
Copy link
Collaborator

The job aarch64-gnu-llvm-19-1 failed! Check out the build log: (web) (plain enhanced) (plain)

Click to see the possible cause of the failure (guessed by this bot)
---- [run-make] tests/run-make/emit-shared-files stdout ----

error: rmake recipe failed to complete
status: exit status: 101
command: cd "/checkout/obj/build/aarch64-unknown-linux-gnu/test/run-make/emit-shared-files/rmake_out" && env -u RUSTFLAGS AR="ar" BUILD_ROOT="/checkout/obj/build/aarch64-unknown-linux-gnu" CARGO="/checkout/obj/build/aarch64-unknown-linux-gnu/stage1-tools-bin/cargo" CC="cc" CC_DEFAULT_FLAGS="-ffunction-sections -fdata-sections -fPIC" CXX="c++" CXX_DEFAULT_FLAGS="-ffunction-sections -fdata-sections -fPIC" HOST_RUSTC_DYLIB_PATH="/checkout/obj/build/aarch64-unknown-linux-gnu/stage2/lib" LD_LIBRARY_PATH="/checkout/obj/build/aarch64-unknown-linux-gnu/bootstrap-tools/aarch64-unknown-linux-gnu/release/deps:/checkout/obj/build/aarch64-unknown-linux-gnu/stage0/lib/rustlib/aarch64-unknown-linux-gnu/lib" LD_LIB_PATH_ENVVAR="LD_LIBRARY_PATH" LLVM_BIN_DIR="/usr/lib/llvm-19/bin" LLVM_COMPONENTS="aarch64 aarch64asmparser aarch64codegen aarch64desc aarch64disassembler aarch64info aarch64utils aggressiveinstcombine all all-targets amdgpu amdgpuasmparser amdgpucodegen amdgpudesc amdgpudisassembler amdgpuinfo amdgputargetmca amdgpuutils analysis arm armasmparser armcodegen armdesc armdisassembler arminfo armutils asmparser asmprinter avr avrasmparser avrcodegen avrdesc avrdisassembler avrinfo binaryformat bitreader bitstreamreader bitwriter bpf bpfasmparser bpfcodegen bpfdesc bpfdisassembler bpfinfo cfguard codegen codegendata codegentypes core coroutines coverage debuginfobtf debuginfocodeview debuginfodwarf debuginfogsym debuginfologicalview debuginfomsf debuginfopdb demangle dlltooldriver dwarflinker dwarflinkerclassic dwarflinkerparallel dwp engine executionengine extensions filecheck frontenddriver frontendhlsl frontendoffloading frontendopenacc frontendopenmp fuzzercli fuzzmutate globalisel hexagon hexagonasmparser hexagoncodegen hexagondesc hexagondisassembler hexagoninfo hipstdpar instcombine instrumentation interfacestub interpreter ipo irprinter irreader jitlink lanai lanaiasmparser lanaicodegen lanaidesc lanaidisassembler lanaiinfo libdriver lineeditor linker loongarch loongarchasmparser loongarchcodegen loongarchdesc loongarchdisassembler loongarchinfo lto m68k m68kasmparser m68kcodegen m68kdesc m68kdisassembler m68kinfo mc mca mcdisassembler mcjit mcparser mips mipsasmparser mipscodegen mipsdesc mipsdisassembler mipsinfo mirparser msp430 msp430asmparser msp430codegen msp430desc msp430disassembler msp430info native nativecodegen nvptx nvptxcodegen nvptxdesc nvptxinfo objcarcopts objcopy object objectyaml option orcdebugging orcjit orcshared orctargetprocess passes perfjitevents powerpc powerpcasmparser powerpccodegen powerpcdesc powerpcdisassembler powerpcinfo profiledata remarks riscv riscvasmparser riscvcodegen riscvdesc riscvdisassembler riscvinfo riscvtargetmca runtimedyld sandboxir scalaropts selectiondag sparc sparcasmparser sparccodegen sparcdesc sparcdisassembler sparcinfo support symbolize systemz systemzasmparser systemzcodegen systemzdesc systemzdisassembler systemzinfo tablegen target targetparser textapi textapibinaryreader transformutils ve veasmparser vecodegen vectorize vedesc vedisassembler veinfo webassembly webassemblyasmparser webassemblycodegen webassemblydesc webassemblydisassembler webassemblyinfo webassemblyutils windowsdriver windowsmanifest x86 x86asmparser x86codegen x86desc x86disassembler x86info x86targetmca xcore xcorecodegen xcoredesc xcoredisassembler xcoreinfo xray xtensa xtensaasmparser xtensacodegen xtensadesc xtensadisassembler xtensainfo" LLVM_FILECHECK="/usr/lib/llvm-19/bin/FileCheck" NODE="/usr/bin/node" PYTHON="/usr/bin/python3" RUSTC="/checkout/obj/build/aarch64-unknown-linux-gnu/stage2/bin/rustc" RUSTDOC="/checkout/obj/build/aarch64-unknown-linux-gnu/stage2/bin/rustdoc" SOURCE_ROOT="/checkout" TARGET="aarch64-unknown-linux-gnu" TARGET_EXE_DYLIB_PATH="/checkout/obj/build/aarch64-unknown-linux-gnu/stage2/lib/rustlib/aarch64-unknown-linux-gnu/lib" __RUSTC_DEBUG_ASSERTIONS_ENABLED="1" __STD_DEBUG_ASSERTIONS_ENABLED="1" "/checkout/obj/build/aarch64-unknown-linux-gnu/test/run-make/emit-shared-files/rmake"
stdout: none
--- stderr -------------------------------

thread 'main' panicked at /checkout/tests/run-make/emit-shared-files/rmake.rs:22:5:
assertion failed: path("invocation-only/search-index-xxx.js").exists()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
------------------------------------------


---- [run-make] tests/run-make/rustdoc-determinism stdout ----

error: rmake recipe failed to complete
status: exit status: 101
command: cd "/checkout/obj/build/aarch64-unknown-linux-gnu/test/run-make/rustdoc-determinism/rmake_out" && env -u RUSTFLAGS AR="ar" BUILD_ROOT="/checkout/obj/build/aarch64-unknown-linux-gnu" CARGO="/checkout/obj/build/aarch64-unknown-linux-gnu/stage1-tools-bin/cargo" CC="cc" CC_DEFAULT_FLAGS="-ffunction-sections -fdata-sections -fPIC" CXX="c++" CXX_DEFAULT_FLAGS="-ffunction-sections -fdata-sections -fPIC" HOST_RUSTC_DYLIB_PATH="/checkout/obj/build/aarch64-unknown-linux-gnu/stage2/lib" LD_LIBRARY_PATH="/checkout/obj/build/aarch64-unknown-linux-gnu/bootstrap-tools/aarch64-unknown-linux-gnu/release/deps:/checkout/obj/build/aarch64-unknown-linux-gnu/stage0/lib/rustlib/aarch64-unknown-linux-gnu/lib" LD_LIB_PATH_ENVVAR="LD_LIBRARY_PATH" LLVM_BIN_DIR="/usr/lib/llvm-19/bin" LLVM_COMPONENTS="aarch64 aarch64asmparser aarch64codegen aarch64desc aarch64disassembler aarch64info aarch64utils aggressiveinstcombine all all-targets amdgpu amdgpuasmparser amdgpucodegen amdgpudesc amdgpudisassembler amdgpuinfo amdgputargetmca amdgpuutils analysis arm armasmparser armcodegen armdesc armdisassembler arminfo armutils asmparser asmprinter avr avrasmparser avrcodegen avrdesc avrdisassembler avrinfo binaryformat bitreader bitstreamreader bitwriter bpf bpfasmparser bpfcodegen bpfdesc bpfdisassembler bpfinfo cfguard codegen codegendata codegentypes core coroutines coverage debuginfobtf debuginfocodeview debuginfodwarf debuginfogsym debuginfologicalview debuginfomsf debuginfopdb demangle dlltooldriver dwarflinker dwarflinkerclassic dwarflinkerparallel dwp engine executionengine extensions filecheck frontenddriver frontendhlsl frontendoffloading frontendopenacc frontendopenmp fuzzercli fuzzmutate globalisel hexagon hexagonasmparser hexagoncodegen hexagondesc hexagondisassembler hexagoninfo hipstdpar instcombine instrumentation interfacestub interpreter ipo irprinter irreader jitlink lanai lanaiasmparser lanaicodegen lanaidesc lanaidisassembler lanaiinfo libdriver lineeditor linker loongarch loongarchasmparser loongarchcodegen loongarchdesc loongarchdisassembler loongarchinfo lto m68k m68kasmparser m68kcodegen m68kdesc m68kdisassembler m68kinfo mc mca mcdisassembler mcjit mcparser mips mipsasmparser mipscodegen mipsdesc mipsdisassembler mipsinfo mirparser msp430 msp430asmparser msp430codegen msp430desc msp430disassembler msp430info native nativecodegen nvptx nvptxcodegen nvptxdesc nvptxinfo objcarcopts objcopy object objectyaml option orcdebugging orcjit orcshared orctargetprocess passes perfjitevents powerpc powerpcasmparser powerpccodegen powerpcdesc powerpcdisassembler powerpcinfo profiledata remarks riscv riscvasmparser riscvcodegen riscvdesc riscvdisassembler riscvinfo riscvtargetmca runtimedyld sandboxir scalaropts selectiondag sparc sparcasmparser sparccodegen sparcdesc sparcdisassembler sparcinfo support symbolize systemz systemzasmparser systemzcodegen systemzdesc systemzdisassembler systemzinfo tablegen target targetparser textapi textapibinaryreader transformutils ve veasmparser vecodegen vectorize vedesc vedisassembler veinfo webassembly webassemblyasmparser webassemblycodegen webassemblydesc webassemblydisassembler webassemblyinfo webassemblyutils windowsdriver windowsmanifest x86 x86asmparser x86codegen x86desc x86disassembler x86info x86targetmca xcore xcorecodegen xcoredesc xcoredisassembler xcoreinfo xray xtensa xtensaasmparser xtensacodegen xtensadesc xtensadisassembler xtensainfo" LLVM_FILECHECK="/usr/lib/llvm-19/bin/FileCheck" NODE="/usr/bin/node" PYTHON="/usr/bin/python3" RUSTC="/checkout/obj/build/aarch64-unknown-linux-gnu/stage2/bin/rustc" RUSTDOC="/checkout/obj/build/aarch64-unknown-linux-gnu/stage2/bin/rustdoc" SOURCE_ROOT="/checkout" TARGET="aarch64-unknown-linux-gnu" TARGET_EXE_DYLIB_PATH="/checkout/obj/build/aarch64-unknown-linux-gnu/stage2/lib/rustlib/aarch64-unknown-linux-gnu/lib" __RUSTC_DEBUG_ASSERTIONS_ENABLED="1" __STD_DEBUG_ASSERTIONS_ENABLED="1" "/checkout/obj/build/aarch64-unknown-linux-gnu/test/run-make/rustdoc-determinism/rmake"
stdout: none
--- stderr -------------------------------

thread 'main' panicked at /checkout/src/tools/run-make-support/src/diff/mod.rs:47:23:
the file in path "/checkout/obj/build/aarch64-unknown-linux-gnu/test/run-make/rustdoc-determinism/rmake_out/foo_first/search-index.js" could not be read into a String: Os { code: 2, kind: NotFound, message: "No such file or directory" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
------------------------------------------



Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-CI Area: Our Github Actions CI A-rustdoc-search Area: Rustdoc's search feature A-testsuite Area: The testsuite used to check the correctness of rustc S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. T-rustdoc-frontend Relevant to the rustdoc-frontend team, which will review and decide on the web UI/UX output.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rustdoc search is excruciatingly slow on very large crates
3 participants