Commit 75fb8f9: rebuild BM25 from nodes and harden index invariants (#883)
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR hardens the BM25 full-text search subsystem by introducing a
**reverse index** (`doc_id → list of (term, tf)`) alongside the existing
inverted index, enabling correct and efficient delete/update operations
without re-scanning the inverted index. It also adds **schema
versioning** so that on startup, if the stored index was built without
the reverse index, the migration layer automatically clears and rebuilds
the entire BM25 index from `nodes_db`. The `add_n`, `update`, and
`upsert` traversal operators are updated to maintain BM25 invariants as
nodes are created or changed.
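
To make the role of the reverse index concrete, here is a minimal in-memory sketch. It is not the code from this PR: `ToyIndex`, its `HashMap` fields, the `u128` document id, and the whitespace tokenizer are all assumptions for illustration, whereas the real indexes are LMDB databases. The point it shows is that `delete_doc` visits only the terms recorded for the document instead of scanning every posting list.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the two BM25 indexes.
#[derive(Default)]
pub struct ToyIndex {
    /// Inverted index: term -> (doc_id -> term frequency).
    inverted: HashMap<String, HashMap<u128, u32>>,
    /// Reverse index: doc_id -> list of (term, tf).
    reverse: HashMap<u128, Vec<(String, u32)>>,
}

impl ToyIndex {
    /// Indexes a document; inserting an already-indexed id is an error.
    pub fn insert_doc(&mut self, doc_id: u128, text: &str) -> Result<(), String> {
        if self.reverse.contains_key(&doc_id) {
            return Err(format!("doc {doc_id} is already indexed"));
        }
        // Count term frequencies with a naive whitespace tokenizer (assumption).
        let mut tf: HashMap<String, u32> = HashMap::new();
        for term in text.split_whitespace() {
            *tf.entry(term.to_lowercase()).or_insert(0) += 1;
        }
        for (term, count) in &tf {
            self.inverted
                .entry(term.clone())
                .or_default()
                .insert(doc_id, *count);
        }
        self.reverse.insert(doc_id, tf.into_iter().collect());
        Ok(())
    }

    /// Deletes a document by walking its reverse entries only.
    /// A reverse entry with no matching posting is reported as corruption.
    /// (A real implementation would rely on transaction rollback here.)
    pub fn delete_doc(&mut self, doc_id: u128) -> Result<(), String> {
        let entries = self
            .reverse
            .remove(&doc_id)
            .ok_or_else(|| format!("doc {doc_id} is not indexed"))?;
        for (term, _tf) in entries {
            let postings = self
                .inverted
                .get_mut(&term)
                .ok_or_else(|| format!("missing posting list for {term}"))?;
            if postings.remove(&doc_id).is_none() {
                return Err(format!("missing posting for {term} in doc {doc_id}"));
            }
            let now_empty = postings.is_empty();
            if now_empty {
                self.inverted.remove(&term);
            }
        }
        Ok(())
    }

    /// Returns the ids of documents containing `term`, sorted ascending.
    pub fn docs_for(&self, term: &str) -> Vec<u128> {
        let mut ids: Vec<u128> = self
            .inverted
            .get(term)
            .map(|postings| postings.keys().copied().collect())
            .unwrap_or_default();
        ids.sort_unstable();
        ids
    }
}
```

In this sketch an update can be expressed as `delete_doc` followed by `insert_doc`; whether the PR's `update_doc` works that way is not stated in the summary.
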
Key changes:
- **`bm25.rs`**: Adds `ReversePostingEntry`, `DocPresence`/`DocState`
enums, `reverse_index_db` (DUP_SORT), schema version read/write, and
refactored `insert_doc` / `delete_doc` / `update_doc` that enforce
strict consistency invariants and return errors on index corruption.
- **`storage_migration.rs`**: `migrate_bm25` clears and rebuilds the
BM25 index in 1,024-node batches when the stored schema version doesn't
match `BM25_SCHEMA_VERSION = 2`.
- **`add_n.rs`**: `bm25.insert_doc` is called after writing the node,
but without a guard that checks whether the `nodes_db` write succeeded,
so a BM25 error can mask an earlier error.
- **`update.rs` / `upsert.rs`**: BM25 is correctly updated before
(update) or after (upsert new node) the `nodes_db` write, with proper
error propagation.
- Comprehensive new tests cover the reverse index, schema migration,
update-to-searchable, and invariant enforcement.
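
The version check and batched rebuild described for `migrate_bm25` can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `ToyBm25`, the `u64` version type, and the slice of `(id, properties)` pairs are hypothetical stand-ins for the LMDB-backed index and `nodes_db`. Only the version value 2 and the batch size 1,024 come from the summary.

```rust
/// Schema version named in the summary.
pub const BM25_SCHEMA_VERSION: u64 = 2;
/// Batch size named in the summary.
pub const BATCH_SIZE: usize = 1024;

/// Hypothetical in-memory stand-in for the persisted BM25 state.
#[derive(Default)]
pub struct ToyBm25 {
    /// Stored schema version; `None` models an index that never wrote one.
    pub schema_version: Option<u64>,
    /// Ids of the documents currently indexed.
    pub indexed: Vec<u128>,
}

/// Rebuilds the index from `nodes` when the stored version is outdated or
/// missing. Returns the number of batches processed (0 when skipped).
pub fn migrate_bm25(index: &mut ToyBm25, nodes: &[(u128, Option<String>)]) -> usize {
    if index.schema_version == Some(BM25_SCHEMA_VERSION) {
        return 0; // already current: nothing to do
    }
    // Clear everything first so no stale postings survive the rebuild.
    index.indexed.clear();
    let mut batches = 0;
    // In the real migration each batch would get its own write transaction.
    for batch in nodes.chunks(BATCH_SIZE) {
        for (id, props) in batch {
            // Nodes without properties are skipped, as the review notes.
            if props.is_some() {
                index.indexed.push(*id);
            }
        }
        batches += 1;
    }
    // Record the version last, so an interrupted run is retried on restart.
    index.schema_version = Some(BM25_SCHEMA_VERSION);
    batches
}
```

Writing the version only after the last batch is a design choice of this sketch; it makes the rebuild safe to repeat if the process stops partway through.
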
<details><summary><h3>Important Files Changed</h3></summary>
| Filename | Overview |
|----------|----------|
| helix-db/src/helix_engine/bm25/bm25.rs | Core BM25 rewrite adding a reverse index (doc_id → terms) and schema versioning. Logic is sound overall; minor duplication between reverse_entries and reverse_entries_rw. |
| helix-db/src/helix_engine/traversal_core/ops/source/add_n.rs | BM25 insert_doc is called unconditionally after nodes_db write, even on prior failure; this can mask the original error if BM25 also fails. |
| helix-db/src/helix_engine/storage_core/storage_migration.rs | Adds migrate_bm25 that clears and rebuilds the BM25 index from nodes_db. Holding read_txn alive across batch write transactions can cause LMDB freelist growth; nodes without properties are silently skipped (consistent with add_n but deserves a comment). |
| helix-db/src/helix_engine/traversal_core/ops/util/update.rs | Correctly calls bm25.update_doc before nodes_db write; BM25 error short-circuits the node save, and both are in the same transaction so rollback is safe. |
| helix-db/src/helix_engine/traversal_core/ops/util/upsert.rs | BM25 insert_doc (new node) and update_doc (existing node) are correctly integrated into upsert_n; error propagation follows the existing `?` pattern. |
</details>
<details><summary><h3>Sequence Diagram</h3></summary>
```mermaid
sequenceDiagram
participant Caller
participant migrate_bm25
participant BM25Index
participant nodes_db
participant add_n / update / upsert
Note over Caller, BM25Index: Startup Migration Path
Caller->>migrate_bm25: migrate(storage)
migrate_bm25->>BM25Index: schema_version(read_txn)
alt version matches BM25_SCHEMA_VERSION
BM25Index-->>migrate_bm25: Some(2) — skip
else outdated or missing
migrate_bm25->>BM25Index: clear_all(write_txn)
migrate_bm25->>nodes_db: iter(read_txn) — batch by 1024
loop Each batch
migrate_bm25->>BM25Index: insert_doc per node with properties (write_txn)
end
migrate_bm25->>BM25Index: write_schema_version(2, write_txn)
end
Note over Caller, BM25Index: Normal Write Path
Caller->>add_n / update / upsert: write op (RwTxn)
add_n / update / upsert->>nodes_db: put node
add_n / update / upsert->>BM25Index: insert_doc / update_doc / delete_doc
BM25Index->>BM25Index: update inverted_index_db (term→postings)
BM25Index->>BM25Index: update reverse_index_db (doc_id→terms)
BM25Index->>BM25Index: update doc_lengths_db
BM25Index->>BM25Index: update term_frequencies_db
BM25Index->>BM25Index: update metadata (total_docs, avgdl)
add_n / update / upsert-->>Caller: Result<TraversalValue>
```
</details>
<!-- greptile_failed_comments -->
<details><summary><h3>Comments Outside Diff (1)</h3></summary>
1. `helix-db/src/helix_engine/traversal_core/ops/source/add_n.rs`, lines
141-148
([link](https://github.com/helixdb/helix-db/blob/8d28406c1ec70c56ba3100205080a9bb7a9ea17c/helix-db/src/helix_engine/traversal_core/ops/source/add_n.rs#L141-L148))
**BM25 insert runs unconditionally, can mask prior errors**
The BM25 `insert_doc` call (lines 141-148) runs even when a previous
operation failed (secondary-index insertion or
`nodes_db.put_with_flags`). If `bm25.insert_doc` then also fails, it
silently overwrites the original error stored in `result`, making the
caller receive a BM25 error instead of the real `nodes_db` or
secondary-index error. Adding an early-exit guard fixes both the error
masking and the wasted BM25 work.
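
The reviewer's suggested patch is not reproduced here. As a rough sketch of what such a guard could look like, the snippet below uses hypothetical names throughout (`WriteError`, `add_node`, and a closure standing in for `bm25.insert_doc`); none of them are taken from `add_n.rs`. It shows the BM25 step running only when the earlier writes succeeded, so the first error is the one the caller sees.

```rust
/// Hypothetical error type distinguishing the two failure sources.
#[derive(Debug, PartialEq)]
pub enum WriteError {
    NodesDb(String),
    Bm25(String),
}

/// `prior` models the outcome of the secondary-index and `nodes_db` writes;
/// `bm25_insert` models the call to `bm25.insert_doc`.
pub fn add_node(
    prior: Result<(), WriteError>,
    bm25_insert: impl FnOnce() -> Result<(), WriteError>,
) -> Result<(), WriteError> {
    let mut result = prior;
    // Early-exit guard: skip BM25 entirely if an earlier step already failed,
    // so the original error is never overwritten and no BM25 work is wasted.
    if result.is_ok() {
        if let Err(e) = bm25_insert() {
            result = Err(e);
        }
    }
    result
}
```

With this shape, a failed `nodes_db` write returns its own error and the BM25 closure is never invoked; a BM25 failure is reported only when everything before it succeeded.
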
</details>
<!-- /greptile_failed_comments -->
<sub>Last reviewed commit: 8d28406</sub>
<!-- /greptile_comment -->

File tree: 9 files changed, +1193 -179 lines changed

- helix-db/src/helix_engine
  - bm25
  - storage_core
  - tests/traversal_tests
  - traversal_core/ops
    - source
    - util