Conversation
Fixes #864 ## Description <!-- Provide a brief description of the changes in this PR --> ## Related Issues <!-- Link to any related issues using #issue_number --> Closes # ## Checklist when merging to main <!-- Mark items with "x" when completed --> - [ ] No compiler warnings (if applicable) - [ ] Code is formatted with `rustfmt` - [ ] No useless or dead code (if applicable) - [ ] Code is easy to understand - [ ] Doc comments are used for all functions, enums, structs, and fields (where appropriate) - [ ] All tests pass - [ ] Performance has not regressed (assuming change was not to fix a bug) - [ ] Version number has been updated in `helix-cli/Cargo.toml` and `helixdb/Cargo.toml` ## Additional Notes <!-- Add any additional information that would be helpful for reviewers --> <!-- greptile_comment --> <h3>Greptile Summary</h3> This PR successfully fixes bug #864 where `UpsertN` (and `UpsertE`/`UpsertV`) was ignoring schema-defined `DEFAULT` values during node creation and overwriting them during updates. **Key changes:** - `upsert.rs`: Adds `merge_create_props()` helper (explicit props win, defaults fill in the rest), and reworks `upsert_n/e/v` to delegate to new `_with_defaults` variants. Create path uses the merged props; update path correctly ignores `create_defaults`. - `traversal_validation.rs`: Analyzer now collects schema `DEFAULT` fields that are **not** explicitly specified in the upsert statement and passes them as `create_defaults` to the `TraversalType` variants. - `traversal_steps.rs`: `TraversalType::UpsertN/E/V` gains a `create_defaults` field; code generation emits `upsert_n_with_defaults` / `upsert_e_with_defaults` / `upsert_v_with_defaults` calls. - `upsert_tests.rs`: Comprehensive test coverage added for all three upsert types (`upsert_n`, `upsert_e`, `upsert_v`) with their `_with_defaults` variants. Tests verify (a) defaults are applied on creation with explicit values overriding defaults, and (b) defaults are **not** applied when updating an existing node/edge/vector. The implementation is consistent across all three upsert operations and correctly handles both create and update paths. <sub>Last reviewed commit: ae1eb10</sub> <!-- greptile_other_comments_section --> <!-- /greptile_comment -->
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR successfully fixes a bug where `UPDATE` operations applied to a
previously-assigned variable (e.g., `person <- N<Person>(id)` followed
by `updated <- person::UPDATE({...})`) failed to track the source
variable, causing the code generator to ignore the binding and fall back
to an incorrect inline traversal.
**Key Changes:**
- **`TraversalType::Update`** is refactored from a simple tuple variant
into a named-field struct carrying:
- `source`: the bound variable reference (or `None` for inline updates)
- `source_is_plural`: boolean flag for single vs. collection
- `properties`: the update properties
- **Validation Logic** (`traversal_validation.rs`): Before overwriting
the traversal type, the source variable is extracted from the existing
`FromSingle`/`FromIter` traversal type, ensuring it isn't lost.
- **Code Generation** (`traversal_steps.rs`): When a source exists, the
generator now correctly emits:
- `G::new_mut_from` for singular source variables
- `G::new_mut_from_iter` for plural (collection) sources
- Falls back to inline traversal (with read-then-write) when source is
`None`
- **Test Coverage**: New `update_from_var` test suite covers single-node
update from variable, filtered edge update from variable, and inline
update regression check. User-reported test case `user_test_13` is added
to capture the original bug scenario.
The refactoring makes `Update` consistent with existing variants like
`Upsert`, `UpsertN`, `UpsertE`, and `UpsertV` which already use
named-field structs for similar purposes.
<details><summary><h3>Flowchart</h3></summary>
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["StepType::Update encountered in validate_traversal"] --> B["Extract source variable from current traversal type"]
B --> C{Current TraversalType?}
C -->|FromSingle| D["source = Some(var), source_is_plural = false"]
C -->|FromIter| E["source = Some(var), source_is_plural = true"]
C -->|Other| F["source = None"]
D --> G["Set TraversalType::Update struct with source + source_is_plural + properties"]
E --> G
F --> G
G --> H["Code Generation via Display impl"]
H --> I{source is Some?}
I -->|Yes, plural| J["G::new_mut_from_iter - db, txn, var.iter().cloned(), arena"]
I -->|Yes, singular| K["G::new_mut_from - db, txn, var.clone(), arena"]
I -->|No - inline| L["Block: collect read traversal, then G::new_mut_from_iter"]
J --> M[".update properties .collect_to_obj?"]
K --> M
L --> M
```
</details>
<sub>Last reviewed commit: b5d45ed</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
## Description when we initialization helix with `helix init` in a folder that contains a .gitignore with some content that got wriped and overwritten. this PR fixes that ## Related Issues Closes #876 ## Checklist when merging to main <!-- Mark items with "x" when completed --> - [ ] No compiler warnings (if applicable) - [ ] Code is formatted with `rustfmt` - [ ] No useless or dead code (if applicable) - [ ] Code is easy to understand - [ ] Doc comments are used for all functions, enums, structs, and fields (where appropriate) - [ ] All tests pass - [ ] Performance has not regressed (assuming change was not to fix a bug) - [ ] Version number has been updated in `helix-cli/Cargo.toml` and `helixdb/Cargo.toml` ## Additional Notes <!-- Add any additional information that would be helpful for reviewers --> <!-- greptile_comment --> <h3>Greptile Summary</h3> Changed `.gitignore` handling from overwrite to append mode to preserve existing content when running `helix init`. - Reads existing `.gitignore` content before adding new entries - Only appends lines that don't already exist - Fixes issue #876 where existing `.gitignore` was being wiped **Issue found**: The duplicate check uses substring matching (`existing.contains(line)`) which can cause false positives - e.g., if `.gitignore` contains `my-target/`, the check for `target/` will incorrectly skip adding it. <details><summary><h3>Important Files Changed</h3></summary> | Filename | Overview | |----------|----------| | helix-cli/src/commands/init.rs | Fixed gitignore overwriting, but substring matching logic has edge case bugs | </details> </details> <sub>Last reviewed commit: 3e51c60</sub> <!-- greptile_other_comments_section --> <!-- /greptile_comment -->
helix-cli/src/commands/init.rs
Outdated
| fs::write(&gitignore_path, gitignore)?; | ||
| let existing = fs::read_to_string(&gitignore_path).unwrap_or_default(); | ||
| let mut file = fs::OpenOptions::new() | ||
| .create(true) |
There was a problem hiding this comment.
Cleanup tracker deletes pre-existing .gitignore on failure
cleanup_tracker.track_file(gitignore_path) is called unconditionally (line 359), regardless of whether the .gitignore file existed before init ran. Previously this was fine because fs::write always created the file fresh. Now that we append to an existing file, if the init command fails at any later step and triggers cleanup, the tracker will delete a .gitignore that the user already had — along with all their existing content.
The tracking should only happen when the file is newly created:
let file_existed = gitignore_path.exists();
let existing = fs::read_to_string(&gitignore_path).unwrap_or_default();
let mut file = fs::OpenOptions::new()
.create(true)
.append(true)
.open(&gitignore_path)?;
for line in gitignore.lines() {
let trimmed = line.trim();
if !existing.lines().any(|l| l.trim() == trimmed) {
writeln!(file, "{}", line)?;
}
}
if !file_existed {
cleanup_tracker.track_file(gitignore_path);
}| let existing = fs::read_to_string(&gitignore_path).unwrap_or_default(); | ||
| let mut file = fs::OpenOptions::new() | ||
| .create(true) | ||
| .append(true) | ||
| .open(&gitignore_path)?; | ||
| for line in gitignore.lines() { | ||
| let trimmed = line.trim(); | ||
| if !existing.lines().any(|l| l.trim() == trimmed) { | ||
| writeln!(file, "{}", line)?; | ||
| } |
There was a problem hiding this comment.
Appending without a guard newline may corrupt the last line of an existing .gitignore
If the existing .gitignore does not end with a newline (which is technically valid), the first writeln! call (line 356) will concatenate its content onto the last line of the existing file. For example, if the file ends with node_modules (no trailing \n), the result would be node_modules.helix/ instead of two separate entries.
Consider writing a leading newline when the existing file is non-empty and does not already end with one:
let existing = fs::read_to_string(&gitignore_path).unwrap_or_default();
let mut file = fs::OpenOptions::new()
.create(true)
.append(true)
.open(&gitignore_path)?;
// Ensure we start on a fresh line
if !existing.is_empty() && !existing.ends_with('\n') {
writeln!(file)?;
}
for line in gitignore.lines() {
let trimmed = line.trim();
if !existing.lines().any(|l| l.trim() == trimmed) {
writeln!(file, "{}", line)?;
}
}- Refactor the logic to append new entries to the .gitignore file instead of overwriting existing content. - Ensure that a newline is added before appending new entries if the file already contains content. - Add tests to verify that existing entries are not duplicated and that the .gitignore file is preserved during initialization failures. This change addresses the issue of losing existing .gitignore content when running `helix init`, improving the overall user experience.
Make BM25 explicitly node-only by removing vector indexing, rebuild schema v2 from node state on startup, and fail loudly when forward, reverse, and doc-length state diverge.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR hardens the BM25 full-text search subsystem by introducing a
**reverse index** (`doc_id → list of (term, tf)`) alongside the existing
inverted index, enabling correct and efficient delete/update operations
without re-scanning the inverted index. It also adds **schema
versioning** so that on startup, if the stored index was built without
the reverse index, the migration layer automatically clears and rebuilds
the entire BM25 index from `nodes_db`. The `add_n`, `update`, and
`upsert` traversal operators are updated to maintain BM25 invariants as
nodes are created or changed.
Key changes:
- **`bm25.rs`**: Adds `ReversePostingEntry`, `DocPresence`/`DocState`
enums, `reverse_index_db` (DUP_SORT), schema version read/write, and
refactored `insert_doc` / `delete_doc` / `update_doc` that enforce
strict consistency invariants (returns errors on index corruption).
- **`storage_migration.rs`**: `migrate_bm25` clears and rebuilds the
BM25 index in 1 024-node batches when the stored schema version doesn't
match `BM25_SCHEMA_VERSION = 2`.
- **`add_n.rs`**: `bm25.insert_doc` is called after writing the node —
but without a guard that checks whether the `nodes_db` write succeeded,
allowing BM25 errors to mask prior errors.
- **`update.rs` / `upsert.rs`**: BM25 is correctly updated before
(update) or after (upsert new node) the `nodes_db` write, with proper
error propagation.
- Comprehensive new tests cover the reverse index, schema migration,
update-to-searchable, and invariant enforcement.
<details><summary><h3>Important Files Changed</h3></summary>
| Filename | Overview |
|----------|----------|
| helix-db/src/helix_engine/bm25/bm25.rs | Core BM25 rewrite adding a
reverse index (doc_id → terms) and schema versioning. Logic is sound
overall; minor duplication between reverse_entries and
reverse_entries_rw. |
| helix-db/src/helix_engine/traversal_core/ops/source/add_n.rs | BM25
insert_doc is called unconditionally after nodes_db write, even on prior
failure — can mask the original error if BM25 also fails. |
| helix-db/src/helix_engine/storage_core/storage_migration.rs | Adds
migrate_bm25 that clears and rebuilds the BM25 index from nodes_db.
Holding read_txn alive across batch write transactions can cause LMDB
freelist growth; nodes without properties are silently skipped
(consistent with add_n but deserves a comment). |
| helix-db/src/helix_engine/traversal_core/ops/util/update.rs |
Correctly calls bm25.update_doc before nodes_db write; BM25 error
short-circuits the node save, and both are in the same transaction so
rollback is safe. |
| helix-db/src/helix_engine/traversal_core/ops/util/upsert.rs | BM25
insert_doc (new node) and update_doc (existing node) are correctly
integrated into upsert_n; error propagation follows the existing ?
pattern. |
</details>
</details>
<details><summary><h3>Sequence Diagram</h3></summary>
```mermaid
sequenceDiagram
participant Caller
participant migrate_bm25
participant BM25Index
participant nodes_db
participant add_n / update / upsert
Note over Caller, BM25Index: Startup Migration Path
Caller->>migrate_bm25: migrate(storage)
migrate_bm25->>BM25Index: schema_version(read_txn)
alt version matches BM25_SCHEMA_VERSION
BM25Index-->>migrate_bm25: Some(2) — skip
else outdated or missing
migrate_bm25->>BM25Index: clear_all(write_txn)
migrate_bm25->>nodes_db: iter(read_txn) — batch by 1024
loop Each batch
migrate_bm25->>BM25Index: insert_doc per node with properties (write_txn)
end
migrate_bm25->>BM25Index: write_schema_version(2, write_txn)
end
Note over Caller, BM25Index: Normal Write Path
Caller->>add_n / update / upsert: write op (RwTxn)
add_n / update / upsert->>nodes_db: put node
add_n / update / upsert->>BM25Index: insert_doc / update_doc / delete_doc
BM25Index->>BM25Index: update inverted_index_db (term→postings)
BM25Index->>BM25Index: update reverse_index_db (doc_id→terms)
BM25Index->>BM25Index: update doc_lengths_db
BM25Index->>BM25Index: update term_frequencies_db
BM25Index->>BM25Index: update metadata (total_docs, avgdl)
add_n / update / upsert-->>Caller: Result<TraversalValue>
```
</details>
<!-- greptile_failed_comments -->
<details><summary><h3>Comments Outside Diff (1)</h3></summary>
1. `helix-db/src/helix_engine/traversal_core/ops/source/add_n.rs`, line
141-148
([link](https://github.com/helixdb/helix-db/blob/8d28406c1ec70c56ba3100205080a9bb7a9ea17c/helix-db/src/helix_engine/traversal_core/ops/source/add_n.rs#L141-L148))
**BM25 insert runs unconditionally, can mask prior errors**
The BM25 `insert_doc` call (lines 141-148) runs even when a previous
operation failed (secondary-index insertion or
`nodes_db.put_with_flags`). If `bm25.insert_doc` then also fails, it
silently overwrites the original error stored in `result`, making the
caller receive a BM25 error instead of the real `nodes_db` or
secondary-index error. Adding an early-exit guard fixes both the error
masking and the wasted BM25 work:
</details>
<!-- /greptile_failed_comments -->
<sub>Last reviewed commit: 8d28406</sub>
<!-- /greptile_comment -->
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR fixes sync reconciliation so that the correct remote queries
path (read from the cluster's snapshot `helix.toml`) is used instead of
always falling back to the local project path. It refactors several sync
functions to accept explicit `current_queries_dir` /
`target_queries_dir` arguments, adds enterprise-cluster parity to all
sync flows, extracts a reusable `build_standard_deploy_payload` helper
(now including `runtime_config`), and adds meaningful unit-test coverage
for the new merge/extraction helpers.
**Key issues:**
- `pull_remote_snapshot_into_local` handles directory migration when the
remote queries path differs from the local one, but the migration code
path deletes files from `current_queries_dir` **before** writing to
`target_queries_dir`, risking data loss if a write fails mid-way (the
same-dir code path retains the safe write-then-delete order from the
original implementation).
- `insert_unique_cloud_instance_name` and
`insert_unique_enterprise_instance_name` introduce redundant
double-clones that can be simplified.
<details><summary><h3>Sequence Diagram</h3></summary>
```mermaid
sequenceDiagram
participant CLI
participant Cloud as Helix Cloud API
participant FS as Local Filesystem
Note over CLI: sync (project flow)
CLI->>Cloud: fetch_project_clusters()
CLI->>Cloud: fetch_standard/enterprise_cluster_snapshot_config(selected_cluster)
Cloud-->>CLI: HelixConfig (remote helix.toml)
CLI->>CLI: resolve_selected_project_queries_path(snapshot)
alt Standard cluster
CLI->>Cloud: fetch_sync_response(cluster_id)
Cloud-->>CLI: SyncResponse (.hx files + helix_toml)
CLI->>CLI: compare_manifests(local, remote)
alt Pull
CLI->>FS: pull_remote_snapshot_into_local(current_dir, target_dir, ...)
CLI->>FS: update_project_queries_path_in_helix_toml()
else Push
CLI->>Cloud: push_local_snapshot_to_cluster()
end
else Enterprise cluster
CLI->>Cloud: fetch_enterprise_sync_response(cluster_id)
Cloud-->>CLI: EnterpriseSyncResponse (.rs files + helix_toml)
CLI->>FS: write .rs files to target_queries_dir
CLI->>FS: update_project_queries_path_in_helix_toml()
end
CLI->>Cloud: fetch_project_clusters_for_[standard|enterprise]_cluster()
Cloud-->>CLI: CliProjectClusters
CLI->>Cloud: fetch_[standard|enterprise]_cluster_snapshot_configs(all clusters)
Cloud-->>CLI: snapshot HelixConfigs
CLI->>FS: reconcile_project_config_from_cloud() → write helix.toml
```
</details>
<sub>Last reviewed commit: c6b0d52</sub>
> Greptile also left **1 inline comment** on this PR.
<!-- /greptile_comment -->
Greptile Summary
This PR fixes three core correctness issues — unique edge enforcement, update-from-variable codegen, and upsert default-value handling — and improves the CLI
initcommand.Key changes:
add_e.rs: Rewrites unique-edge enforcement from the overly strictPutFlags::NO_OVERWRITE(which blocked any second same-label edge from a node regardless of target) to an explicit scan of existingout_edges_dbentries. The new logic correctly enforces uniqueness only on the(from, label, to)triple. Tests confirm multiple targets and sources are allowed.upsert.rs/traversal_steps.rs/traversal_validation.rs: Addsupsert_*_with_defaultsvariants that apply schema-definedDEFAULTvalues only during creation; updates correctly ignore them. The code generator now emitsupsert_*_with_defaults(...)calls with defaults extracted from schema at analysis time.TraversalType::Updaterefactored to carrysource/source_is_plural, enabling direct variable-to-update codegen without an intermediate read-only traversal.init.rs: Changes.gitignorecreation from unconditional overwrite to append with line deduplication. However, two bugs remain: (1) unconditional cleanup tracking even when file pre-existed, risking deletion of user content on failure; (2) no guard newline when appending to files lacking a trailing newline, potentially corrupting the last entry.upsert_*_with_defaultsbehaviour, and two HQL integration test suites (update_from_var,user_test_13).Findings:
.gitignoreappend logic has two correctness issues that can result in data loss or file corruption.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A["add_edge called"] --> B{is_unique flag set?} B -->|Yes| C["Scan out_edges_db for from+label key"] C --> D{Duplicate to_node found?} D -->|Yes| E["Return DuplicateKey error"] D -->|No| F["Allocate Edge with v6 UUID"] B -->|No| F F --> G["Write to edges_db with APPEND"] G --> H["Write to out_edges_db with APPEND_DUP"] H --> I["Write to in_edges_db with APPEND_DUP"] I --> J["Return Ok with Edge value"] K["upsert_n_with_defaults called"] --> L{Iterator yields item?} L -->|Node exists| M["Update path: apply props only\nignore create_defaults"] L -->|Empty iterator| N["Create path: merge_create_props\nprops take priority over defaults"] L -->|Error| O["Propagate error"] N --> P["Insert new node with merged props"] M --> Q["Persist updated node"] R["UPDATE from variable"] --> S{Source variable type?} S -->|Single node var| T["G::new_mut_from with cloned var"] S -->|Collection var| U["G::new_mut_from_iter with cloned iter"] S -->|Inline traversal| V["Read-only traversal first then write"] T --> W[".update then collect_to_obj"] U --> W V --> WLast reviewed commit: 4e46111