impr (cli + hql): fixing upserts, updates, and unique edges + improving CLI init command by xav-db · Pull Request #880 · HelixDB/helix-db

xav-db · 2026-03-03T17:48:44Z

Greptile Summary

This PR fixes three core correctness issues — unique edge enforcement, update-from-variable codegen, and upsert default-value handling — and improves the CLI init command.

Key changes:

add_e.rs: Rewrites unique-edge enforcement from the overly strict PutFlags::NO_OVERWRITE (which blocked any second same-label edge from a node regardless of target) to an explicit scan of existing out_edges_db entries. The new logic correctly enforces uniqueness only on the (from, label, to) triple. Tests confirm multiple targets and sources are allowed.
upsert.rs / traversal_steps.rs / traversal_validation.rs: Adds upsert_*_with_defaults variants that apply schema-defined DEFAULT values only during creation; updates correctly ignore them. The code generator now emits upsert_*_with_defaults(...) calls with defaults extracted from schema at analysis time. TraversalType::Update refactored to carry source/source_is_plural, enabling direct variable-to-update codegen without an intermediate read-only traversal.
init.rs: Changes .gitignore creation from unconditional overwrite to append with line deduplication. However, two bugs remain: (1) unconditional cleanup tracking even when file pre-existed, risking deletion of user content on failure; (2) no guard newline when appending to files lacking a trailing newline, potentially corrupting the last entry.
Test additions: New unit tests for unique-edge semantics, upsert_*_with_defaults behaviour, and two HQL integration test suites (update_from_var, user_test_13).

Findings:

.gitignore append logic has two correctness issues that can result in data loss or file corruption.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["add_edge called"] --> B{is_unique flag set?}
    B -->|Yes| C["Scan out_edges_db for from+label key"]
    C --> D{Duplicate to_node found?}
    D -->|Yes| E["Return DuplicateKey error"]
    D -->|No| F["Allocate Edge with v6 UUID"]
    B -->|No| F
    F --> G["Write to edges_db with APPEND"]
    G --> H["Write to out_edges_db with APPEND_DUP"]
    H --> I["Write to in_edges_db with APPEND_DUP"]
    I --> J["Return Ok with Edge value"]

    K["upsert_n_with_defaults called"] --> L{Iterator yields item?}
    L -->|Node exists| M["Update path: apply props only\nignore create_defaults"]
    L -->|Empty iterator| N["Create path: merge_create_props\nprops take priority over defaults"]
    L -->|Error| O["Propagate error"]
    N --> P["Insert new node with merged props"]
    M --> Q["Persist updated node"]

    R["UPDATE from variable"] --> S{Source variable type?}
    S -->|Single node var| T["G::new_mut_from with cloned var"]
    S -->|Collection var| U["G::new_mut_from_iter with cloned iter"]
    S -->|Inline traversal| V["Read-only traversal first then write"]
    T --> W[".update then collect_to_obj"]
    U --> W
    V --> W

_{Last reviewed commit: 4e46111}

Fixes #864

Fixes #864 ## Description  ## Related Issues  Closes # ## Checklist when merging to main  - [ ] No compiler warnings (if applicable) - [ ] Code is formatted with `rustfmt` - [ ] No useless or dead code (if applicable) - [ ] Code is easy to understand - [ ] Doc comments are used for all functions, enums, structs, and fields (where appropriate) - [ ] All tests pass - [ ] Performance has not regressed (assuming change was not to fix a bug) - [ ] Version number has been updated in `helix-cli/Cargo.toml` and `helixdb/Cargo.toml` ## Additional Notes   <h3>Greptile Summary</h3> This PR successfully fixes bug #864 where `UpsertN` (and `UpsertE`/`UpsertV`) was ignoring schema-defined `DEFAULT` values during node creation and overwriting them during updates. **Key changes:** - `upsert.rs`: Adds `merge_create_props()` helper (explicit props win, defaults fill in the rest), and reworks `upsert_n/e/v` to delegate to new `_with_defaults` variants. Create path uses the merged props; update path correctly ignores `create_defaults`. - `traversal_validation.rs`: Analyzer now collects schema `DEFAULT` fields that are **not** explicitly specified in the upsert statement and passes them as `create_defaults` to the `TraversalType` variants. - `traversal_steps.rs`: `TraversalType::UpsertN/E/V` gains a `create_defaults` field; code generation emits `upsert_n_with_defaults` / `upsert_e_with_defaults` / `upsert_v_with_defaults` calls. - `upsert_tests.rs`: Comprehensive test coverage added for all three upsert types (`upsert_n`, `upsert_e`, `upsert_v`) with their `_with_defaults` variants. Tests verify (a) defaults are applied on creation with explicit values overriding defaults, and (b) defaults are **not** applied when updating an existing node/edge/vector. The implementation is consistent across all three upsert operations and correctly handles both create and update paths. <sub>Last reviewed commit: ae1eb10</sub>

<h3>Greptile Summary</h3> This PR successfully fixes a bug where `UPDATE` operations applied to a previously-assigned variable (e.g., `person <- N<Person>(id)` followed by `updated <- person::UPDATE({...})`) failed to track the source variable, causing the code generator to ignore the binding and fall back to an incorrect inline traversal. **Key Changes:** - **`TraversalType::Update`** is refactored from a simple tuple variant into a named-field struct carrying: - `source`: the bound variable reference (or `None` for inline updates) - `source_is_plural`: boolean flag for single vs. collection - `properties`: the update properties - **Validation Logic** (`traversal_validation.rs`): Before overwriting the traversal type, the source variable is extracted from the existing `FromSingle`/`FromIter` traversal type, ensuring it isn't lost. - **Code Generation** (`traversal_steps.rs`): When a source exists, the generator now correctly emits: - `G::new_mut_from` for singular source variables - `G::new_mut_from_iter` for plural (collection) sources - Falls back to inline traversal (with read-then-write) when source is `None` - **Test Coverage**: New `update_from_var` test suite covers single-node update from variable, filtered edge update from variable, and inline update regression check. User-reported test case `user_test_13` is added to capture the original bug scenario. The refactoring makes `Update` consistent with existing variants like `Upsert`, `UpsertN`, `UpsertE`, and `UpsertV` which already use named-field structs for similar purposes. <details><summary><h3>Flowchart</h3></summary> ```mermaid %%{init: {'theme': 'neutral'}}%% flowchart TD A["StepType::Update encountered in validate_traversal"] --> B["Extract source variable from current traversal type"] B --> C{Current TraversalType?} C -->|FromSingle| D["source = Some(var), source_is_plural = false"] C -->|FromIter| E["source = Some(var), source_is_plural = true"] C -->|Other| F["source = None"] D --> G["Set TraversalType::Update struct with source + source_is_plural + properties"] E --> G F --> G G --> H["Code Generation via Display impl"] H --> I{source is Some?} I -->|Yes, plural| J["G::new_mut_from_iter - db, txn, var.iter().cloned(), arena"] I -->|Yes, singular| K["G::new_mut_from - db, txn, var.clone(), arena"] I -->|No - inline| L["Block: collect read traversal, then G::new_mut_from_iter"] J --> M[".update properties .collect_to_obj?"] K --> M L --> M ``` </details> <sub>Last reviewed commit: b5d45ed</sub>

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

## Description when we initialization helix with `helix init` in a folder that contains a .gitignore with some content that got wriped and overwritten. this PR fixes that ## Related Issues Closes #876 ## Checklist when merging to main  - [ ] No compiler warnings (if applicable) - [ ] Code is formatted with `rustfmt` - [ ] No useless or dead code (if applicable) - [ ] Code is easy to understand - [ ] Doc comments are used for all functions, enums, structs, and fields (where appropriate) - [ ] All tests pass - [ ] Performance has not regressed (assuming change was not to fix a bug) - [ ] Version number has been updated in `helix-cli/Cargo.toml` and `helixdb/Cargo.toml` ## Additional Notes   <h3>Greptile Summary</h3> Changed `.gitignore` handling from overwrite to append mode to preserve existing content when running `helix init`. - Reads existing `.gitignore` content before adding new entries - Only appends lines that don't already exist - Fixes issue #876 where existing `.gitignore` was being wiped **Issue found**: The duplicate check uses substring matching (`existing.contains(line)`) which can cause false positives - e.g., if `.gitignore` contains `my-target/`, the check for `target/` will incorrectly skip adding it. <details><summary><h3>Important Files Changed</h3></summary> | Filename | Overview | |----------|----------| | helix-cli/src/commands/init.rs | Fixed gitignore overwriting, but substring matching logic has edge case bugs | </details> </details> <sub>Last reviewed commit: 3e51c60</sub>

greptile-apps

_{15 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-03-03T17:55:08Z

helix-cli/src/commands/init.rs

-    fs::write(&gitignore_path, gitignore)?;
+    let existing = fs::read_to_string(&gitignore_path).unwrap_or_default();
+    let mut file = fs::OpenOptions::new()
+        .create(true)


Cleanup tracker deletes pre-existing .gitignore on failure

cleanup_tracker.track_file(gitignore_path) is called unconditionally (line 359), regardless of whether the .gitignore file existed before init ran. Previously this was fine because fs::write always created the file fresh. Now that we append to an existing file, if the init command fails at any later step and triggers cleanup, the tracker will delete a .gitignore that the user already had — along with all their existing content.

The tracking should only happen when the file is newly created:

let file_existed = gitignore_path.exists(); let existing = fs::read_to_string(&gitignore_path).unwrap_or_default(); let mut file = fs::OpenOptions::new() .create(true) .append(true) .open(&gitignore_path)?; for line in gitignore.lines() { let trimmed = line.trim(); if !existing.lines().any(|l| l.trim() == trimmed) { writeln!(file, "{}", line)?; } } if !file_existed { cleanup_tracker.track_file(gitignore_path); }

greptile-apps · 2026-03-03T17:55:09Z

helix-cli/src/commands/init.rs

+    let existing = fs::read_to_string(&gitignore_path).unwrap_or_default();
+    let mut file = fs::OpenOptions::new()
+        .create(true)
+        .append(true)
+        .open(&gitignore_path)?;
+    for line in gitignore.lines() {
+        let trimmed = line.trim();
+        if !existing.lines().any(|l| l.trim() == trimmed) {
+            writeln!(file, "{}", line)?;
+        }


Appending without a guard newline may corrupt the last line of an existing .gitignore

If the existing .gitignore does not end with a newline (which is technically valid), the first writeln! call (line 356) will concatenate its content onto the last line of the existing file. For example, if the file ends with node_modules (no trailing \n), the result would be node_modules.helix/ instead of two separate entries.

Consider writing a leading newline when the existing file is non-empty and does not already end with one:

let existing = fs::read_to_string(&gitignore_path).unwrap_or_default(); let mut file = fs::OpenOptions::new() .create(true) .append(true) .open(&gitignore_path)?; // Ensure we start on a fresh line if !existing.is_empty() && !existing.ends_with('\n') { writeln!(file)?; } for line in gitignore.lines() { let trimmed = line.trim(); if !existing.lines().any(|l| l.trim() == trimmed) { writeln!(file, "{}", line)?; } }

- Refactor the logic to append new entries to the .gitignore file instead of overwriting existing content. - Ensure that a newline is added before appending new entries if the file already contains content. - Add tests to verify that existing entries are not duplicated and that the .gitignore file is preserved during initialization failures. This change addresses the issue of losing existing .gitignore content when running `helix init`, improving the overall user experience.

Make BM25 explicitly node-only by removing vector indexing, rebuild schema v2 from node state on startup, and fail loudly when forward, reverse, and doc-length state diverge.

<h3>Greptile Summary</h3> This PR hardens the BM25 full-text search subsystem by introducing a **reverse index** (`doc_id → list of (term, tf)`) alongside the existing inverted index, enabling correct and efficient delete/update operations without re-scanning the inverted index. It also adds **schema versioning** so that on startup, if the stored index was built without the reverse index, the migration layer automatically clears and rebuilds the entire BM25 index from `nodes_db`. The `add_n`, `update`, and `upsert` traversal operators are updated to maintain BM25 invariants as nodes are created or changed. Key changes: - **`bm25.rs`**: Adds `ReversePostingEntry`, `DocPresence`/`DocState` enums, `reverse_index_db` (DUP_SORT), schema version read/write, and refactored `insert_doc` / `delete_doc` / `update_doc` that enforce strict consistency invariants (returns errors on index corruption). - **`storage_migration.rs`**: `migrate_bm25` clears and rebuilds the BM25 index in 1 024-node batches when the stored schema version doesn't match `BM25_SCHEMA_VERSION = 2`. - **`add_n.rs`**: `bm25.insert_doc` is called after writing the node — but without a guard that checks whether the `nodes_db` write succeeded, allowing BM25 errors to mask prior errors. - **`update.rs` / `upsert.rs`**: BM25 is correctly updated before (update) or after (upsert new node) the `nodes_db` write, with proper error propagation. - Comprehensive new tests cover the reverse index, schema migration, update-to-searchable, and invariant enforcement. <details><summary><h3>Important Files Changed</h3></summary> | Filename | Overview | |----------|----------| | helix-db/src/helix_engine/bm25/bm25.rs | Core BM25 rewrite adding a reverse index (doc_id → terms) and schema versioning. Logic is sound overall; minor duplication between reverse_entries and reverse_entries_rw. | | helix-db/src/helix_engine/traversal_core/ops/source/add_n.rs | BM25 insert_doc is called unconditionally after nodes_db write, even on prior failure — can mask the original error if BM25 also fails. | | helix-db/src/helix_engine/storage_core/storage_migration.rs | Adds migrate_bm25 that clears and rebuilds the BM25 index from nodes_db. Holding read_txn alive across batch write transactions can cause LMDB freelist growth; nodes without properties are silently skipped (consistent with add_n but deserves a comment). | | helix-db/src/helix_engine/traversal_core/ops/util/update.rs | Correctly calls bm25.update_doc before nodes_db write; BM25 error short-circuits the node save, and both are in the same transaction so rollback is safe. | | helix-db/src/helix_engine/traversal_core/ops/util/upsert.rs | BM25 insert_doc (new node) and update_doc (existing node) are correctly integrated into upsert_n; error propagation follows the existing ? pattern. | </details> </details> <details><summary><h3>Sequence Diagram</h3></summary> ```mermaid sequenceDiagram participant Caller participant migrate_bm25 participant BM25Index participant nodes_db participant add_n / update / upsert Note over Caller, BM25Index: Startup Migration Path Caller->>migrate_bm25: migrate(storage) migrate_bm25->>BM25Index: schema_version(read_txn) alt version matches BM25_SCHEMA_VERSION BM25Index-->>migrate_bm25: Some(2) — skip else outdated or missing migrate_bm25->>BM25Index: clear_all(write_txn) migrate_bm25->>nodes_db: iter(read_txn) — batch by 1024 loop Each batch migrate_bm25->>BM25Index: insert_doc per node with properties (write_txn) end migrate_bm25->>BM25Index: write_schema_version(2, write_txn) end Note over Caller, BM25Index: Normal Write Path Caller->>add_n / update / upsert: write op (RwTxn) add_n / update / upsert->>nodes_db: put node add_n / update / upsert->>BM25Index: insert_doc / update_doc / delete_doc BM25Index->>BM25Index: update inverted_index_db (term→postings) BM25Index->>BM25Index: update reverse_index_db (doc_id→terms) BM25Index->>BM25Index: update doc_lengths_db BM25Index->>BM25Index: update term_frequencies_db BM25Index->>BM25Index: update metadata (total_docs, avgdl) add_n / update / upsert-->>Caller: Result<TraversalValue> ``` </details>  <details><summary><h3>Comments Outside Diff (1)</h3></summary> 1. `helix-db/src/helix_engine/traversal_core/ops/source/add_n.rs`, line 141-148 ([link](https://github.com/helixdb/helix-db/blob/8d28406c1ec70c56ba3100205080a9bb7a9ea17c/helix-db/src/helix_engine/traversal_core/ops/source/add_n.rs#L141-L148)) **BM25 insert runs unconditionally, can mask prior errors** The BM25 `insert_doc` call (lines 141-148) runs even when a previous operation failed (secondary-index insertion or `nodes_db.put_with_flags`). If `bm25.insert_doc` then also fails, it silently overwrites the original error stored in `result`, making the caller receive a BM25 error instead of the real `nodes_db` or secondary-index error. Adding an early-exit guard fixes both the error masking and the wasted BM25 work: </details>  <sub>Last reviewed commit: 8d28406</sub>

<h3>Greptile Summary</h3> This PR fixes sync reconciliation so that the correct remote queries path (read from the cluster's snapshot `helix.toml`) is used instead of always falling back to the local project path. It refactors several sync functions to accept explicit `current_queries_dir` / `target_queries_dir` arguments, adds enterprise-cluster parity to all sync flows, extracts a reusable `build_standard_deploy_payload` helper (now including `runtime_config`), and adds meaningful unit-test coverage for the new merge/extraction helpers. **Key issues:** - `pull_remote_snapshot_into_local` handles directory migration when the remote queries path differs from the local one, but the migration code path deletes files from `current_queries_dir` **before** writing to `target_queries_dir`, risking data loss if a write fails mid-way (the same-dir code path retains the safe write-then-delete order from the original implementation). - `insert_unique_cloud_instance_name` and `insert_unique_enterprise_instance_name` introduce redundant double-clones that can be simplified. <details><summary><h3>Sequence Diagram</h3></summary> ```mermaid sequenceDiagram participant CLI participant Cloud as Helix Cloud API participant FS as Local Filesystem Note over CLI: sync (project flow) CLI->>Cloud: fetch_project_clusters() CLI->>Cloud: fetch_standard/enterprise_cluster_snapshot_config(selected_cluster) Cloud-->>CLI: HelixConfig (remote helix.toml) CLI->>CLI: resolve_selected_project_queries_path(snapshot) alt Standard cluster CLI->>Cloud: fetch_sync_response(cluster_id) Cloud-->>CLI: SyncResponse (.hx files + helix_toml) CLI->>CLI: compare_manifests(local, remote) alt Pull CLI->>FS: pull_remote_snapshot_into_local(current_dir, target_dir, ...) CLI->>FS: update_project_queries_path_in_helix_toml() else Push CLI->>Cloud: push_local_snapshot_to_cluster() end else Enterprise cluster CLI->>Cloud: fetch_enterprise_sync_response(cluster_id) Cloud-->>CLI: EnterpriseSyncResponse (.rs files + helix_toml) CLI->>FS: write .rs files to target_queries_dir CLI->>FS: update_project_queries_path_in_helix_toml() end CLI->>Cloud: fetch_project_clusters_for_[standard|enterprise]_cluster() Cloud-->>CLI: CliProjectClusters CLI->>Cloud: fetch_[standard|enterprise]_cluster_snapshot_configs(all clusters) Cloud-->>CLI: snapshot HelixConfigs CLI->>FS: reconcile_project_config_from_cloud() → write helix.toml ``` </details> <sub>Last reviewed commit: c6b0d52</sub> > Greptile also left **1 inline comment** on this PR.

xav-db and others added 11 commits February 13, 2026 12:09

fixing update issues not tracking source variable

b5d45ed

[Bug]: UpsertN ignores and overwrites default values.

ae1eb10

Fixes #864

fix

3e51c60

Update helix-cli/src/commands/init.rs

6574ff9

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

ref

13e4d64

Issue with UNIQUE edges

2d33957

updating versions

ca02338

reverting versions

88c8af9

greptile-apps bot reviewed Mar 3, 2026

View reviewed changes

xav-db and others added 7 commits March 4, 2026 17:54

fix sync reconciliation for remote queries paths

c6b0d52

rebuild BM25 from nodes and harden index invariants

8d28406

Make BM25 explicitly node-only by removing vector indexing, rebuild schema v2 from node state on startup, and fail loudly when forward, reverse, and doc-length state diverge.

greptile fix and clippy fix

a243265

updating versions

7e7356a

xav-db merged commit 960c46d into main Mar 8, 2026
28 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

impr (cli + hql): fixing upserts, updates, and unique edges + improving CLI init command#880

impr (cli + hql): fixing upserts, updates, and unique edges + improving CLI init command#880
xav-db merged 18 commits intomainfrom
dev

xav-db commented Mar 3, 2026 •

edited by greptile-apps bot

Loading

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot Mar 3, 2026

Uh oh!

greptile-apps bot Mar 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

xav-db commented Mar 3, 2026 • edited by greptile-apps bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Flowchart

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xav-db commented Mar 3, 2026 •

edited by greptile-apps bot

Loading