Skip to content

impr (cli + hql): fixing upserts, updates, and unique edges + improving CLI init command#880

Merged
xav-db merged 18 commits intomainfrom
dev
Mar 8, 2026
Merged

impr (cli + hql): fixing upserts, updates, and unique edges + improving CLI init command#880
xav-db merged 18 commits intomainfrom
dev

Conversation

@xav-db
Copy link
Member

@xav-db xav-db commented Mar 3, 2026

Greptile Summary

This PR fixes three core correctness issues — unique edge enforcement, update-from-variable codegen, and upsert default-value handling — and improves the CLI init command.

Key changes:

  • add_e.rs: Rewrites unique-edge enforcement from the overly strict PutFlags::NO_OVERWRITE (which blocked any second same-label edge from a node regardless of target) to an explicit scan of existing out_edges_db entries. The new logic correctly enforces uniqueness only on the (from, label, to) triple. Tests confirm multiple targets and sources are allowed.
  • upsert.rs / traversal_steps.rs / traversal_validation.rs: Adds upsert_*_with_defaults variants that apply schema-defined DEFAULT values only during creation; updates correctly ignore them. The code generator now emits upsert_*_with_defaults(...) calls with defaults extracted from schema at analysis time. TraversalType::Update refactored to carry source/source_is_plural, enabling direct variable-to-update codegen without an intermediate read-only traversal.
  • init.rs: Changes .gitignore creation from unconditional overwrite to append with line deduplication. However, two bugs remain: (1) unconditional cleanup tracking even when file pre-existed, risking deletion of user content on failure; (2) no guard newline when appending to files lacking a trailing newline, potentially corrupting the last entry.
  • Test additions: New unit tests for unique-edge semantics, upsert_*_with_defaults behaviour, and two HQL integration test suites (update_from_var, user_test_13).

Findings:

  • .gitignore append logic has two correctness issues that can result in data loss or file corruption.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["add_edge called"] --> B{is_unique flag set?}
    B -->|Yes| C["Scan out_edges_db for from+label key"]
    C --> D{Duplicate to_node found?}
    D -->|Yes| E["Return DuplicateKey error"]
    D -->|No| F["Allocate Edge with v6 UUID"]
    B -->|No| F
    F --> G["Write to edges_db with APPEND"]
    G --> H["Write to out_edges_db with APPEND_DUP"]
    H --> I["Write to in_edges_db with APPEND_DUP"]
    I --> J["Return Ok with Edge value"]

    K["upsert_n_with_defaults called"] --> L{Iterator yields item?}
    L -->|Node exists| M["Update path: apply props only\nignore create_defaults"]
    L -->|Empty iterator| N["Create path: merge_create_props\nprops take priority over defaults"]
    L -->|Error| O["Propagate error"]
    N --> P["Insert new node with merged props"]
    M --> Q["Persist updated node"]

    R["UPDATE from variable"] --> S{Source variable type?}
    S -->|Single node var| T["G::new_mut_from with cloned var"]
    S -->|Collection var| U["G::new_mut_from_iter with cloned iter"]
    S -->|Inline traversal| V["Read-only traversal first then write"]
    T --> W[".update then collect_to_obj"]
    U --> W
    V --> W
Loading

Last reviewed commit: 4e46111

xav-db and others added 11 commits February 13, 2026 12:09
Fixes #864

## Description
<!-- Provide a brief description of the changes in this PR -->

## Related Issues
<!-- Link to any related issues using #issue_number -->

Closes #

## Checklist when merging to main
<!-- Mark items with "x" when completed -->

- [ ] No compiler warnings (if applicable)
- [ ] Code is formatted with `rustfmt`
- [ ] No useless or dead code (if applicable)
- [ ] Code is easy to understand
- [ ] Doc comments are used for all functions, enums, structs, and
fields (where appropriate)
- [ ] All tests pass
- [ ] Performance has not regressed (assuming change was not to fix a
bug)
- [ ] Version number has been updated in `helix-cli/Cargo.toml` and
`helixdb/Cargo.toml`

## Additional Notes
<!-- Add any additional information that would be helpful for reviewers
-->

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR successfully fixes bug #864 where `UpsertN` (and
`UpsertE`/`UpsertV`) was ignoring schema-defined `DEFAULT` values during
node creation and overwriting them during updates.

**Key changes:**
- `upsert.rs`: Adds `merge_create_props()` helper (explicit props win,
defaults fill in the rest), and reworks `upsert_n/e/v` to delegate to
new `_with_defaults` variants. Create path uses the merged props; update
path correctly ignores `create_defaults`.
- `traversal_validation.rs`: Analyzer now collects schema `DEFAULT`
fields that are **not** explicitly specified in the upsert statement and
passes them as `create_defaults` to the `TraversalType` variants.
- `traversal_steps.rs`: `TraversalType::UpsertN/E/V` gains a
`create_defaults` field; code generation emits `upsert_n_with_defaults`
/ `upsert_e_with_defaults` / `upsert_v_with_defaults` calls.
- `upsert_tests.rs`: Comprehensive test coverage added for all three
upsert types (`upsert_n`, `upsert_e`, `upsert_v`) with their
`_with_defaults` variants. Tests verify (a) defaults are applied on
creation with explicit values overriding defaults, and (b) defaults are
**not** applied when updating an existing node/edge/vector.

The implementation is consistent across all three upsert operations and
correctly handles both create and update paths.

<sub>Last reviewed commit: ae1eb10</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR successfully fixes a bug where `UPDATE` operations applied to a
previously-assigned variable (e.g., `person <- N<Person>(id)` followed
by `updated <- person::UPDATE({...})`) failed to track the source
variable, causing the code generator to ignore the binding and fall back
to an incorrect inline traversal.

**Key Changes:**
- **`TraversalType::Update`** is refactored from a simple tuple variant
into a named-field struct carrying:
- `source`: the bound variable reference (or `None` for inline updates)
  - `source_is_plural`: boolean flag for single vs. collection
  - `properties`: the update properties
  
- **Validation Logic** (`traversal_validation.rs`): Before overwriting
the traversal type, the source variable is extracted from the existing
`FromSingle`/`FromIter` traversal type, ensuring it isn't lost.

- **Code Generation** (`traversal_steps.rs`): When a source exists, the
generator now correctly emits:
  - `G::new_mut_from` for singular source variables
  - `G::new_mut_from_iter` for plural (collection) sources
- Falls back to inline traversal (with read-then-write) when source is
`None`

- **Test Coverage**: New `update_from_var` test suite covers single-node
update from variable, filtered edge update from variable, and inline
update regression check. User-reported test case `user_test_13` is added
to capture the original bug scenario.

The refactoring makes `Update` consistent with existing variants like
`Upsert`, `UpsertN`, `UpsertE`, and `UpsertV` which already use
named-field structs for similar purposes.

<details><summary><h3>Flowchart</h3></summary>

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["StepType::Update encountered in validate_traversal"] --> B["Extract source variable from current traversal type"]
    B --> C{Current TraversalType?}
    C -->|FromSingle| D["source = Some(var), source_is_plural = false"]
    C -->|FromIter| E["source = Some(var), source_is_plural = true"]
    C -->|Other| F["source = None"]
    D --> G["Set TraversalType::Update struct with source + source_is_plural + properties"]
    E --> G
    F --> G
    G --> H["Code Generation via Display impl"]
    H --> I{source is Some?}
    I -->|Yes, plural| J["G::new_mut_from_iter - db, txn, var.iter().cloned(), arena"]
    I -->|Yes, singular| K["G::new_mut_from - db, txn, var.clone(), arena"]
    I -->|No - inline| L["Block: collect read traversal, then G::new_mut_from_iter"]
    J --> M[".update properties .collect_to_obj?"]
    K --> M
    L --> M
```
</details>


<sub>Last reviewed commit: b5d45ed</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
## Description
when we initialization helix with `helix init` in a folder that contains
a .gitignore with some content that got wriped and overwritten. this PR
fixes that
## Related Issues
Closes #876 

## Checklist when merging to main
<!-- Mark items with "x" when completed -->

- [ ] No compiler warnings (if applicable)
- [ ] Code is formatted with `rustfmt`
- [ ] No useless or dead code (if applicable)
- [ ] Code is easy to understand
- [ ] Doc comments are used for all functions, enums, structs, and
fields (where appropriate)
- [ ] All tests pass
- [ ] Performance has not regressed (assuming change was not to fix a
bug)
- [ ] Version number has been updated in `helix-cli/Cargo.toml` and
`helixdb/Cargo.toml`

## Additional Notes
<!-- Add any additional information that would be helpful for reviewers
-->

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Changed `.gitignore` handling from overwrite to append mode to preserve
existing content when running `helix init`.

- Reads existing `.gitignore` content before adding new entries
- Only appends lines that don't already exist
- Fixes issue #876 where existing `.gitignore` was being wiped

**Issue found**: The duplicate check uses substring matching
(`existing.contains(line)`) which can cause false positives - e.g., if
`.gitignore` contains `my-target/`, the check for `target/` will
incorrectly skip adding it.

<details><summary><h3>Important Files Changed</h3></summary>




| Filename | Overview |
|----------|----------|
| helix-cli/src/commands/init.rs | Fixed gitignore overwriting, but
substring matching logic has edge case bugs |

</details>


</details>


<sub>Last reviewed commit: 3e51c60</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

15 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

fs::write(&gitignore_path, gitignore)?;
let existing = fs::read_to_string(&gitignore_path).unwrap_or_default();
let mut file = fs::OpenOptions::new()
.create(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup tracker deletes pre-existing .gitignore on failure

cleanup_tracker.track_file(gitignore_path) is called unconditionally (line 359), regardless of whether the .gitignore file existed before init ran. Previously this was fine because fs::write always created the file fresh. Now that we append to an existing file, if the init command fails at any later step and triggers cleanup, the tracker will delete a .gitignore that the user already had — along with all their existing content.

The tracking should only happen when the file is newly created:

let file_existed = gitignore_path.exists();
let existing = fs::read_to_string(&gitignore_path).unwrap_or_default();
let mut file = fs::OpenOptions::new()
    .create(true)
    .append(true)
    .open(&gitignore_path)?;
for line in gitignore.lines() {
    let trimmed = line.trim();
    if !existing.lines().any(|l| l.trim() == trimmed) {
        writeln!(file, "{}", line)?;
    }
}
if !file_existed {
    cleanup_tracker.track_file(gitignore_path);
}

Comment on lines +348 to +357
let existing = fs::read_to_string(&gitignore_path).unwrap_or_default();
let mut file = fs::OpenOptions::new()
.create(true)
.append(true)
.open(&gitignore_path)?;
for line in gitignore.lines() {
let trimmed = line.trim();
if !existing.lines().any(|l| l.trim() == trimmed) {
writeln!(file, "{}", line)?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appending without a guard newline may corrupt the last line of an existing .gitignore

If the existing .gitignore does not end with a newline (which is technically valid), the first writeln! call (line 356) will concatenate its content onto the last line of the existing file. For example, if the file ends with node_modules (no trailing \n), the result would be node_modules.helix/ instead of two separate entries.

Consider writing a leading newline when the existing file is non-empty and does not already end with one:

let existing = fs::read_to_string(&gitignore_path).unwrap_or_default();
let mut file = fs::OpenOptions::new()
    .create(true)
    .append(true)
    .open(&gitignore_path)?;
// Ensure we start on a fresh line
if !existing.is_empty() && !existing.ends_with('\n') {
    writeln!(file)?;
}
for line in gitignore.lines() {
    let trimmed = line.trim();
    if !existing.lines().any(|l| l.trim() == trimmed) {
        writeln!(file, "{}", line)?;
    }
}

xav-db and others added 7 commits March 4, 2026 17:54
- Refactor the logic to append new entries to the .gitignore file instead of overwriting existing content.
- Ensure that a newline is added before appending new entries if the file already contains content.
- Add tests to verify that existing entries are not duplicated and that the .gitignore file is preserved during initialization failures.

This change addresses the issue of losing existing .gitignore content when running `helix init`, improving the overall user experience.
Make BM25 explicitly node-only by removing vector indexing, rebuild schema v2 from node state on startup, and fail loudly when forward, reverse, and doc-length state diverge.
<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR hardens the BM25 full-text search subsystem by introducing a
**reverse index** (`doc_id → list of (term, tf)`) alongside the existing
inverted index, enabling correct and efficient delete/update operations
without re-scanning the inverted index. It also adds **schema
versioning** so that on startup, if the stored index was built without
the reverse index, the migration layer automatically clears and rebuilds
the entire BM25 index from `nodes_db`. The `add_n`, `update`, and
`upsert` traversal operators are updated to maintain BM25 invariants as
nodes are created or changed.

Key changes:
- **`bm25.rs`**: Adds `ReversePostingEntry`, `DocPresence`/`DocState`
enums, `reverse_index_db` (DUP_SORT), schema version read/write, and
refactored `insert_doc` / `delete_doc` / `update_doc` that enforce
strict consistency invariants (returns errors on index corruption).
- **`storage_migration.rs`**: `migrate_bm25` clears and rebuilds the
BM25 index in 1 024-node batches when the stored schema version doesn't
match `BM25_SCHEMA_VERSION = 2`.
- **`add_n.rs`**: `bm25.insert_doc` is called after writing the node —
but without a guard that checks whether the `nodes_db` write succeeded,
allowing BM25 errors to mask prior errors.
- **`update.rs` / `upsert.rs`**: BM25 is correctly updated before
(update) or after (upsert new node) the `nodes_db` write, with proper
error propagation.
- Comprehensive new tests cover the reverse index, schema migration,
update-to-searchable, and invariant enforcement.

<details><summary><h3>Important Files Changed</h3></summary>




| Filename | Overview |
|----------|----------|
| helix-db/src/helix_engine/bm25/bm25.rs | Core BM25 rewrite adding a
reverse index (doc_id → terms) and schema versioning. Logic is sound
overall; minor duplication between reverse_entries and
reverse_entries_rw. |
| helix-db/src/helix_engine/traversal_core/ops/source/add_n.rs | BM25
insert_doc is called unconditionally after nodes_db write, even on prior
failure — can mask the original error if BM25 also fails. |
| helix-db/src/helix_engine/storage_core/storage_migration.rs | Adds
migrate_bm25 that clears and rebuilds the BM25 index from nodes_db.
Holding read_txn alive across batch write transactions can cause LMDB
freelist growth; nodes without properties are silently skipped
(consistent with add_n but deserves a comment). |
| helix-db/src/helix_engine/traversal_core/ops/util/update.rs |
Correctly calls bm25.update_doc before nodes_db write; BM25 error
short-circuits the node save, and both are in the same transaction so
rollback is safe. |
| helix-db/src/helix_engine/traversal_core/ops/util/upsert.rs | BM25
insert_doc (new node) and update_doc (existing node) are correctly
integrated into upsert_n; error propagation follows the existing ?
pattern. |

</details>


</details>


<details><summary><h3>Sequence Diagram</h3></summary>

```mermaid
sequenceDiagram
    participant Caller
    participant migrate_bm25
    participant BM25Index
    participant nodes_db
    participant add_n / update / upsert

    Note over Caller, BM25Index: Startup Migration Path
    Caller->>migrate_bm25: migrate(storage)
    migrate_bm25->>BM25Index: schema_version(read_txn)
    alt version matches BM25_SCHEMA_VERSION
        BM25Index-->>migrate_bm25: Some(2) — skip
    else outdated or missing
        migrate_bm25->>BM25Index: clear_all(write_txn)
        migrate_bm25->>nodes_db: iter(read_txn) — batch by 1024
        loop Each batch
            migrate_bm25->>BM25Index: insert_doc per node with properties (write_txn)
        end
        migrate_bm25->>BM25Index: write_schema_version(2, write_txn)
    end

    Note over Caller, BM25Index: Normal Write Path
    Caller->>add_n / update / upsert: write op (RwTxn)
    add_n / update / upsert->>nodes_db: put node
    add_n / update / upsert->>BM25Index: insert_doc / update_doc / delete_doc
    BM25Index->>BM25Index: update inverted_index_db (term→postings)
    BM25Index->>BM25Index: update reverse_index_db (doc_id→terms)
    BM25Index->>BM25Index: update doc_lengths_db
    BM25Index->>BM25Index: update term_frequencies_db
    BM25Index->>BM25Index: update metadata (total_docs, avgdl)
    add_n / update / upsert-->>Caller: Result<TraversalValue>
```
</details>


<!-- greptile_failed_comments -->
<details><summary><h3>Comments Outside Diff (1)</h3></summary>

1. `helix-db/src/helix_engine/traversal_core/ops/source/add_n.rs`, line
141-148
([link](https://github.com/helixdb/helix-db/blob/8d28406c1ec70c56ba3100205080a9bb7a9ea17c/helix-db/src/helix_engine/traversal_core/ops/source/add_n.rs#L141-L148))

   **BM25 insert runs unconditionally, can mask prior errors**

The BM25 `insert_doc` call (lines 141-148) runs even when a previous
operation failed (secondary-index insertion or
`nodes_db.put_with_flags`). If `bm25.insert_doc` then also fails, it
silently overwrites the original error stored in `result`, making the
caller receive a BM25 error instead of the real `nodes_db` or
secondary-index error. Adding an early-exit guard fixes both the error
masking and the wasted BM25 work:

</details>

<!-- /greptile_failed_comments -->

<sub>Last reviewed commit: 8d28406</sub>

<!-- /greptile_comment -->
<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR fixes sync reconciliation so that the correct remote queries
path (read from the cluster's snapshot `helix.toml`) is used instead of
always falling back to the local project path. It refactors several sync
functions to accept explicit `current_queries_dir` /
`target_queries_dir` arguments, adds enterprise-cluster parity to all
sync flows, extracts a reusable `build_standard_deploy_payload` helper
(now including `runtime_config`), and adds meaningful unit-test coverage
for the new merge/extraction helpers.

**Key issues:**
- `pull_remote_snapshot_into_local` handles directory migration when the
remote queries path differs from the local one, but the migration code
path deletes files from `current_queries_dir` **before** writing to
`target_queries_dir`, risking data loss if a write fails mid-way (the
same-dir code path retains the safe write-then-delete order from the
original implementation).
- `insert_unique_cloud_instance_name` and
`insert_unique_enterprise_instance_name` introduce redundant
double-clones that can be simplified.

<details><summary><h3>Sequence Diagram</h3></summary>

```mermaid
sequenceDiagram
    participant CLI
    participant Cloud as Helix Cloud API
    participant FS as Local Filesystem

    Note over CLI: sync (project flow)
    CLI->>Cloud: fetch_project_clusters()
    CLI->>Cloud: fetch_standard/enterprise_cluster_snapshot_config(selected_cluster)
    Cloud-->>CLI: HelixConfig (remote helix.toml)
    CLI->>CLI: resolve_selected_project_queries_path(snapshot)

    alt Standard cluster
        CLI->>Cloud: fetch_sync_response(cluster_id)
        Cloud-->>CLI: SyncResponse (.hx files + helix_toml)
        CLI->>CLI: compare_manifests(local, remote)
        alt Pull
            CLI->>FS: pull_remote_snapshot_into_local(current_dir, target_dir, ...)
            CLI->>FS: update_project_queries_path_in_helix_toml()
        else Push
            CLI->>Cloud: push_local_snapshot_to_cluster()
        end
    else Enterprise cluster
        CLI->>Cloud: fetch_enterprise_sync_response(cluster_id)
        Cloud-->>CLI: EnterpriseSyncResponse (.rs files + helix_toml)
        CLI->>FS: write .rs files to target_queries_dir
        CLI->>FS: update_project_queries_path_in_helix_toml()
    end

    CLI->>Cloud: fetch_project_clusters_for_[standard|enterprise]_cluster()
    Cloud-->>CLI: CliProjectClusters
    CLI->>Cloud: fetch_[standard|enterprise]_cluster_snapshot_configs(all clusters)
    Cloud-->>CLI: snapshot HelixConfigs
    CLI->>FS: reconcile_project_config_from_cloud() → write helix.toml
```
</details>

<sub>Last reviewed commit: c6b0d52</sub>

> Greptile also left **1 inline comment** on this PR.

<!-- /greptile_comment -->
@xav-db xav-db merged commit 960c46d into main Mar 8, 2026
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants