tabular data ingestion by tomer-levin-nv · Pull Request #1720 · NVIDIA/NeMo-Retriever

tomer-levin-nv · 2026-03-25T15:50:00Z

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

…hemas_dal
- extract_data.py: add missing `queries` variable to tuple unpack from create_dataframe (returned 6 values, expected 5)
- schemas_dal.py: add missing commas between all fields in WITH clause and collect({}) map literal in get_schema_columns query
Made-with: Cursor
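
The tuple-unpack fix described above can be illustrated with a minimal sketch. The `create_dataframe` below is a hypothetical stand-in that only mimics the arity mentioned in the commit message (six returned values); the real function's return values and names are not shown in this PR, so every name other than `queries` is an assumption.

```python
def create_dataframe():
    # Hypothetical stand-in: returns six placeholder values.
    # The real function's return values are not shown in the PR.
    return "schemas", "tables", "columns", "views", "foreign_keys", "queries"


def unpack_five():
    # Mirrors the bug: five targets for six returned values.
    schemas, tables, columns, views, foreign_keys = create_dataframe()
    return schemas


def unpack_six():
    # Mirrors the fix: the sixth target, `queries`, is added.
    schemas, tables, columns, views, foreign_keys, queries = create_dataframe()
    return queries


try:
    unpack_five()
    error_message = None
except ValueError as exc:
    # Python raises ValueError when the target count does not match.
    error_message = str(exc)

result = unpack_six()
print(error_message)
print(result)
```

Because the mismatch only surfaces at call time, a unit test that actually invokes the call site is likely the cheapest way to catch a regression of this kind.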

…on and ExtractParams defaults
- Rename EmbedParams.embedding_api_key -> api_key and update all call sites (batch.py, inprocess.py)
- Add strict modality validation to EmbedParams (raise on invalid, replace silent image_text remap)
- Add VALID_EMBED_MODALITIES constant; narrow IMAGE_MODALITIES to exclude image_text
- Update ExtractParams defaults: method="pdfium", image_format="jpeg", jpeg_quality=100, render_mode="fit_to_model"
- Raise TextChunkParams.max_tokens default from 512 to 1024
- Simplify executor.run_mode_ingest: remove embed_params/vdb_params args (callers set these on ingestor directly)
- Remove unused metrics_parser.py
- Add client dependency tweak (client/pyproject.toml)
Made-with: Cursor
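
As a rough sketch of what "strict modality validation" could look like, the snippet below raises on an unknown modality instead of silently remapping it. It is not the PR's implementation: `EmbedParamsSketch`, the `modality` field name, and the members of `VALID_EMBED_MODALITIES` are assumptions for illustration; only the constant's name and the `api_key` field name come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed members, for illustration only; the PR does not list the real values.
VALID_EMBED_MODALITIES = frozenset({"text", "image", "image_text"})


@dataclass
class EmbedParamsSketch:
    # Hypothetical stand-in for EmbedParams.
    modality: str = "text"
    api_key: Optional[str] = None  # renamed from embedding_api_key per the commit

    def __post_init__(self) -> None:
        # Fail loudly rather than remapping an unexpected value.
        if self.modality not in VALID_EMBED_MODALITIES:
            allowed = ", ".join(sorted(VALID_EMBED_MODALITIES))
            raise ValueError(
                f"Invalid modality {self.modality!r}; expected one of: {allowed}"
            )


ok = EmbedParamsSketch(modality="image_text")
try:
    EmbedParamsSketch(modality="audio")
except ValueError as exc:
    print(exc)
```

Raising at construction time means a misconfigured pipeline stops before any embedding call is made, which is the usual motivation for replacing a silent remap.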

Renames all public API methods, params classes, internal attributes, constants, and helper functions that used 'structured' to 'tabular' to better reflect that the pipeline operates on relational/tabular data.
Key changes:
- Params: Structured*Params → Tabular*Params (and remove unused TabularPIIParams)
- Ingestor methods: pull/store/populate/generate/get_*structured* → *tabular*
- BatchIngestor: ingest_structured → ingest_tabular; _structured_* attrs → _tabular_*
- Executor/runners: run_mode_ingest_structured → run_mode_ingest_tabular; run_batch_structured → run_batch_tabular (file renamed accordingly)
- LanceDB table constant: _STRUCTURED_TABLE/"nv-ingest-structured" → _TABULAR_TABLE/"nv-ingest-tabular"
- Helper: data_for_populate_structured → data_for_populate_tabular
Made-with: Cursor

…nectors/ingestion/retrieval/neo4j
Renames the top-level folder from relational_db to tabular_data and restructures its contents into four clear sub-packages:
- connectors/: DB connectors (DuckDB, Spider2)
- ingestion/: extract_data, population/graph, prepare_for_embedding
- retrieval/: generate_sql (merged from generate_sql/ facade + sql_tool/)
- neo4j/: Neo4j connection management (was neo4j_connection/)
Updates all import paths across the codebase accordingly.
Made-with: Cursor

…Cypher queries
- Add Edges class to reserved_words.py with CONTAINS, CONNECTING, FOREIGN_KEY constants
- Make RelTypes inherit from Edges for backward compatibility
- Replace all hardcoded node labels and relationship type strings in db_dal.py with Labels/Edges constants
- Remove is_temp guard and its dependency on Labels.TEMP_SCHEMA in update_diff_from_existing_schema
Made-with: Cursor
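
The constants pattern described above might look roughly like the following. This is a sketch under assumptions: the string values of the constants, the `Labels` members, and the `tables_in_schema_query` helper are hypothetical; only the class names `Edges`, `RelTypes`, and `Labels` and the three edge constant names come from the commit message.

```python
class Edges:
    # Relationship type names; the string values here are assumed.
    CONTAINS = "CONTAINS"
    CONNECTING = "CONNECTING"
    FOREIGN_KEY = "FOREIGN_KEY"


class RelTypes(Edges):
    """Kept so older call sites using RelTypes.* still resolve."""


class Labels:
    # Hypothetical node labels in PascalCase.
    SCHEMA = "Schema"
    TABLE = "Table"


def tables_in_schema_query() -> str:
    # Hypothetical helper: builds the Cypher text from constants
    # instead of hardcoding label and relationship strings.
    return (
        f"MATCH (s:{Labels.SCHEMA})-[:{Edges.CONTAINS}]->(t:{Labels.TABLE}) "
        "RETURN t.name AS name"
    )


print(tables_in_schema_query())
print(RelTypes.FOREIGN_KEY)
```

Since labels and relationship types cannot be passed as ordinary Cypher parameters, interpolating them from a fixed set of constants (never from user input) is a common compromise.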

…are_embedding_text sibling
- Remove the population/ wrapper directory; graph/ and populate_data.py now live directly under ingestion/
- Move prepare_for_embedding/prepare_embedding_text.py to ingestion/prepare_embedding_text.py and remove the now-empty subdirectory
- Update all ingestion.population.* imports to ingestion.* across the codebase
Made-with: Cursor

- Move duckdb and spider2 imports to top-level; remove lazy try/except blocks
- Rename prepare_embedding_text.py to embeddings.py; update import in batch.py
- Remove is_temp support: drop TEMP_SCHEMA/TABLE/COLUMN labels, params, and DataFrame assignments
- Remove label_to_type function and all call sites across schema, node, and utils_dal
- Remove include_deleted parameter; hardcode deleted-record filter in all DAL queries
- Delete dead tables_dal.py and unused entity_exists_in_graph_insensitive function
- Drop Connection constraint from indexes.py (label no longer exists)
- Remove NULL AS "created" column from duckdb get_tables query
Made-with: Cursor

Relocate setup_spider2.py, spider2_loader.py, and SPIDER2_SETUP.md from connectors/ into a new nemo_retriever/tabular-dev-tools/ folder alongside tests/. Update spider2_loader import to a local sibling import and fix docstring run paths accordingly. Made-with: Cursor

…ame and remove debug file
- Merge fetch_relational_db_for_embedding + neo4j_tables_result_to_embedding_dataframe into a single fetch_tabular_embedding_dataframe in embeddings.py
- Move the import to the top of batch.py; simplify call site to check df.empty directly
- Delete debug_run_mode_ingest.py (unreferenced debug script)
Made-with: Cursor

…n up schema ingestion
- Drop account_id from Neo4j uniqueness constraint and index
- Delete unused docker-compose.neo4j.yaml
- Remove table_type and ordinal_position from DuckDB schema queries
- Remove table property diffing and single-node update helper from db_dal
- Simplify column diff tracking to data_type and is_nullable only
- Add comment to update_properties_in_graph_batch
Made-with: Cursor

…_time, last_altered, default, length, scale from table/column models
Simplifies the schema by keeping only essential fields (created, description for tables; data_type, is_nullable, ordinal_position, description for columns) and renames comment -> description throughout.
Made-with: Cursor

tomer-levin-nv and others added 30 commits March 25, 2026 17:31

add ingest_structured entry point

bb76bad

structured methods placeholders for inprocess ingestor

93e4250

run ingest structured as future task

8161d1a

neo4j

fae96f1

DuckDB and spider

7e46e17

fix

99fd509

DuckDB fixes

edab134

use neo_con from the store

d7a3a29

fix

6c46669

fix file

6eb23d3

Revert "fix"

3f6ddfa

This reverts commit 2e71414.

adding db population files

7b0bd89

Add description_suggestion module (description_dal, functions, system…

8f17215

…_prompts)

update the uses of neo4j and duckdb

e5c67f1

remove bi

da33094

add neo4j txt retrieval for embedding

439bff0

trying to run

16a52fa

fixes to db retrieval funcs

8a075f9

fix imports

c5d90d7

Merge remote-tracking branch 'origin/add-population-files' into fix/f…

bd371c4

…ix-duckdb_engine Made-with: Cursor

fix duckdb

bd6f88f

ingest_structured pipeline

39cce86

clean population

9584de0

extract data and docker neo4j compose

4fd1d3c

prepare for embeddings

fab43e9

remove coalesce

3694743

init embed and vdb params

5384ca4

remove unused functions from prepare_embedding_text.py

3a58fc8

rename node labels to PascalCase and :schema relationship to :CONTAINS

430236a

add lance db table for structured

a32ebb6

tomer-levin-nv and others added 26 commits March 25, 2026 17:32

refactor: use __name__ for logger names across relational_db graph mo…

314380e

…dule Made-with: Cursor
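
For readers unfamiliar with the convention this commit adopts, a minimal sketch follows. `module_logger_name` is a hypothetical helper added only to make the effect visible; the actual modules touched by the commit are not shown here.

```python
import logging

# Naming the logger after the module means log records carry the
# module's dotted import path, so handlers and levels can be set
# per package without hardcoded strings.
logger = logging.getLogger(__name__)


def module_logger_name() -> str:
    # Hypothetical helper: exposes the resolved logger name.
    return logger.name


logger.debug("schema sync started")  # emitted only if DEBUG is enabled
print(module_logger_name())
```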

fix: remove duplicate structured pipeline methods and unused imports …

f17391a

…in ingestor.py Made-with: Cursor

style: fix pre-commit issues (end-of-file newlines, black formatting,…

c6a44eb

… comment spacing) Made-with: Cursor

refactor: rename populate_structured_data → populate_tabular_data

1e22116

Made-with: Cursor

fix(duckdb): make connection param required (non-optional)

e47c041

Made-with: Cursor

update neo4j

aeddf2e

refactor(schemas_dal): remove unused column fields from graph query

294dd35

Drop ordinal_position, default, length, comment, and scale from the column fetch query, keeping only data_type and is_nullable. Made-with: Cursor

refactor(schemas_dal): remove deleted-node filters from all schema qu…

407eb47

…eries Made-with: Cursor

fix connector

e920be6

refactor(tabular_data): remove unused added_or_modified_tables tracking

0fe3b49

Made-with: Cursor

refactor(indexes): remove unused fulltext name_index

3174ea5

The fulltext index was never queried anywhere in the codebase and was also being redundantly re-created on every loop iteration. Made-with: Cursor

format

b200104

refactor(tabular_data): rename load_* to normalize_* and extract_rela…

50c2f39

…tional_db_data to extract_tabular_db_data Functions actively normalize and coerce DataFrame types rather than just loading, so the new names better reflect their behaviour. Made-with: Cursor

refactor(tabular_data): rename populate_data.py to write_to_graph.py

9aa62e8

The new name better describes the file's responsibility: writing parsed tabular data as nodes and edges into Neo4j. Made-with: Cursor

pyproject

3c1dc3f

tomer-levin-nv requested review from a team as code owners March 25, 2026 15:50

tomer-levin-nv requested a review from edknv March 25, 2026 15:50

Merge branch 'main' into staging

3cc135c

tomer-levin-nv wants to merge 90 commits into NVIDIA:main from ftatiana-nv:staging

6 participants
