tabular data ingestion #1720

Open
tomer-levin-nv wants to merge 90 commits into NVIDIA:main from ftatiana-nv:staging

Conversation

@tomer-levin-nv
Collaborator

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

tomer-levin-nv and others added 26 commits March 25, 2026 17:32
…hemas_dal

- extract_data.py: add missing `queries` variable to tuple unpack from create_dataframe (returned 6 values, expected 5)
- schemas_dal.py: add missing commas between all fields in WITH clause and collect({}) map literal in get_schema_columns query
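
The first fix above is a tuple-unpack mismatch. A minimal sketch of that failure mode follows; `create_dataframe` here is a hypothetical stand-in with made-up return values, not the function from extract_data.py, and only the `queries` name and the 6-versus-5 count come from the commit message.

```python
def create_dataframe():
    """Hypothetical stand-in: returns six values, like the callee in the commit."""
    tables, columns, foreign_keys, views, schemas, queries = (
        ["t"], ["c"], ["fk"], ["v"], ["s"], ["q"],
    )
    return tables, columns, foreign_keys, views, schemas, queries


def unpack_five():
    # Buggy call site: five target names for six returned values.
    # Python raises ValueError here.
    a, b, c, d, e = create_dataframe()
    return a, b, c, d, e


def unpack_six():
    # Fixed call site: `queries` added to the unpack target.
    tables, columns, foreign_keys, views, schemas, queries = create_dataframe()
    return queries
```

Calling `unpack_five()` fails at the assignment, before any of the values are used, which is why the mismatch only shows up at runtime.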

Made-with: Cursor
…on and ExtractParams defaults

- Rename EmbedParams.embedding_api_key -> api_key and update all call sites (batch.py, inprocess.py)
- Add strict modality validation to EmbedParams (raise on invalid, replace silent image_text remap)
- Add VALID_EMBED_MODALITIES constant; narrow IMAGE_MODALITIES to exclude image_text
- Update ExtractParams defaults: method="pdfium", image_format="jpeg", jpeg_quality=100, render_mode="fit_to_model"
- Raise TextChunkParams.max_tokens default from 512 to 1024
- Simplify executor.run_mode_ingest: remove embed_params/vdb_params args (callers set these on ingestor directly)
- Remove unused metrics_parser.py
- Add client dependency tweak (client/pyproject.toml)
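
The strict modality check described above could look roughly like the sketch below. Everything beyond the names `EmbedParams`, `api_key`, `VALID_EMBED_MODALITIES`, and `IMAGE_MODALITIES` is an assumption: the `modality` field name, the members of both sets, and the use of a plain dataclass are illustrative only, and the commit message does not say whether `image_text` remains a valid value.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed members; the commit names these constants but does not list their values.
VALID_EMBED_MODALITIES = frozenset({"text", "image", "text_image"})
IMAGE_MODALITIES = frozenset({"image", "text_image"})


@dataclass
class EmbedParams:
    modality: str = "text"         # hypothetical field name
    api_key: Optional[str] = None  # renamed from embedding_api_key in this commit

    def __post_init__(self) -> None:
        # Fail loudly instead of silently remapping an unknown value.
        if self.modality not in VALID_EMBED_MODALITIES:
            raise ValueError(
                f"Invalid embed modality {self.modality!r}; "
                f"expected one of {sorted(VALID_EMBED_MODALITIES)}"
            )
```

Raising at construction time surfaces a misconfigured modality at the call site rather than later in the embedding stage.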

Made-with: Cursor
Renames all public API methods, params classes, internal attributes,
constants, and helper functions that used 'structured' to 'tabular'
to better reflect that the pipeline operates on relational/tabular data.

Key changes:
- Params: Structured*Params → Tabular*Params (and remove unused TabularPIIParams)
- Ingestor methods: pull/store/populate/generate/get_*structured* → *tabular*
- BatchIngestor: ingest_structured → ingest_tabular; _structured_* attrs → _tabular_*
- Executor/runners: run_mode_ingest_structured → run_mode_ingest_tabular;
  run_batch_structured → run_batch_tabular (file renamed accordingly)
- LanceDB table constant: _STRUCTURED_TABLE/"nv-ingest-structured" → _TABULAR_TABLE/"nv-ingest-tabular"
- Helper: data_for_populate_structured → data_for_populate_tabular

Made-with: Cursor
…nectors/ingestion/retrieval/neo4j

Renames the top-level folder from relational_db to tabular_data and
restructures its contents into four clear sub-packages:

- connectors/   — DB connectors (DuckDB, Spider2)
- ingestion/    — extract_data, population/graph, prepare_for_embedding
- retrieval/    — generate_sql (merged from generate_sql/ facade + sql_tool/)
- neo4j/        — Neo4j connection management (was neo4j_connection/)

Updates all import paths across the codebase accordingly.

Made-with: Cursor
…Cypher queries

- Add Edges class to reserved_words.py with CONTAINS, CONNECTING, FOREIGN_KEY constants
- Make RelTypes inherit from Edges for backward compatibility
- Replace all hardcoded node labels and relationship type strings in db_dal.py with Labels/Edges constants
- Remove is_temp guard and its dependency on Labels.TEMP_SCHEMA in update_diff_from_existing_schema
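
As an illustration of the constants-over-literals change above, here is a minimal sketch. The class names `Labels`, `Edges`, and `RelTypes` and the three edge constant names come from the commit message; the string values, the `Labels` members, and the `table_columns_query` helper are assumptions, not the contents of reserved_words.py or db_dal.py.

```python
class Labels:
    # Assumed node labels, for illustration only.
    TABLE = "Table"
    COLUMN = "Column"


class Edges:
    # Constant names are from the commit message; the string values are assumed.
    CONTAINS = "CONTAINS"
    CONNECTING = "CONNECTING"
    FOREIGN_KEY = "FOREIGN_KEY"


class RelTypes(Edges):
    """Subclass kept so existing RelTypes.X references keep resolving."""


def table_columns_query() -> str:
    # Hypothetical helper: labels and relationship types are interpolated
    # from constants, while data values stay as $parameters.
    return (
        f"MATCH (t:{Labels.TABLE} {{name: $table_name}})"
        f"-[:{Edges.CONTAINS}]->(c:{Labels.COLUMN}) "
        "RETURN c.name AS name"
    )
```

With this layout a label or relationship type is renamed in one place, and a typo becomes an `AttributeError` at import or call time instead of a query that silently matches nothing.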

Made-with: Cursor
…are_embedding_text sibling

- Remove the population/ wrapper directory; graph/ and populate_data.py now live directly under ingestion/
- Move prepare_for_embedding/prepare_embedding_text.py to ingestion/prepare_embedding_text.py and remove the now-empty subdirectory
- Update all ingestion.population.* imports to ingestion.* across the codebase

Made-with: Cursor
- Move duckdb and spider2 imports to top-level; remove lazy try/except blocks
- Rename prepare_embedding_text.py to embeddings.py; update import in batch.py
- Remove is_temp support: drop TEMP_SCHEMA/TABLE/COLUMN labels, params, and DataFrame assignments
- Remove label_to_type function and all call sites across schema, node, and utils_dal
- Remove include_deleted parameter; hardcode deleted-record filter in all DAL queries
- Delete dead tables_dal.py and unused entity_exists_in_graph_insensitive function
- Drop Connection constraint from indexes.py (label no longer exists)
- Remove NULL AS "created" column from duckdb get_tables query

Made-with: Cursor
Relocate setup_spider2.py, spider2_loader.py, and SPIDER2_SETUP.md
from connectors/ into a new nemo_retriever/tabular-dev-tools/ folder
alongside tests/. Update spider2_loader import to a local sibling
import and fix docstring run paths accordingly.

Made-with: Cursor
…ame and remove debug file

- Merge fetch_relational_db_for_embedding + neo4j_tables_result_to_embedding_dataframe into a single fetch_tabular_embedding_dataframe in embeddings.py
- Move the import to the top of batch.py; simplify call site to check df.empty directly
- Delete debug_run_mode_ingest.py (unreferenced debug script)

Made-with: Cursor
…n up schema ingestion

- Drop account_id from Neo4j uniqueness constraint and index
- Delete unused docker-compose.neo4j.yaml
- Remove table_type and ordinal_position from DuckDB schema queries
- Remove table property diffing and single-node update helper from db_dal
- Simplify column diff tracking to data_type and is_nullable only
- Add comment to update_properties_in_graph_batch
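
Restricting column diff tracking to `data_type` and `is_nullable` can be sketched as below. The `diff_column` helper and the dict-based column representation are hypothetical; only the two tracked field names come from the commit message.

```python
from typing import Dict, Mapping

# The two fields the commit says are still tracked.
TRACKED_COLUMN_FIELDS = ("data_type", "is_nullable")


def diff_column(
    existing: Mapping[str, object], incoming: Mapping[str, object]
) -> Dict[str, object]:
    """Return the tracked fields whose incoming value differs from the stored one."""
    return {
        field: incoming.get(field)
        for field in TRACKED_COLUMN_FIELDS
        if existing.get(field) != incoming.get(field)
    }


# Usage: a change in an untracked field (ordinal_position) is ignored.
existing = {"data_type": "INTEGER", "is_nullable": True, "ordinal_position": 1}
incoming = {"data_type": "BIGINT", "is_nullable": True, "ordinal_position": 3}
changes = diff_column(existing, incoming)
```

In this usage `changes` contains only the `data_type` entry, so a column that merely moved position would not trigger a graph update.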

Made-with: Cursor
Drop ordinal_position, default, length, comment, and scale from the
column fetch query, keeping only data_type and is_nullable.

Made-with: Cursor
…_time, last_altered, default, length, scale from table/column models

Simplifies the schema by keeping only essential fields (created, description for tables; data_type, is_nullable, ordinal_position, description for columns) and renames comment -> description throughout.

Made-with: Cursor
The fulltext index was never queried anywhere in the codebase and was
also being redundantly re-created on every loop iteration.

Made-with: Cursor
…tional_db_data to extract_tabular_db_data

Functions actively normalize and coerce DataFrame types rather than
just loading, so the new names better reflect their behaviour.

Made-with: Cursor
The new name better describes the file's responsibility: writing
parsed tabular data as nodes and edges into Neo4j.

Made-with: Cursor
@tomer-levin-nv tomer-levin-nv requested review from a team as code owners March 25, 2026 15:50
@tomer-levin-nv tomer-levin-nv requested a review from edknv March 25, 2026 15:50

Labels

None yet

Projects

None yet

6 participants