
Conversation


@NoritakaIkeda (Member) commented Sep 29, 2025

Issue

Why is this change needed?

This PR adds dictionary-based concept matching to the schema benchmarking evaluation system to improve accuracy when comparing predicted and reference schemas.

Without Dictionary

| Metric | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 | Average | Std. Dev. |
|---|---|---|---|---|---|---|---|
| tableRecall | 0.4286 | 0.4286 | 0.4286 | 0.5714 | 0.7143 | 0.5143 | 0.1210 |
| columnRecallAverage | 0.3095 | 0.2619 | 0.3333 | 0.4286 | 0.4667 | 0.3600 | 0.0790 |
| foreignKeyRecall | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |

With Dictionary

| Metric | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 | Average | Std. Dev. |
|---|---|---|---|---|---|---|---|
| tableRecall | 0.8571 | 0.8571 | 0.7143 | 0.8571 | 0.7143 | 0.8000 | 0.0730 |
| columnRecallAverage | 0.6548 | 0.6071 | 0.5833 | 0.6429 | 0.4667 | 0.5909 | 0.0700 |
| foreignKeyRecall | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |

What this PR includes:

Core Features

  • Dictionary-based matching: Uses pre-registered concept aliases (e.g., "customer_info" → ["customer_info", "client_data"]) for semantic matching
  • Token-based fallback: Implements Jaccard similarity with configurable threshold (default 0.8) when direct aliases aren't found
  • Generic token filtering: Filters out common generic tokens ("info", "data", "table", etc.) to improve matching quality
  • Global concept dictionary: Ships with common database schema patterns and concepts
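The two-stage strategy above (direct alias lookup, then Jaccard-based token inference) can be sketched end to end. This is a minimal, self-contained illustration: `resolveConcept`, the one-entry dictionary, and the generic-token list here are assumptions for the example, not the shipped implementation.

```typescript
// Sketch of dictionary-based matching with a token-based fallback.
// All names and the sample dictionary are illustrative, not the real code.
type Concept = { id: string; aliases: string[] }

const concepts: Concept[] = [
  { id: 'customer_info', aliases: ['customer_info', 'client_data'] },
]

// Generic tokens are filtered out before comparing token sets.
const GENERIC_TOKENS = new Set(['info', 'data', 'table'])

const tokenize = (name: string): Set<string> =>
  new Set(
    name
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((t) => t.length > 0 && !GENERIC_TOKENS.has(t)),
  )

// Set-based Jaccard similarity: |A ∩ B| / |A ∪ B|.
const jaccard = (a: Set<string>, b: Set<string>): number => {
  if (a.size === 0 && b.size === 0) return 0
  let shared = 0
  for (const t of a) if (b.has(t)) shared++
  return shared / (a.size + b.size - shared)
}

const resolveConcept = (name: string, threshold = 0.8): string | null => {
  // Stage 1: direct alias hit.
  for (const c of concepts) {
    if (c.aliases.includes(name)) return c.id
  }
  // Stage 2: token-based inference against each concept's aliases.
  const tokens = tokenize(name)
  for (const c of concepts) {
    for (const alias of c.aliases) {
      if (jaccard(tokens, tokenize(alias)) >= threshold) return c.id
    }
  }
  return null
}
```

With this sketch, `client_data` resolves via the direct alias hit, while `customer_data` falls through to token inference: the generic token `data` is filtered out, leaving `{customer}` on both sides and a Jaccard score of 1.0.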

Technical Implementation

  • Adds dictionaryMatch() function as the first matching strategy in the evaluation pipeline
  • Supports concept normalization with CamelCase splitting and canonicalization
  • Uses Set-based Jaccard similarity for efficient token overlap scoring
  • Modular design allows custom dictionaries and configurable thresholds
  • Seamless integration with existing table and column matching workflow
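As an example of the normalization step, CamelCase splitting plus canonicalization might look like the following. This is a hedged sketch; the package's actual regexes may differ.

```typescript
// Illustrative alias normalization: split CamelCase boundaries, unify
// separators to underscores, lowercase, and trim stray underscores, so
// "OrderDetails", "order-details", and "order_details" all canonicalize
// to the same string.
const normalizeAlias = (s: string): string =>
  s
    .replace(/([a-z0-9])([A-Z])/g, '$1_$2') // camelCase boundary -> underscore
    .replace(/[^a-zA-Z0-9]+/g, '_')         // any separator run -> underscore
    .toLowerCase()
    .replace(/^_+|_+$/g, '')                // trim leading/trailing underscores
```

Note that this simple boundary regex does not split acronym runs (e.g. `HTTPServer` becomes `httpserver`), a known limitation of this style of split.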

Files Added

  • src/dictionaryMatch/dictionaryMatch.ts - Core dictionary matching implementation
  • src/dictionaryMatch/dictionaryMatch.test.ts - Comprehensive test coverage
  • src/dictionaries/global.concepts.json - Default concept dictionary with common patterns
  • Updated evaluation pipeline to use dictionary matching before name similarity and word overlap
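For reference, the shape of a concept dictionary file can be expressed as a TypeScript type with an illustrative literal. The entries below are examples, not the shipped contents of global.concepts.json.

```typescript
// Illustrative shape of a concept dictionary file. The concrete entries
// here are examples only.
type ConceptDictFile = {
  concepts: { id: string; aliases: string[] }[]
  generic_tokens?: string[]
}

const exampleDict: ConceptDictFile = {
  concepts: [
    { id: 'customer_info', aliases: ['customer_info', 'client_data'] },
    { id: 'order_detail', aliases: ['order_details', 'order_items'] },
  ],
  generic_tokens: ['info', 'data', 'table'],
}
```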

Test Coverage

  • Tests for direct alias matching
  • Tests for token-based inference with various thresholds
  • Tests for generic token filtering behavior
  • Tests for integration with existing evaluation workflow

This enhancement improves schema evaluation accuracy by recognizing semantically equivalent table/column names that may use different naming conventions but represent the same concepts.

Summary by CodeRabbit

  • New Features

    • Added concept-based synonym matching using a shared concepts dictionary to improve table/column mapping.
    • Concept inference with token-similarity is applied early in evaluation to boost mapping accuracy.
    • Dictionary matching utilities are now part of the package public API.
    • Shipped default concept dictionary with expanded concept/alias mappings.
  • Tests

    • Added tests validating concept lookups and synonym matching, including positive and negative cases.

…luation

Implement semantic concept matching using a pre-registered dictionary of alias groups to improve schema benchmarking accuracy.

Features:
- Dictionary-based matching with concept aliases (e.g., "customer_info" → ["customer_info", "client_data"])
- Token-based fallback matching using Jaccard similarity with configurable threshold (default 0.8)
- Generic token filtering to improve matching quality (filters out "info", "data", "table", etc.)
- Global concept dictionary with common database schema patterns
- Seamless integration with existing table and column matching pipeline
- Comprehensive test coverage for all matching scenarios

Technical details:
- Adds dictionaryMatch() function as first matching strategy before name similarity and word overlap
- Supports concept normalization with CamelCase splitting and canonicalization
- Uses Set-based Jaccard similarity for token overlap scoring
- Modular design allows custom dictionaries and configurable thresholds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

vercel bot commented Sep 29, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| liam-app | Ready | Preview | Comment | Sep 30, 2025 0:56am |
| liam-assets | Ready | Preview | Comment | Sep 30, 2025 0:56am |
| liam-storybook | Ready | Preview | Comment | Sep 30, 2025 0:56am |

2 Skipped Deployments

| Project | Deployment | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| liam-docs | Ignored | Preview | — | Sep 30, 2025 0:56am |
| liam-erd-sample | Skipped | — | — | Sep 30, 2025 0:56am |


changeset-bot bot commented Sep 29, 2025

⚠️ No Changeset found

Latest commit: 553aa58

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR


coderabbitai bot commented Sep 29, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a concept dictionary JSON, a dictionary-based concept-matching module with tests, re-exports it from the package index, and integrates dictionary-driven matching into evaluate's table/column mapping steps before existing similarity logic.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Concept dictionary data**<br>`frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json` | New JSON file defining concepts (id + aliases) used for dictionary-driven concept mapping. |
| **Dictionary matching implementation**<br>`frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts`, `.../dictionaryMatch/index.ts` | New module: builds an in-memory ConceptIndex (alias→concept, concept→aliases), normalizes/tokenizes aliases, resolves concepts (direct alias or token/Jaccard inference), and exports buildConceptIndex and dictionaryMatch, which mutates provided mappings by matching references→predictions by shared concept. |
| **Tests for dictionary matching**<br>`frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts` | New vitest suite validating positive concept matches and a negative case for non-matching names. |
| **Evaluation integration**<br>`frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts` | Calls dictionaryMatch(...) at the start of createTableMapping and createColumnMapping to seed mappings before existing name-similarity and overlap logic. |
| **Public re-exports**<br>`frontend/internal-packages/schema-bench/src/index.ts` | Re-exports the dictionaryMatch API via `export * from './dictionaryMatch'`. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Eval as evaluate.ts
  participant Dict as dictionaryMatch
  participant IDX as ConceptIndex
  participant File as global.concepts.json

  Eval->>Dict: dictionaryMatch(references, predictions, mapping)
  alt index absent
    Dict->>File: load JSON
    Dict->>IDX: buildConceptIndex(dict)
  else index provided
    Note over Dict,IDX: use provided index
  end
  loop for each reference
    Dict->>IDX: conceptOf(reference)
    alt concept found
      loop over predictions
        Dict->>IDX: conceptOf(prediction)
        alt same concept & prediction unused
          Dict-->>Eval: mapping[reference]=prediction (mutate)
        end
      end
    end
  end
  Eval->>Eval: continue with name-similarity & overlap matching
sequenceDiagram
  autonumber
  participant Caller as conceptOf(name)
  participant IDX as ConceptIndex
  participant Infer as TokenInference

  Caller->>IDX: aliasToConcept lookup
  alt direct alias hit
    IDX-->>Caller: return conceptId
  else no direct hit
    Caller->>Infer: tokenize & Jaccard inference
    Infer-->>Caller: inferred conceptId or null
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

Review effort 4/5

Suggested reviewers

  • FunamaYukina
  • junkisai
  • MH4GF

Poem

Thump-thump, I hopped through JSON rows,
Aliases twined where bright token wind blows.
Concepts found, the index hums so spry,
Mappings stitched beneath a midnight byte—
I munch a bug and scamper on by! 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.22%, which is below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title clearly summarizes the main change by indicating that dictionary-based concept matching is being added to the schema-bench evaluation, and it follows a concise conventional-commit style with scope notation. |
| Description Check | ✅ Passed | The pull request description includes the required "## Issue" section with a valid resolve link and the "## Why is this change needed?" section with a clear rationale and supporting data, fulfilling the repository's minimal template requirements while also providing additional context on features, implementation details, and test coverage. |
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/schema-bench-dictionary-matching

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

🧪 Early access (Sonnet 4.5): enabled

We are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience.

Note:

  • Public repositories are always opted into early access features.
  • You can enable or disable early access features from the CodeRabbit UI or by updating the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.


supabase bot commented Sep 29, 2025

Updates to Preview Branch (feature/schema-bench-dictionary-matching) ↗︎

Deployments Status Updated
Database Tue, 30 Sep 2025 00:53:28 UTC
Services Tue, 30 Sep 2025 00:53:28 UTC
APIs Tue, 30 Sep 2025 00:53:28 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks Status Updated
Configurations Tue, 30 Sep 2025 00:53:29 UTC
Migrations Tue, 30 Sep 2025 00:53:29 UTC
Seeding Tue, 30 Sep 2025 00:53:29 UTC
Edge Functions Tue, 30 Sep 2025 00:53:29 UTC

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.


@coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (1)

48-53: Guard against existing mappings in nameSimilarity
In frontend/internal-packages/schema-bench/src/nameSimilarity/nameSimilarity.ts, wrap each reference match in a check (e.g. if (!(referenceName in mapping)) or via shouldSkipReference) before assigning to mapping, so pre-populated entries aren’t overwritten.
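A minimal sketch of the suggested guard follows. The surrounding loop and the `applyNameSimilarity` name are illustrative, not the actual nameSimilarity.ts code; only the `referenceName in mapping` check mirrors the suggestion.

```typescript
// Illustrative guard: skip references that an earlier pass (e.g.
// dictionaryMatch) already mapped, so similarity matching never
// overwrites pre-populated entries.
const applyNameSimilarity = (
  referenceNames: string[],
  predictNames: string[],
  mapping: Record<string, string>,
  similarity: (a: string, b: string) => number,
  threshold = 0.8,
): void => {
  for (const referenceName of referenceNames) {
    if (referenceName in mapping) continue // already seeded; do not overwrite
    let best: string | null = null
    let bestScore = 0
    for (const predictName of predictNames) {
      const score = similarity(referenceName, predictName)
      if (score > bestScore) {
        best = predictName
        bestScore = score
      }
    }
    if (best !== null && bestScore >= threshold) mapping[referenceName] = best
  }
}
```

With the guard in place, a pre-seeded entry such as `users → app_users` survives even when an exact-name candidate `users` is among the predictions.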

🧹 Nitpick comments (15)
frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json (2)

8-14: Add singular aliases to avoid missed matches

Include "order_detail" and "user_profile" in their respective alias lists so singular table/column names resolve to the same concept.

   {
     "id": "order_detail",
-    "aliases": ["order_details", "order_items", "purchase_items"]
+    "aliases": ["order_detail", "order_details", "order_items", "purchase_items"]
   },
   {
     "id": "user_profile",
-    "aliases": ["user_profiles", "account_info"]
+    "aliases": ["user_profile", "user_profiles", "account_info"]
   }

4-6: Consider broader customer synonyms (optional)

Optionally add "client_info" as an alias for customer_info to catch that common variant. Keep an eye on false positives.

frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts (1)

1-1: Align import path style with package convention (optional)

For consistency with nearby files that include .ts extensions, consider:

-export * from './dictionaryMatch'
+export * from './dictionaryMatch.ts'

Based on learnings (CommonJS mode is used in this package).

frontend/internal-packages/schema-bench/src/index.ts (1)

1-1: Public re-export looks good; consider consistent path style (optional)

LGTM functionally. If you standardize on extensioned paths in this package, mirror it here.

-export * from './dictionaryMatch'
+export * from './dictionaryMatch/index.ts'
frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (3)

17-17: Import buildConceptIndex and reuse a shared index

Build the concept index once to avoid reloading per call and pass it to both dictionaryMatch invocations.

-import { dictionaryMatch } from '../dictionaryMatch/dictionaryMatch.ts'
+import { dictionaryMatch, buildConceptIndex } from '../dictionaryMatch/dictionaryMatch.ts'

Add near the top of the module (outside functions):

// Build once per process; cheap and deterministic
const DEFAULT_CONCEPT_INDEX = buildConceptIndex()

46-47: Pass the shared concept index to table matching

Prevents redundant dictionary loads and ensures identical token filters across phases.

-  dictionaryMatch(referenceTableNames, predictTableNames, tableMapping)
+  dictionaryMatch(
+    referenceTableNames,
+    predictTableNames,
+    tableMapping,
+    undefined,
+    DEFAULT_CONCEPT_INDEX,
+  )

87-88: Pass the shared concept index to column matching

Same rationale as tables.

-  dictionaryMatch(referenceColumnNames, predictColumnNames, columnMapping)
+  dictionaryMatch(
+    referenceColumnNames,
+    predictColumnNames,
+    columnMapping,
+    undefined,
+    DEFAULT_CONCEPT_INDEX,
+  )
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts (2)

2-2: Align import style (optional)

Match evaluate.ts’ style by adding the .ts extension (or switch evaluate.ts to use the barrel). Either is fine—pick one and stick to it.

-import { buildConceptIndex, dictionaryMatch } from './dictionaryMatch'
+import { buildConceptIndex, dictionaryMatch } from './dictionaryMatch.ts'

14-21: Add a tie-breaker test for Jaccard preference (optional)

Add a case with multiple candidates sharing the same concept to prove the “prefer higher token overlap” rule.

Example:

it('prefers higher token overlap among same-concept candidates', () => {
  const reference = ['order_details']                // tokens: order, details
  const predict = ['purchase_items', 'order_items']  // both same concept; 'order_items' overlaps 'order'
  const mapping: Record<string, string> = {}
  const index = buildConceptIndex()
  dictionaryMatch(reference, predict, mapping, undefined, index)
  expect(mapping).toEqual({ order_details: 'order_items' })
})

Also applies to: 23-30

frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (6)

21-24: Export types used in the public API (avoid inlined anonymous types in .d.ts).

buildConceptIndex(dicts?: ConceptDictFile[]) exposes ConceptDictFile but Concept/ConceptDictFile aren’t exported. Export them to keep consumer typings clean and discoverable. Based on coding guidelines.

-type Concept = {
+export type Concept = {
   id: string
   aliases: string[]
   scope?: string[]
 }
 
-type ConceptDictFile = {
+export type ConceptDictFile = {
   concepts: Concept[]
   generic_tokens?: string[]
 }

Also applies to: 74-99


63-71: Cache the default dictionary to avoid repeated sync disk IO.

buildConceptIndex() may be called multiple times (tables, then columns). Memoize the default dict to prevent repeated fs.readFileSync.

-function loadDefaultConceptDict(): ConceptDictFile | null {
+let _cachedDefaultDict: ConceptDictFile | null | undefined
+function loadDefaultConceptDict(): ConceptDictFile | null {
+  if (_cachedDefaultDict !== undefined) return _cachedDefaultDict
   // Resolve relative to this module file so tests/CLI can locate it reliably
   const __dirnameLocal = fileURLToPath(new URL('.', import.meta.url))
   const dictPath = path.resolve(
     __dirnameLocal,
     '../dictionaries/global.concepts.json',
   )
-  return readJson(dictPath)
+  _cachedDefaultDict = readJson(dictPath)
+  return _cachedDefaultDict
}

Also applies to: 79-86


32-38: Improve CamelCase/acronym splitting in normalizeAlias.

The current regex doesn’t split acronyms well (e.g., HTTPServer → httpserver). Consider a more robust split or reuse a well‑known utility (e.g., lodash.snakeCase) to reduce false negatives in token matching.

-const camelSplit = s.replace(/([a-z0-9])([A-Z])/g, '$1_$2')
+const camelSplit = s
+  // split lower→Upper and digit→Upper
+  .replace(/([a-z0-9])([A-Z])/g, '$1_$2')
+  // split acronym→Word boundary (HTTPServer → HTTP_Server)
+  .replace(/([A-Z]+)([A-Z][a-z])/g, '$1_$2')

106-124: Precompute alias token sets in the index to speed up inference.

inferConceptByTokens recomputes toTokens for every alias on every call. For larger dictionaries this becomes hot. Store pre-tokenized alias sets per concept in ConceptIndex at build time.

  • Extend ConceptIndex with conceptToAliasTokenSets: Map<string, Set[]>.
  • Populate in buildConceptIndex.
  • Use precomputed sets in inferConceptByTokens.
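Under the bullets above, the precomputation could be sketched as follows. The `conceptToAliasTokenSets` name comes from the suggestion; the rest is hypothetical.

```typescript
// Hypothetical sketch: tokenize every alias once at index build time so
// inference only computes Jaccard over cached Sets instead of re-tokenizing
// each alias on every lookup.
type ConceptIndex = {
  aliasToConcept: Map<string, string>
  conceptToAliasTokenSets: Map<string, Set<string>[]>
}

const toTokens = (s: string): Set<string> =>
  new Set(s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean))

const buildIndex = (
  concepts: { id: string; aliases: string[] }[],
): ConceptIndex => {
  const aliasToConcept = new Map<string, string>()
  const conceptToAliasTokenSets = new Map<string, Set<string>[]>()
  for (const c of concepts) {
    const tokenSets: Set<string>[] = []
    for (const alias of c.aliases) {
      aliasToConcept.set(alias, c.id)
      tokenSets.push(toTokens(alias)) // computed once, reused per lookup
    }
    conceptToAliasTokenSets.set(c.id, tokenSets)
  }
  return { aliasToConcept, conceptToAliasTokenSets }
}
```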

140-182: Matching strategy could be upgraded to per‑concept maximum matching.

Within a concept group you greedily pick the best candidate per reference. Order of references can affect outcomes. If mismatches are observed, consider a per‑concept maximum‑weight matching (Hungarian/assignment) using Jaccard scores as weights.


32-58: Style: prefer const arrow functions for small utilities.

For TS utilities, our guideline prefers const arrows. Optional, but keeps style consistent in this package. Based on coding guidelines.

-function normalizeAlias(s: string): string {
+const normalizeAlias = (s: string): string => {
   ...
-}
+}
 
-function toTokens(s: string): string[] {
+const toTokens = (s: string): string[] => {
   ...
-}
+}
 
-function jaccard(a: Set<string>, b: Set<string>): number {
+const jaccard = (a: Set<string>, b: Set<string>): number => {
   ...
-}
+}
 
-function readJson(filePath: string): ConceptDictFile | null {
+const readJson = (filePath: string): ConceptDictFile | null => {
   ...
-}
+}
 
-function loadDefaultConceptDict(): ConceptDictFile | null {
+const loadDefaultConceptDict = (): ConceptDictFile | null => {
   ...
-}
+}
 
-function inferConceptByTokens(...): string | null {
+const inferConceptByTokens = (...): string | null => {
   ...
-}
+}

Also applies to: 60-71, 106-134

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d01d3c and 0b7401e.

📒 Files selected for processing (6)
  • frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json (1 hunks)
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts (1 hunks)
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1 hunks)
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts (1 hunks)
  • frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (3 hunks)
  • frontend/internal-packages/schema-bench/src/index.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

Name utility files in camelCase (e.g., mergeSchema.ts)

Files:

  • frontend/internal-packages/schema-bench/src/index.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
  • frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use TypeScript/TSX across the codebase

**/*.{ts,tsx}: Prefer early returns for readability
Use named exports only (no default exports)
Prefer const arrow functions over function declarations for simple utilities (e.g., const toggle = () => {})

Files:

  • frontend/internal-packages/schema-bench/src/index.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
  • frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/internal-packages/**

📄 CodeRabbit inference engine (AGENTS.md)

Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages

Files:

  • frontend/internal-packages/schema-bench/src/index.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
  • frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts
  • frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Follow existing import patterns and tsconfig path aliases

Files:

  • frontend/internal-packages/schema-bench/src/index.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
  • frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Write unit tests with filenames ending in .test.ts or .test.tsx colocated near source

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
🧠 Learnings (1)
📚 Learning: 2025-07-30T05:52:56.270Z
Learnt from: hoshinotsuyoshi
PR: liam-hq/liam#2771
File: frontend/internal-packages/schema-bench/src/cli/executeLiamDb.ts:22-22
Timestamp: 2025-07-30T05:52:56.270Z
Learning: The schema-bench package (frontend/internal-packages/schema-bench) has been converted from ESM to CommonJS mode by removing "type": "module" from package.json, making __dirname available and correct to use in TypeScript files within this package.

Applied to files:

  • frontend/internal-packages/schema-bench/src/index.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
🧬 Code graph analysis (2)
frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (1)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1)
  • dictionaryMatch (141-182)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts (1)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (2)
  • buildConceptIndex (74-99)
  • dictionaryMatch (141-182)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Supabase Preview
  • GitHub Check: Supabase Preview
  • GitHub Check: frontend-ci
  • GitHub Check: Supabase Preview
🔇 Additional comments (1)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1)

9-12: Ensure JSON asset is bundled and importable

  • No resolveJsonModule in tsconfig → you can’t import the JSON directly in a bundler-friendly way
  • No dist folder or asset‐copy config found → the JSON won’t be next to compiled code at runtime
    Add one of:
  • Enable resolveJsonModule and switch to a import globalConcepts from '../dictionaries/global.concepts.json'
  • Update your build config (or npm pack “files” list) to copy global.concepts.json into dist (and the published package)

@NoritakaIkeda (Member, Author) commented:

@codex review


Codex Review: Didn't find any major issues. Keep them coming!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

…siness entities

Add new concept mappings for customer account management and data privacy domains:
- customer_account: Support for customer/account entity variations
- customer_pii: Personal information handling concepts
- closure_reason: Account closure and withdrawal reason classifications
- account_closure: Account closure and withdrawal event tracking
- data_erasure_request: GDPR/privacy compliance data deletion workflows
- audit_log: General audit logging concepts

This enhances dictionary-based matching for schema evaluation across different business contexts.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

@coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b7401e and adc32aa.

📒 Files selected for processing (1)
  • frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
frontend/internal-packages/**

📄 CodeRabbit inference engine (AGENTS.md)

Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages

Files:

  • frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: CodeQL
  • GitHub Check: frontend-ci
  • GitHub Check: frontend-lint
  • GitHub Check: codeql / languages (javascript) / Perform CodeQL for javascript

…ng and warnings

- Add warning when no concept dictionaries are loaded to improve debugging visibility
- Handle duplicate aliases gracefully by keeping first mapping and logging conflicts
- Prevent silent failures when dictionary-based matching is disabled
- Improve developer experience with clear diagnostic messages

This ensures dictionary matching behavior is transparent and debuggable when issues occur.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

@coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between adc32aa and 549f665.

📒 Files selected for processing (1)
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

Name utility files in camelCase (e.g., mergeSchema.ts)

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use TypeScript/TSX across the codebase

**/*.{ts,tsx}: Prefer early returns for readability
Use named exports only (no default exports)
Prefer const arrow functions over function declarations for simple utilities (e.g., const toggle = () => {})

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/internal-packages/**

📄 CodeRabbit inference engine (AGENTS.md)

Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Follow existing import patterns and tsconfig path aliases

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: frontend-lint
  • GitHub Check: frontend-ci

Comment on lines 99 to 101
const normAliases = c.aliases.map((a) => normalizeAlias(a))
conceptToAliases.set(c.id, normAliases)
for (const a of normAliases) {

⚠️ Potential issue | 🟠 Major

Merge aliases instead of overwriting per concept.

Line 100 replaces any previously registered aliases for the same concept ID. When consumers supply an additional dictionary to extend an existing concept (e.g., adding more synonyms on top of the shipped global file), the earlier alias tokens vanish from conceptToAliases. As soon as that happens, token-based inference (and any heuristics depending on those tokens) can no longer recover the original matches, so valid reference/candidate pairs start failing silently. Please merge with the existing list instead of overwriting.

       const normAliases = c.aliases.map((a) => normalizeAlias(a))
-      conceptToAliases.set(c.id, normAliases)
+      const existing = conceptToAliases.get(c.id)
+      const merged = existing
+        ? Array.from(new Set([...existing, ...normAliases]))
+        : normAliases
+      conceptToAliases.set(c.id, merged)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const normAliases = c.aliases.map((a) => normalizeAlias(a))
conceptToAliases.set(c.id, normAliases)
for (const a of normAliases) {
const normAliases = c.aliases.map((a) => normalizeAlias(a))
const existing = conceptToAliases.get(c.id)
const merged = existing
? Array.from(new Set([...existing, ...normAliases]))
: normAliases
conceptToAliases.set(c.id, merged)
for (const a of normAliases) {
🤖 Prompt for AI Agents
In
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
around lines 99 to 101, the code currently assigns conceptToAliases.set(c.id,
normAliases) which overwrites any previously registered aliases for the same
concept ID; change this to merge with existing aliases instead of replacing them
by fetching the current alias array (if any), concatenating the new normalized
aliases, deduplicating (preserve order or use a Set), and then storing the
merged list back into conceptToAliases so previously supplied dictionaries that
extend a concept do not lose earlier tokens.
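The fix can be exercised in isolation. The sketch below uses a hypothetical `registerAliases` helper (not the PR's actual function, which lives inside `buildConceptIndex`) to show that aliases from an earlier dictionary survive a later registration for the same concept ID:

```typescript
// Standalone sketch of merge-instead-of-overwrite alias registration.
// registerAliases is a hypothetical name for illustration only.
const conceptToAliases = new Map<string, string[]>()

const registerAliases = (conceptId: string, aliases: string[]): void => {
  const existing = conceptToAliases.get(conceptId)
  // Merge and deduplicate so earlier dictionaries keep their tokens.
  const merged = existing
    ? Array.from(new Set([...existing, ...aliases]))
    : aliases
  conceptToAliases.set(conceptId, merged)
}

// The shipped global dictionary registers first...
registerAliases('customer', ['customer_info'])
// ...then a consumer-supplied dictionary extends the same concept.
registerAliases('customer', ['client_data', 'customer_info'])
```

With the overwrite behavior, the second call would drop `customer_info` whenever the consumer's dictionary omitted it; the merge keeps both alias lists intact.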

…chitecture

- Simplify buildConceptIndex to use only default dictionary loading
- Update function signatures and variable names for better clarity
- Add comprehensive error handling and warnings for dictionary operations
- Update test file to match new function signatures
- Clean up code structure for better maintainability

This refactoring improves the dictionary matching system's reliability and makes the codebase more maintainable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Rename function from conceptOf to resolveConceptId for better clarity
- Update parameter names: references → referenceNames, candidates → predictNames for precision
- Add closure_reason alias to improve concept matching coverage
- Improve variable naming throughout dictionaryMatch function for better readability

This refactoring enhances code clarity while maintaining the same functionality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds dictionary-based concept matching to improve schema evaluation accuracy by recognizing semantically equivalent table/column names that use different naming conventions but represent the same concepts.

  • Implements dictionary-based matching using pre-registered concept aliases as the first matching strategy
  • Adds token-based fallback using Jaccard similarity when direct aliases aren't found
  • Includes a global concept dictionary with common database schema patterns
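As a rough illustration of the token-based fallback in the second bullet, the `toTokens` and `jaccard` helpers can be sketched as follows; these are simplified stand-ins for the PR's implementations, not the shipped code:

```typescript
// Simplified stand-ins for the PR's toTokens/jaccard helpers.
const toTokens = (name: string): string[] =>
  name
    .replace(/([a-z])([A-Z])/g, '$1_$2') // split CamelCase boundaries
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0)

const jaccard = (a: Set<string>, b: Set<string>): number => {
  if (a.size === 0 && b.size === 0) return 0
  let intersection = 0
  for (const token of a) if (b.has(token)) intersection++
  // |A ∩ B| / |A ∪ B|
  return intersection / (a.size + b.size - intersection)
}

const score = jaccard(
  new Set(toTokens('customerInfo')),
  new Set(toTokens('customer_info')),
)
// identical token sets, so the score is 1
```

With the default 0.8 threshold described in the PR, `customerInfo` and `customer_info` match, while unrelated names with disjoint tokens score 0 and fall through.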

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/index.ts: Exports the new dictionaryMatch module
  • src/evaluate/evaluate.ts: Integrates dictionary matching into table and column mapping pipelines
  • src/dictionaryMatch/index.ts: Module export file for dictionary matching functionality
  • src/dictionaryMatch/dictionaryMatch.ts: Core implementation of dictionary-based concept matching
  • src/dictionaryMatch/dictionaryMatch.test.ts: Test suite covering concept matching scenarios
  • src/dictionaries/global.concepts.json: Default concept dictionary with common schema patterns


Comment on lines +167 to +211
// biome-ignore lint/complexity/noExcessiveCognitiveComplexity: TODO: Refactor to reduce complexity
export function dictionaryMatch(
referenceNames: string[],
predictNames: string[],
mapping: Mapping,
): void {
const localIndex = buildConceptIndex()

// Track used candidates to avoid duplicates
const usedPredictNames = new Set(Object.values(mapping))

// Precompute candidate concepts
const predictConceptIdByName = new Map<string, string | null>()
for (const predictName of predictNames) {
predictConceptIdByName.set(
predictName,
resolveConceptId(predictName, localIndex),
)
}

for (const referenceName of referenceNames) {
if (mapping[referenceName] !== undefined) continue
const referenceConceptId = resolveConceptId(referenceName, localIndex)
if (!referenceConceptId) continue

// Choose the first unused candidate with the same concept; if multiple, prefer higher token overlap
let bestPredict: { name: string; score: number } | null = null
for (const predictName of predictNames) {
if (usedPredictNames.has(predictName)) continue
const predictConceptId = predictConceptIdByName.get(predictName)
if (!predictConceptId || predictConceptId !== referenceConceptId) continue
// Score by token Jaccard
const similarityScore = jaccard(
new Set(toTokens(referenceName)),
new Set(toTokens(predictName)),
)
if (!bestPredict || similarityScore > bestPredict.score)
bestPredict = { name: predictName, score: similarityScore }
}
if (bestPredict) {
mapping[referenceName] = bestPredict.name
usedPredictNames.add(bestPredict.name)
}
}
}
Copilot AI Sep 30, 2025

The TODO comment indicates acknowledged technical debt. Consider breaking down this function into smaller, more focused functions to improve maintainability and readability.

Suggested change
// Refactored to reduce complexity by extracting helper functions
export function dictionaryMatch(
referenceNames: string[],
predictNames: string[],
mapping: Mapping,
): void {
const localIndex = buildConceptIndex()
const usedPredictNames = new Set(Object.values(mapping))
const predictConceptIdByName = precomputePredictConceptIds(predictNames, localIndex)
for (const referenceName of referenceNames) {
if (mapping[referenceName] !== undefined) continue
const referenceConceptId = resolveConceptId(referenceName, localIndex)
if (!referenceConceptId) continue
const bestPredict = findBestPredict(
referenceName,
referenceConceptId,
predictNames,
predictConceptIdByName,
usedPredictNames,
)
if (bestPredict) {
mapping[referenceName] = bestPredict
usedPredictNames.add(bestPredict)
}
}
}
/**
* Precompute concept IDs for each predicted name.
*/
function precomputePredictConceptIds(
predictNames: string[],
index: ConceptIndex,
): Map<string, string | null> {
const predictConceptIdByName = new Map<string, string | null>()
for (const predictName of predictNames) {
predictConceptIdByName.set(
predictName,
resolveConceptId(predictName, index),
)
}
return predictConceptIdByName
}
/**
* Find the best unused predicted name for a reference, matching the concept ID and maximizing token overlap.
*/
function findBestPredict(
referenceName: string,
referenceConceptId: string,
predictNames: string[],
predictConceptIdByName: Map<string, string | null>,
usedPredictNames: Set<string>,
): string | null {
let bestPredict: { name: string; score: number } | null = null
for (const predictName of predictNames) {
if (usedPredictNames.has(predictName)) continue
const predictConceptId = predictConceptIdByName.get(predictName)
if (!predictConceptId || predictConceptId !== referenceConceptId) continue
const similarityScore = jaccard(
new Set(toTokens(referenceName)),
new Set(toTokens(predictName)),
)
if (!bestPredict || similarityScore > bestPredict.score)
bestPredict = { name: predictName, score: similarityScore }
}
return bestPredict ? bestPredict.name : null
}


Comment on lines +101 to +103
console.warn(
'[schema-bench] No concept dictionaries loaded; dictionaryMatch will be a no-op.',
)
Copilot AI Sep 30, 2025

[nitpick] Using console.warn for library warnings may not be appropriate in all environments. Consider using a proper logging framework or making the warning behavior configurable.
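One way to make the warning behavior configurable, as this nitpick suggests, is an injectable logger; this is a minimal sketch, and the `Logger` type and `setLogger` helper are assumed names rather than anything in the PR:

```typescript
// Hypothetical injectable-logger sketch; Logger and setLogger are
// illustration-only names, not part of the PR's API.
type Logger = (message: string) => void

// Default behavior matches the current code: write to console.warn.
let warn: Logger = (message) => console.warn(message)

// Consumers can silence or redirect warnings in their environment.
const setLogger = (logger: Logger): void => {
  warn = logger
}

// Library code then calls warn(...) instead of console.warn directly.
const captured: string[] = []
setLogger((message) => captured.push(message))
warn('[schema-bench] No concept dictionaries loaded; dictionaryMatch will be a no-op.')
```

Here a test harness (or a server environment without a useful console) captures the message instead of printing it, without the library taking a dependency on a logging framework.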


Comment on lines +116 to +118
console.warn(
`[schema-bench] Duplicate alias "${normalizedAlias}" for concepts "${existingConceptId}" and "${concept.id}". Keeping "${existingConceptId}".`,
)
Copilot AI Sep 30, 2025

[nitpick] Using console.warn for duplicate alias warnings may not be appropriate in all environments. Consider using a proper logging framework or making the warning behavior configurable.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (3)

68-76: Consider logging errors in readJsonFile for debuggability.

The function silently returns null on any error, which may hide IO issues, permission problems, or malformed JSON. While the calling code handles the null case, logging the error would help troubleshoot deployment or configuration issues.

 function readJsonFile(filePath: string): ConceptDictionaryFile | null {
   try {
     const raw = fs.readFileSync(filePath, 'utf8')
     const parsed = JSON.parse(raw)
     return parsed
-  } catch {
+  } catch (error) {
+    console.warn(`[schema-bench] Failed to read dictionary from ${filePath}:`, error)
     return null
   }
 }

129-145: Consider making the Jaccard threshold configurable.

The 0.8 threshold is hardcoded on line 143. Different use cases may benefit from tuning this parameter. Consider accepting it as an optional parameter to dictionaryMatch with 0.8 as the default.

 function inferConceptByTokens(
   name: string,
   index: ConceptIndex,
+  threshold = 0.8,
 ): string | null {
   const tokens = toTokens(name)
   const tokenSet = new Set(tokens)
   let best: { id: string; score: number } | null = null
   for (const [id, aliases] of index.conceptToAliases) {
     for (const alias of aliases) {
       const aliasTokens = toTokens(alias)
       const score = jaccard(tokenSet, new Set(aliasTokens))
       if (!best || score > best.score) best = { id, score }
     }
   }
-  const threshold = 0.8
   return best && best.score >= threshold ? best.id : null
 }

193-205: Optional: Extract candidate selection logic to reduce complexity.

The nested loop at lines 193-205 could be extracted into a helper function to address the cognitive complexity warning and improve testability.

function findBestMatchingPredict(
  referenceName: string,
  referenceConceptId: string,
  predictNames: string[],
  usedPredictNames: Set<string>,
  predictConceptIdByName: Map<string, string | null>,
): string | null {
  let bestPredict: { name: string; score: number } | null = null
  for (const predictName of predictNames) {
    if (usedPredictNames.has(predictName)) continue
    const predictConceptId = predictConceptIdByName.get(predictName)
    if (!predictConceptId || predictConceptId !== referenceConceptId) continue
    const similarityScore = jaccard(
      new Set(toTokens(referenceName)),
      new Set(toTokens(predictName)),
    )
    if (!bestPredict || similarityScore > bestPredict.score)
      bestPredict = { name: predictName, score: similarityScore }
  }
  return bestPredict?.name ?? null
}

Then simplify the main loop:

     // Choose the first unused candidate with the same concept; if multiple, prefer higher token overlap
-    let bestPredict: { name: string; score: number } | null = null
-    for (const predictName of predictNames) {
-      if (usedPredictNames.has(predictName)) continue
-      const predictConceptId = predictConceptIdByName.get(predictName)
-      if (!predictConceptId || predictConceptId !== referenceConceptId) continue
-      // Score by token Jaccard
-      const similarityScore = jaccard(
-        new Set(toTokens(referenceName)),
-        new Set(toTokens(predictName)),
-      )
-      if (!bestPredict || similarityScore > bestPredict.score)
-        bestPredict = { name: predictName, score: similarityScore }
-    }
-    if (bestPredict) {
-      mapping[referenceName] = bestPredict.name
-      usedPredictNames.add(bestPredict.name)
+    const bestPredictName = findBestMatchingPredict(
+      referenceName,
+      referenceConceptId,
+      predictNames,
+      usedPredictNames,
+      predictConceptIdByName,
+    )
+    if (bestPredictName) {
+      mapping[referenceName] = bestPredictName
+      usedPredictNames.add(bestPredictName)
     }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed15b73 and 553aa58.

📒 Files selected for processing (2)
  • frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json (1 hunks)
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json
🧰 Additional context used
📓 Path-based instructions (4)
**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

Name utility files in camelCase (e.g., mergeSchema.ts)

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use TypeScript/TSX across the codebase

**/*.{ts,tsx}: Prefer early returns for readability
Use named exports only (no default exports)
Prefer const arrow functions over function declarations for simple utilities (e.g., const toggle = () => {})

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/internal-packages/**

📄 CodeRabbit inference engine (AGENTS.md)

Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Follow existing import patterns and tsconfig path aliases

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Supabase Preview
  • GitHub Check: frontend-ci
  • GitHub Check: frontend-lint
  • GitHub Check: codeql / languages (javascript) / Perform CodeQL for javascript
  • GitHub Check: Supabase Preview
🔇 Additional comments (2)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (2)

13-48: LGTM! Type definitions are clear and well-documented.

The type hierarchy is appropriate for the dictionary-based matching system. The bidirectional index structure in ConceptIndex enables efficient lookups in both directions.


106-124: Merge aliases per concept ID instead of overwriting.

Line 111 still overwrites previously registered aliases for the same concept ID, despite the past review comment being marked as addressed. When multiple dictionaries extend the same concept with additional aliases, earlier aliases are lost, breaking token-based inference.

Apply this diff to merge aliases:

       const normalizedAliases = concept.aliases.map((alias) =>
         normalizeAlias(alias),
       )
-      conceptToAliases.set(concept.id, normalizedAliases)
+      const existingAliases = conceptToAliases.get(concept.id)
+      const mergedAliases = existingAliases
+        ? Array.from(new Set([...existingAliases, ...normalizedAliases]))
+        : normalizedAliases
+      conceptToAliases.set(concept.id, mergedAliases)
       for (const normalizedAlias of normalizedAliases) {

Comment on lines +129 to +145
function inferConceptByTokens(
name: string,
index: ConceptIndex,
): string | null {
const tokens = toTokens(name)
const tokenSet = new Set(tokens)
let best: { id: string; score: number } | null = null
for (const [id, aliases] of index.conceptToAliases) {
for (const alias of aliases) {
const aliasTokens = toTokens(alias)
const score = jaccard(tokenSet, new Set(aliasTokens))
if (!best || score > best.score) best = { id, score }
}
}
const threshold = 0.8
return best && best.score >= threshold ? best.id : null
}
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify: Is generic token filtering intentionally omitted?

The PR objectives mention "Generic token filtering to remove common tokens (info, data, table, etc.)" as a core feature, but inferConceptByTokens uses all tokens from the input without filtering. Was this feature deferred, or should common tokens be filtered before computing Jaccard similarity?

Run this script to check if generic token filtering exists elsewhere:


🏁 Script executed:

#!/bin/bash
# Description: Search for generic token filtering implementation

# Search for filter/blacklist patterns related to common tokens
rg -n -C3 'info|data|table' --type ts -g 'schema-bench/**' | rg -i 'filter|blacklist|stopword|generic.*token'

# Search for any token filtering logic
ast-grep --pattern 'const $FILTER_TOKENS = [$$$]'

# Check if filtering is mentioned in tests
rg -n 'generic.*token|common.*token|filter.*token' --type ts -g '**/dictionaryMatch.test.ts'

Length of output: 69155


Implement generic token filtering or update PR objectives

Function inferConceptByTokens (frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts:129-145) currently includes all tokens when computing Jaccard similarity, without removing common tokens (e.g., info, data, table) as described in the PR objectives. Add filtering of these generic tokens prior to scoring, or remove “Generic token filtering” from the feature list.

🤖 Prompt for AI Agents
In
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
around lines 129 to 145, inferConceptByTokens currently uses all tokens for
Jaccard scoring but must filter out generic tokens (e.g., "info", "data",
"table") as stated in the PR objectives; implement this by defining a small
set/array of lowercased generic tokens, filter the tokenSet and each aliasTokens
array to remove any tokens in that set (normalize to lower case and strip
empties) before computing the jaccard score, then continue the existing
scoring/threshold logic (or, if you prefer not to implement filtering, remove
"Generic token filtering" from the feature list in the PR instead).

Comment on lines +168 to +211
export function dictionaryMatch(
referenceNames: string[],
predictNames: string[],
mapping: Mapping,
): void {
const localIndex = buildConceptIndex()

// Track used candidates to avoid duplicates
const usedPredictNames = new Set(Object.values(mapping))

// Precompute candidate concepts
const predictConceptIdByName = new Map<string, string | null>()
for (const predictName of predictNames) {
predictConceptIdByName.set(
predictName,
resolveConceptId(predictName, localIndex),
)
}

for (const referenceName of referenceNames) {
if (mapping[referenceName] !== undefined) continue
const referenceConceptId = resolveConceptId(referenceName, localIndex)
if (!referenceConceptId) continue

// Choose the first unused candidate with the same concept; if multiple, prefer higher token overlap
let bestPredict: { name: string; score: number } | null = null
for (const predictName of predictNames) {
if (usedPredictNames.has(predictName)) continue
const predictConceptId = predictConceptIdByName.get(predictName)
if (!predictConceptId || predictConceptId !== referenceConceptId) continue
// Score by token Jaccard
const similarityScore = jaccard(
new Set(toTokens(referenceName)),
new Set(toTokens(predictName)),
)
if (!bestPredict || similarityScore > bestPredict.score)
bestPredict = { name: predictName, score: similarityScore }
}
if (bestPredict) {
mapping[referenceName] = bestPredict.name
usedPredictNames.add(bestPredict.name)
}
}
}
Contributor

⚠️ Potential issue | 🟠 Major

Optimize: Cache or inject the concept index to avoid rebuilding.

Line 173 calls buildConceptIndex() on every invocation of dictionaryMatch, which rereads and reparses the dictionary file and rebuilds the maps. If this function is called multiple times during evaluation (e.g., for tables and columns separately), this becomes a significant performance bottleneck.

Consider one of these approaches:

Option 1: Module-level cache (simplest)

+let cachedIndex: ConceptIndex | null = null
+
 export function dictionaryMatch(
   referenceNames: string[],
   predictNames: string[],
   mapping: Mapping,
 ): void {
-  const localIndex = buildConceptIndex()
+  if (!cachedIndex) {
+    cachedIndex = buildConceptIndex()
+  }
+  const localIndex = cachedIndex

Option 2: Accept index as parameter (more flexible)

 export function dictionaryMatch(
   referenceNames: string[],
   predictNames: string[],
   mapping: Mapping,
+  index?: ConceptIndex,
 ): void {
-  const localIndex = buildConceptIndex()
+  const localIndex = index ?? buildConceptIndex()

Suggested change
// Add a one-time cache for the concept index
let cachedIndex: ConceptIndex | null = null
export function dictionaryMatch(
referenceNames: string[],
predictNames: string[],
mapping: Mapping,
): void {
// Reuse the cached index if available
if (!cachedIndex) {
cachedIndex = buildConceptIndex()
}
const localIndex = cachedIndex
// Track used candidates to avoid duplicates
const usedPredictNames = new Set(Object.values(mapping))
// Precompute candidate concepts
const predictConceptIdByName = new Map<string, string | null>()
for (const predictName of predictNames) {
predictConceptIdByName.set(
predictName,
resolveConceptId(predictName, localIndex),
)
}
for (const referenceName of referenceNames) {
if (mapping[referenceName] !== undefined) continue
const referenceConceptId = resolveConceptId(referenceName, localIndex)
if (!referenceConceptId) continue
// Choose the first unused candidate with the same concept; if multiple, prefer higher token overlap
let bestPredict: { name: string; score: number } | null = null
for (const predictName of predictNames) {
if (usedPredictNames.has(predictName)) continue
const predictConceptId = predictConceptIdByName.get(predictName)
if (!predictConceptId || predictConceptId !== referenceConceptId) continue
// Score by token Jaccard
const similarityScore = jaccard(
new Set(toTokens(referenceName)),
new Set(toTokens(predictName)),
)
if (!bestPredict || similarityScore > bestPredict.score)
bestPredict = { name: predictName, score: similarityScore }
}
if (bestPredict) {
mapping[referenceName] = bestPredict.name
usedPredictNames.add(bestPredict.name)
}
}
}

import { describe, expect, it } from 'vitest'
import { dictionaryMatch } from './dictionaryMatch'

describe('dictionaryMatch (concept-based)', () => {
Member

👍

@NoritakaIkeda
Member Author

NoritakaIkeda commented Oct 1, 2025

@MH4GF @junkisai @sasamuku
Someone started reviewing this PR, but it hasn’t been finished yet. I’d appreciate it if someone else could take over the review.🙏

@MH4GF
Member

MH4GF commented Oct 1, 2025

@NoritakaIkeda Sorry for the delay 🙏🏻 I'm reviewing now!

Member

@MH4GF MH4GF left a comment

Thanks!

@@ -1,3 +1,4 @@
export * from './dictionaryMatch'
Member

Is this export necessary?

): Promise<Mapping> => {
const columnMapping: Mapping = {}

// NOTE: Implement synonym matching if needed
Member

The synonym matching this comment refers to has now been implemented in this PR. Shouldn't the comment be deleted?

): Promise<Mapping> => {
const tableMapping: Mapping = {}

// NOTE: Implement synonym matching if needed
Member

The synonym matching this comment refers to has now been implemented in this PR. Shouldn't the comment be deleted?

* - References with no resolvable concept remain unmapped.
*/

// biome-ignore lint/complexity/noExcessiveCognitiveComplexity: TODO: Refactor to reduce complexity
Member

Ideally, I'd prefer not to have any new code created that violates the rules, but this time it's acceptable as is.

@NoritakaIkeda
Member Author

#3637 (comment)
#3637 (comment)
#3637 (comment)

I want to fix these three immediately, but since I need to merge this PR urgently, I'll merge it now. I'll address them in the next PR.

@NoritakaIkeda NoritakaIkeda added this pull request to the merge queue Oct 2, 2025
Merged via the queue into main with commit 9f36bde Oct 2, 2025
40 checks passed
@NoritakaIkeda NoritakaIkeda deleted the feature/schema-bench-dictionary-matching branch October 2, 2025 00:48