
Conversation


@NoritakaIkeda (Member) commented Sep 29, 2025

Issue

Why is this change needed?

This PR adds dictionary-based concept matching to the schema benchmarking evaluation system to improve accuracy when comparing predicted and reference schemas.

Without Dictionary

| Metric | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 | Average | Std. Dev. |
|---|---|---|---|---|---|---|---|
| tableRecall | 0.4286 | 0.4286 | 0.4286 | 0.5714 | 0.7143 | 0.5143 | 0.1210 |
| columnRecallAverage | 0.3095 | 0.2619 | 0.3333 | 0.4286 | 0.4667 | 0.3600 | 0.0790 |
| foreignKeyRecall | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |

With Dictionary

| Metric | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 | Average | Std. Dev. |
|---|---|---|---|---|---|---|---|
| tableRecall | 0.8571 | 0.8571 | 0.7143 | 0.8571 | 0.7143 | 0.8000 | 0.0730 |
| columnRecallAverage | 0.6548 | 0.6071 | 0.5833 | 0.6429 | 0.4667 | 0.5909 | 0.0700 |
| foreignKeyRecall | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |

What this PR includes:

Core Features

  • Dictionary-based matching: Uses pre-registered concept aliases (e.g., "customer_info" → ["customer_info", "client_data"]) for semantic matching
  • Token-based fallback: Implements Jaccard similarity with configurable threshold (default 0.8) when direct aliases aren't found
  • Generic token filtering: Filters out common generic tokens ("info", "data", "table", etc.) to improve matching quality
  • Global concept dictionary: Ships with common database schema patterns and concepts
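The two-stage strategy above (direct alias lookup, then Jaccard-based token inference) can be sketched end to end. This is a minimal, self-contained illustration: `resolveConcept`, the one-entry dictionary, and the generic-token list here are assumptions for the example, not the shipped implementation.

```typescript
// Sketch of dictionary-based matching with a token-based fallback.
// All names and the sample dictionary are illustrative, not the real code.
type Concept = { id: string; aliases: string[] }

const concepts: Concept[] = [
  { id: 'customer_info', aliases: ['customer_info', 'client_data'] },
]

// Generic tokens are filtered out before comparing token sets.
const GENERIC_TOKENS = new Set(['info', 'data', 'table'])

const tokenize = (name: string): Set<string> =>
  new Set(
    name
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((t) => t.length > 0 && !GENERIC_TOKENS.has(t)),
  )

// Set-based Jaccard similarity: |A ∩ B| / |A ∪ B|.
const jaccard = (a: Set<string>, b: Set<string>): number => {
  if (a.size === 0 && b.size === 0) return 0
  let shared = 0
  for (const t of a) if (b.has(t)) shared++
  return shared / (a.size + b.size - shared)
}

const resolveConcept = (name: string, threshold = 0.8): string | null => {
  // Stage 1: direct alias hit.
  for (const c of concepts) {
    if (c.aliases.includes(name)) return c.id
  }
  // Stage 2: token-based inference against each concept's aliases.
  const tokens = tokenize(name)
  for (const c of concepts) {
    for (const alias of c.aliases) {
      if (jaccard(tokens, tokenize(alias)) >= threshold) return c.id
    }
  }
  return null
}
```

With this sketch, `client_data` resolves via the direct alias hit, while `customer_data` falls through to token inference: the generic token `data` is filtered out, leaving `{customer}` on both sides and a Jaccard score of 1.0.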

Technical Implementation

  • Adds dictionaryMatch() function as the first matching strategy in the evaluation pipeline
  • Supports concept normalization with CamelCase splitting and canonicalization
  • Uses Set-based Jaccard similarity for efficient token overlap scoring
  • Modular design allows custom dictionaries and configurable thresholds
  • Seamless integration with existing table and column matching workflow
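As an example of the normalization step, CamelCase splitting plus canonicalization might look like the following. This is a hedged sketch; the package's actual regexes may differ.

```typescript
// Illustrative alias normalization: split CamelCase boundaries, unify
// separators to underscores, lowercase, and trim stray underscores, so
// "OrderDetails", "order-details", and "order_details" all canonicalize
// to the same string.
const normalizeAlias = (s: string): string =>
  s
    .replace(/([a-z0-9])([A-Z])/g, '$1_$2') // camelCase boundary -> underscore
    .replace(/[^a-zA-Z0-9]+/g, '_')         // any separator run -> underscore
    .toLowerCase()
    .replace(/^_+|_+$/g, '')                // trim leading/trailing underscores
```

Note that this simple boundary regex does not split acronym runs (e.g. `HTTPServer` becomes `httpserver`), a known limitation of this style of split.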

Files Added

  • src/dictionaryMatch/dictionaryMatch.ts - Core dictionary matching implementation
  • src/dictionaryMatch/dictionaryMatch.test.ts - Comprehensive test coverage
  • src/dictionaries/global.concepts.json - Default concept dictionary with common patterns
  • Updated evaluation pipeline to use dictionary matching before name similarity and word overlap
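For reference, the shape of a concept dictionary file can be expressed as a TypeScript type with an illustrative literal. The entries below are examples, not the shipped contents of global.concepts.json.

```typescript
// Illustrative shape of a concept dictionary file. The concrete entries
// here are examples only.
type ConceptDictFile = {
  concepts: { id: string; aliases: string[] }[]
  generic_tokens?: string[]
}

const exampleDict: ConceptDictFile = {
  concepts: [
    { id: 'customer_info', aliases: ['customer_info', 'client_data'] },
    { id: 'order_detail', aliases: ['order_details', 'order_items'] },
  ],
  generic_tokens: ['info', 'data', 'table'],
}
```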

Test Coverage

  • Tests for direct alias matching
  • Tests for token-based inference with various thresholds
  • Tests for generic token filtering behavior
  • Tests for integration with existing evaluation workflow

This enhancement improves schema evaluation accuracy by recognizing semantically equivalent table/column names that may use different naming conventions but represent the same concepts.

Summary by CodeRabbit

  • New Features

    • Added concept-based synonym matching using a shared concepts dictionary to improve table/column mapping.
    • Concept inference with token-similarity is applied early in evaluation to boost mapping accuracy.
    • Dictionary matching utilities are now part of the package public API.
    • Shipped default concept dictionary with expanded concept/alias mappings.
  • Tests

    • Added tests validating concept lookups and synonym matching, including positive and negative cases.

…luation

Implement semantic concept matching using a pre-registered dictionary of alias groups to improve schema benchmarking accuracy.

Features:
- Dictionary-based matching with concept aliases (e.g., "customer_info" → ["customer_info", "client_data"])
- Token-based fallback matching using Jaccard similarity with configurable threshold (default 0.8)
- Generic token filtering to improve matching quality (filters out "info", "data", "table", etc.)
- Global concept dictionary with common database schema patterns
- Seamless integration with existing table and column matching pipeline
- Comprehensive test coverage for all matching scenarios

Technical details:
- Adds dictionaryMatch() function as first matching strategy before name similarity and word overlap
- Supports concept normalization with CamelCase splitting and canonicalization
- Uses Set-based Jaccard similarity for token overlap scoring
- Modular design allows custom dictionaries and configurable thresholds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

vercel bot commented Sep 29, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| liam-app | Ready | Preview | Comment | Sep 30, 2025 0:56am |
| liam-assets | Ready | Preview | Comment | Sep 30, 2025 0:56am |
| liam-storybook | Ready | Preview | Comment | Sep 30, 2025 0:56am |

2 Skipped Deployments

| Project | Deployment | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| liam-docs | Ignored | Preview | — | Sep 30, 2025 0:56am |
| liam-erd-sample | Skipped | — | — | Sep 30, 2025 0:56am |


changeset-bot bot commented Sep 29, 2025

⚠️ No Changeset found

Latest commit: 553aa58

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR


coderabbitai bot commented Sep 29, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a concept dictionary JSON, a dictionary-based concept-matching module with tests, re-exports it from the package index, and integrates dictionary-driven matching into evaluate's table/column mapping steps before existing similarity logic.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Concept dictionary data**<br>`frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json` | New JSON file defining concepts (id + aliases) used for dictionary-driven concept mapping. |
| **Dictionary matching implementation**<br>`frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts`, `.../dictionaryMatch/index.ts` | New module: builds an in-memory ConceptIndex (alias→concept, concept→aliases), normalizes/tokenizes aliases, resolves concepts (direct alias or token/Jaccard inference), and exports buildConceptIndex and dictionaryMatch, which mutates provided mappings by matching references→predictions by shared concept. |
| **Tests for dictionary matching**<br>`frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts` | New vitest suite validating positive concept matches and a negative case for non-matching names. |
| **Evaluation integration**<br>`frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts` | Calls dictionaryMatch(...) at the start of createTableMapping and createColumnMapping to seed mappings before existing name-similarity and overlap logic. |
| **Public re-exports**<br>`frontend/internal-packages/schema-bench/src/index.ts` | Re-exports the dictionaryMatch API via `export * from './dictionaryMatch'`. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Eval as evaluate.ts
  participant Dict as dictionaryMatch
  participant IDX as ConceptIndex
  participant File as global.concepts.json

  Eval->>Dict: dictionaryMatch(references, predictions, mapping)
  alt index absent
    Dict->>File: load JSON
    Dict->>IDX: buildConceptIndex(dict)
  else index provided
    Note over Dict,IDX: use provided index
  end
  loop for each reference
    Dict->>IDX: conceptOf(reference)
    alt concept found
      loop over predictions
        Dict->>IDX: conceptOf(prediction)
        alt same concept & prediction unused
          Dict-->>Eval: mapping[reference]=prediction (mutate)
        end
      end
    end
  end
  Eval->>Eval: continue with name-similarity & overlap matching
sequenceDiagram
  autonumber
  participant Caller as conceptOf(name)
  participant IDX as ConceptIndex
  participant Infer as TokenInference

  Caller->>IDX: aliasToConcept lookup
  alt direct alias hit
    IDX-->>Caller: return conceptId
  else no direct hit
    Caller->>Infer: tokenize & Jaccard inference
    Infer-->>Caller: inferred conceptId or null
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

Review effort 4/5

Suggested reviewers

  • FunamaYukina
  • junkisai
  • MH4GF

Poem

Thump-thump, I hopped through JSON rows,
Aliases twined where bright token wind blows.
Concepts found, the index hums so spry,
Mappings stitched beneath a midnight byte—
I munch a bug and scamper on by! 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.22%, which is below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title clearly summarizes the main change by indicating that dictionary-based concept matching is being added to the schema-bench evaluation, and it follows a concise conventional-commit style with scope notation. |
| Description Check | ✅ Passed | The pull request description includes the required "## Issue" section with a valid resolve link and the "## Why is this change needed?" section with a clear rationale and supporting data, fulfilling the repository's minimal template requirements while also providing additional context on features, implementation details, and test coverage. |
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/schema-bench-dictionary-matching

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

🧪 Early access (Sonnet 4.5): enabled

We are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience.

Note:

  • Public repositories are always opted into early access features.
  • You can enable or disable early access features from the CodeRabbit UI or by updating the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.


supabase bot commented Sep 29, 2025

Updates to Preview Branch (feature/schema-bench-dictionary-matching) ↗︎

Deployments Status Updated
Database Tue, 30 Sep 2025 00:53:28 UTC
Services Tue, 30 Sep 2025 00:53:28 UTC
APIs Tue, 30 Sep 2025 00:53:28 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks Status Updated
Configurations Tue, 30 Sep 2025 00:53:29 UTC
Migrations Tue, 30 Sep 2025 00:53:29 UTC
Seeding Tue, 30 Sep 2025 00:53:29 UTC
Edge Functions Tue, 30 Sep 2025 00:53:29 UTC

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.


@coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (1)

48-53: Guard against existing mappings in nameSimilarity
In frontend/internal-packages/schema-bench/src/nameSimilarity/nameSimilarity.ts, wrap each reference match in a check (e.g. if (!(referenceName in mapping)) or via shouldSkipReference) before assigning to mapping, so pre-populated entries aren’t overwritten.
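A minimal sketch of the suggested guard follows. The surrounding loop and the `applyNameSimilarity` name are illustrative, not the actual nameSimilarity.ts code; only the `referenceName in mapping` check mirrors the suggestion.

```typescript
// Illustrative guard: skip references that an earlier pass (e.g.
// dictionaryMatch) already mapped, so similarity matching never
// overwrites pre-populated entries.
const applyNameSimilarity = (
  referenceNames: string[],
  predictNames: string[],
  mapping: Record<string, string>,
  similarity: (a: string, b: string) => number,
  threshold = 0.8,
): void => {
  for (const referenceName of referenceNames) {
    if (referenceName in mapping) continue // already seeded; do not overwrite
    let best: string | null = null
    let bestScore = 0
    for (const predictName of predictNames) {
      const score = similarity(referenceName, predictName)
      if (score > bestScore) {
        best = predictName
        bestScore = score
      }
    }
    if (best !== null && bestScore >= threshold) mapping[referenceName] = best
  }
}
```

With the guard in place, a pre-seeded entry such as `users → app_users` survives even when an exact-name candidate `users` is among the predictions.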

🧹 Nitpick comments (15)
frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json (2)

8-14: Add singular aliases to avoid missed matches

Include "order_detail" and "user_profile" in their respective alias lists so singular table/column names resolve to the same concept.

   {
     "id": "order_detail",
-    "aliases": ["order_details", "order_items", "purchase_items"]
+    "aliases": ["order_detail", "order_details", "order_items", "purchase_items"]
   },
   {
     "id": "user_profile",
-    "aliases": ["user_profiles", "account_info"]
+    "aliases": ["user_profile", "user_profiles", "account_info"]
   }

4-6: Consider broader customer synonyms (optional)

Optionally add "client_info" as an alias for customer_info to catch that common variant. Keep an eye on false positives.

frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts (1)

1-1: Align import path style with package convention (optional)

For consistency with nearby files that include .ts extensions, consider:

-export * from './dictionaryMatch'
+export * from './dictionaryMatch.ts'

Based on learnings (CommonJS mode is used in this package).

frontend/internal-packages/schema-bench/src/index.ts (1)

1-1: Public re-export looks good; consider consistent path style (optional)

LGTM functionally. If you standardize on extensioned paths in this package, mirror it here.

-export * from './dictionaryMatch'
+export * from './dictionaryMatch/index.ts'
frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (3)

17-17: Import buildConceptIndex and reuse a shared index

Build the concept index once to avoid reloading per call and pass it to both dictionaryMatch invocations.

-import { dictionaryMatch } from '../dictionaryMatch/dictionaryMatch.ts'
+import { dictionaryMatch, buildConceptIndex } from '../dictionaryMatch/dictionaryMatch.ts'

Add near the top of the module (outside functions):

// Build once per process; cheap and deterministic
const DEFAULT_CONCEPT_INDEX = buildConceptIndex()

46-47: Pass the shared concept index to table matching

Prevents redundant dictionary loads and ensures identical token filters across phases.

-  dictionaryMatch(referenceTableNames, predictTableNames, tableMapping)
+  dictionaryMatch(
+    referenceTableNames,
+    predictTableNames,
+    tableMapping,
+    undefined,
+    DEFAULT_CONCEPT_INDEX,
+  )

87-88: Pass the shared concept index to column matching

Same rationale as tables.

-  dictionaryMatch(referenceColumnNames, predictColumnNames, columnMapping)
+  dictionaryMatch(
+    referenceColumnNames,
+    predictColumnNames,
+    columnMapping,
+    undefined,
+    DEFAULT_CONCEPT_INDEX,
+  )
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts (2)

2-2: Align import style (optional)

Match evaluate.ts’ style by adding the .ts extension (or switch evaluate.ts to use the barrel). Either is fine—pick one and stick to it.

-import { buildConceptIndex, dictionaryMatch } from './dictionaryMatch'
+import { buildConceptIndex, dictionaryMatch } from './dictionaryMatch.ts'

14-21: Add a tie-breaker test for Jaccard preference (optional)

Add a case with multiple candidates sharing the same concept to prove the “prefer higher token overlap” rule.

Example:

it('prefers higher token overlap among same-concept candidates', () => {
  const reference = ['order_details']                // tokens: order, details
  const predict = ['purchase_items', 'order_items']  // both same concept; 'order_items' overlaps 'order'
  const mapping: Record<string, string> = {}
  const index = buildConceptIndex()
  dictionaryMatch(reference, predict, mapping, undefined, index)
  expect(mapping).toEqual({ order_details: 'order_items' })
})

Also applies to: 23-30

frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (6)

21-24: Export types used in the public API (avoid inlined anonymous types in .d.ts).

buildConceptIndex(dicts?: ConceptDictFile[]) exposes ConceptDictFile but Concept/ConceptDictFile aren’t exported. Export them to keep consumer typings clean and discoverable. Based on coding guidelines.

-type Concept = {
+export type Concept = {
   id: string
   aliases: string[]
   scope?: string[]
 }
 
-type ConceptDictFile = {
+export type ConceptDictFile = {
   concepts: Concept[]
   generic_tokens?: string[]
 }

Also applies to: 74-99


63-71: Cache the default dictionary to avoid repeated sync disk IO.

buildConceptIndex() may be called multiple times (tables, then columns). Memoize the default dict to prevent repeated fs.readFileSync.

-function loadDefaultConceptDict(): ConceptDictFile | null {
+let _cachedDefaultDict: ConceptDictFile | null | undefined
+function loadDefaultConceptDict(): ConceptDictFile | null {
+  if (_cachedDefaultDict !== undefined) return _cachedDefaultDict
   // Resolve relative to this module file so tests/CLI can locate it reliably
   const __dirnameLocal = fileURLToPath(new URL('.', import.meta.url))
   const dictPath = path.resolve(
     __dirnameLocal,
     '../dictionaries/global.concepts.json',
   )
-  return readJson(dictPath)
+  _cachedDefaultDict = readJson(dictPath)
+  return _cachedDefaultDict
}

Also applies to: 79-86


32-38: Improve CamelCase/acronym splitting in normalizeAlias.

The current regex doesn’t split acronyms well (e.g., HTTPServer → httpserver). Consider a more robust split or reuse a well‑known utility (e.g., lodash.snakeCase) to reduce false negatives in token matching.

-const camelSplit = s.replace(/([a-z0-9])([A-Z])/g, '$1_$2')
+const camelSplit = s
+  // split lower→Upper and digit→Upper
+  .replace(/([a-z0-9])([A-Z])/g, '$1_$2')
+  // split acronym→Word boundary (HTTPServer → HTTP_Server)
+  .replace(/([A-Z]+)([A-Z][a-z])/g, '$1_$2')

106-124: Precompute alias token sets in the index to speed up inference.

inferConceptByTokens recomputes toTokens for every alias on every call. For larger dictionaries this becomes hot. Store pre-tokenized alias sets per concept in ConceptIndex at build time.

  • Extend ConceptIndex with conceptToAliasTokenSets: Map<string, Set[]>.
  • Populate in buildConceptIndex.
  • Use precomputed sets in inferConceptByTokens.
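Under the bullets above, the precomputation could be sketched as follows. The `conceptToAliasTokenSets` name comes from the suggestion; the rest is hypothetical.

```typescript
// Hypothetical sketch: tokenize every alias once at index build time so
// inference only computes Jaccard over cached Sets instead of re-tokenizing
// each alias on every lookup.
type ConceptIndex = {
  aliasToConcept: Map<string, string>
  conceptToAliasTokenSets: Map<string, Set<string>[]>
}

const toTokens = (s: string): Set<string> =>
  new Set(s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean))

const buildIndex = (
  concepts: { id: string; aliases: string[] }[],
): ConceptIndex => {
  const aliasToConcept = new Map<string, string>()
  const conceptToAliasTokenSets = new Map<string, Set<string>[]>()
  for (const c of concepts) {
    const tokenSets: Set<string>[] = []
    for (const alias of c.aliases) {
      aliasToConcept.set(alias, c.id)
      tokenSets.push(toTokens(alias)) // computed once, reused per lookup
    }
    conceptToAliasTokenSets.set(c.id, tokenSets)
  }
  return { aliasToConcept, conceptToAliasTokenSets }
}
```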

140-182: Matching strategy could be upgraded to per‑concept maximum matching.

Within a concept group you greedily pick the best candidate per reference. Order of references can affect outcomes. If mismatches are observed, consider a per‑concept maximum‑weight matching (Hungarian/assignment) using Jaccard scores as weights.


32-58: Style: prefer const arrow functions for small utilities.

For TS utilities, our guideline prefers const arrows. Optional, but keeps style consistent in this package. Based on coding guidelines.

-function normalizeAlias(s: string): string {
+const normalizeAlias = (s: string): string => {
   ...
-}
+}
 
-function toTokens(s: string): string[] {
+const toTokens = (s: string): string[] => {
   ...
-}
+}
 
-function jaccard(a: Set<string>, b: Set<string>): number {
+const jaccard = (a: Set<string>, b: Set<string>): number => {
   ...
-}
+}
 
-function readJson(filePath: string): ConceptDictFile | null {
+const readJson = (filePath: string): ConceptDictFile | null => {
   ...
-}
+}
 
-function loadDefaultConceptDict(): ConceptDictFile | null {
+const loadDefaultConceptDict = (): ConceptDictFile | null => {
   ...
-}
+}
 
-function inferConceptByTokens(...): string | null {
+const inferConceptByTokens = (...): string | null => {
   ...
-}
+}

Also applies to: 60-71, 106-134

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d01d3c and 0b7401e.

📒 Files selected for processing (6)
  • frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json (1 hunks)
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts (1 hunks)
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1 hunks)
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts (1 hunks)
  • frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (3 hunks)
  • frontend/internal-packages/schema-bench/src/index.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

Name utility files in camelCase (e.g., mergeSchema.ts)

Files:

  • frontend/internal-packages/schema-bench/src/index.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
  • frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use TypeScript/TSX across the codebase

**/*.{ts,tsx}: Prefer early returns for readability
Use named exports only (no default exports)
Prefer const arrow functions over function declarations for simple utilities (e.g., const toggle = () => {})

Files:

  • frontend/internal-packages/schema-bench/src/index.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
  • frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/internal-packages/**

📄 CodeRabbit inference engine (AGENTS.md)

Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages

Files:

  • frontend/internal-packages/schema-bench/src/index.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
  • frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts
  • frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Follow existing import patterns and tsconfig path aliases

Files:

  • frontend/internal-packages/schema-bench/src/index.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
  • frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Write unit tests with filenames ending in .test.ts or .test.tsx colocated near source

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
🧠 Learnings (1)
📚 Learning: 2025-07-30T05:52:56.270Z
Learnt from: hoshinotsuyoshi
PR: liam-hq/liam#2771
File: frontend/internal-packages/schema-bench/src/cli/executeLiamDb.ts:22-22
Timestamp: 2025-07-30T05:52:56.270Z
Learning: The schema-bench package (frontend/internal-packages/schema-bench) has been converted from ESM to CommonJS mode by removing "type": "module" from package.json, making __dirname available and correct to use in TypeScript files within this package.

Applied to files:

  • frontend/internal-packages/schema-bench/src/index.ts
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
🧬 Code graph analysis (2)
frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (1)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1)
  • dictionaryMatch (141-182)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts (1)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (2)
  • buildConceptIndex (74-99)
  • dictionaryMatch (141-182)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Supabase Preview
  • GitHub Check: Supabase Preview
  • GitHub Check: frontend-ci
  • GitHub Check: Supabase Preview
🔇 Additional comments (1)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1)

9-12: Ensure JSON asset is bundled and importable

  • No resolveJsonModule in tsconfig → you can’t import the JSON directly in a bundler-friendly way
  • No dist folder or asset‐copy config found → the JSON won’t be next to compiled code at runtime
    Add one of:
  • Enable resolveJsonModule and switch to a import globalConcepts from '../dictionaries/global.concepts.json'
  • Update your build config (or npm pack “files” list) to copy global.concepts.json into dist (and the published package)

@NoritakaIkeda (Member, Author) commented:

@codex review


Codex Review: Didn't find any major issues. Keep them coming!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

…siness entities

Add new concept mappings for customer account management and data privacy domains:
- customer_account: Support for customer/account entity variations
- customer_pii: Personal information handling concepts
- closure_reason: Account closure and withdrawal reason classifications
- account_closure: Account closure and withdrawal event tracking
- data_erasure_request: GDPR/privacy compliance data deletion workflows
- audit_log: General audit logging concepts

This enhances dictionary-based matching for schema evaluation across different business contexts.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

@coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b7401e and adc32aa.

📒 Files selected for processing (1)
  • frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
frontend/internal-packages/**

📄 CodeRabbit inference engine (AGENTS.md)

Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages

Files:

  • frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: CodeQL
  • GitHub Check: frontend-ci
  • GitHub Check: frontend-lint
  • GitHub Check: codeql / languages (javascript) / Perform CodeQL for javascript

…ng and warnings

- Add warning when no concept dictionaries are loaded to improve debugging visibility
- Handle duplicate aliases gracefully by keeping first mapping and logging conflicts
- Prevent silent failures when dictionary-based matching is disabled
- Improve developer experience with clear diagnostic messages

This ensures dictionary matching behavior is transparent and debuggable when issues occur.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

@coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between adc32aa and 549f665.

📒 Files selected for processing (1)
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

Name utility files in camelCase (e.g., mergeSchema.ts)

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use TypeScript/TSX across the codebase

**/*.{ts,tsx}: Prefer early returns for readability
Use named exports only (no default exports)
Prefer const arrow functions over function declarations for simple utilities (e.g., const toggle = () => {})

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/internal-packages/**

📄 CodeRabbit inference engine (AGENTS.md)

Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Follow existing import patterns and tsconfig path aliases

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: frontend-lint
  • GitHub Check: frontend-ci

Comment on lines 99 to 101
const normAliases = c.aliases.map((a) => normalizeAlias(a))
conceptToAliases.set(c.id, normAliases)
for (const a of normAliases) {

⚠️ Potential issue | 🟠 Major

Merge aliases instead of overwriting per concept.

Line 100 replaces any previously registered aliases for the same concept ID. When consumers supply an additional dictionary to extend an existing concept (e.g., adding more synonyms on top of the shipped global file), the earlier alias tokens vanish from conceptToAliases. As soon as that happens, token-based inference (and any heuristics depending on those tokens) can no longer recover the original matches, so valid reference/candidate pairs start failing silently. Please merge with the existing list instead of overwriting.

       const normAliases = c.aliases.map((a) => normalizeAlias(a))
-      conceptToAliases.set(c.id, normAliases)
+      const existing = conceptToAliases.get(c.id)
+      const merged = existing
+        ? Array.from(new Set([...existing, ...normAliases]))
+        : normAliases
+      conceptToAliases.set(c.id, merged)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const normAliases = c.aliases.map((a) => normalizeAlias(a))
conceptToAliases.set(c.id, normAliases)
for (const a of normAliases) {
const normAliases = c.aliases.map((a) => normalizeAlias(a))
const existing = conceptToAliases.get(c.id)
const merged = existing
? Array.from(new Set([...existing, ...normAliases]))
: normAliases
conceptToAliases.set(c.id, merged)
for (const a of normAliases) {
🤖 Prompt for AI Agents
In
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
around lines 99 to 101, the code currently assigns conceptToAliases.set(c.id,
normAliases) which overwrites any previously registered aliases for the same
concept ID; change this to merge with existing aliases instead of replacing them
by fetching the current alias array (if any), concatenating the new normalized
aliases, deduplicating (preserve order or use a Set), and then storing the
merged list back into conceptToAliases so previously supplied dictionaries that
extend a concept do not lose earlier tokens.
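The fix can be exercised in isolation. The sketch below uses a hypothetical `registerAliases` helper (not the PR's actual function, which lives inside `buildConceptIndex`) to show that aliases from an earlier dictionary survive a later registration for the same concept ID:

```typescript
// Standalone sketch of merge-instead-of-overwrite alias registration.
// registerAliases is a hypothetical name for illustration only.
const conceptToAliases = new Map<string, string[]>()

const registerAliases = (conceptId: string, aliases: string[]): void => {
  const existing = conceptToAliases.get(conceptId)
  // Merge and deduplicate so earlier dictionaries keep their tokens.
  const merged = existing
    ? Array.from(new Set([...existing, ...aliases]))
    : aliases
  conceptToAliases.set(conceptId, merged)
}

// The shipped global dictionary registers first...
registerAliases('customer', ['customer_info'])
// ...then a consumer-supplied dictionary extends the same concept.
registerAliases('customer', ['client_data', 'customer_info'])
```

With the overwrite behavior, the second call would drop `customer_info` whenever the consumer's dictionary omitted it; the merge keeps both alias lists intact.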

…chitecture

- Simplify buildConceptIndex to use only default dictionary loading
- Update function signatures and variable names for better clarity
- Add comprehensive error handling and warnings for dictionary operations
- Update test file to match new function signatures
- Clean up code structure for better maintainability

This refactoring improves the dictionary matching system's reliability and makes the codebase more maintainable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Rename function from conceptOf to resolveConceptId for better clarity
- Update parameter names: references → referenceNames, candidates → predictNames for precision
- Add closure_reason alias to improve concept matching coverage
- Improve variable naming throughout dictionaryMatch function for better readability

This refactoring enhances code clarity while maintaining the same functionality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds dictionary-based concept matching to improve schema evaluation accuracy by recognizing semantically equivalent table/column names that use different naming conventions but represent the same concepts.

  • Implements dictionary-based matching using pre-registered concept aliases as the first matching strategy
  • Adds token-based fallback using Jaccard similarity when direct aliases aren't found
  • Includes a global concept dictionary with common database schema patterns
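As a rough illustration of the token-based fallback in the second bullet, the `toTokens` and `jaccard` helpers can be sketched as follows; these are simplified stand-ins for the PR's implementations, not the shipped code:

```typescript
// Simplified stand-ins for the PR's toTokens/jaccard helpers.
const toTokens = (name: string): string[] =>
  name
    .replace(/([a-z])([A-Z])/g, '$1_$2') // split CamelCase boundaries
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0)

const jaccard = (a: Set<string>, b: Set<string>): number => {
  if (a.size === 0 && b.size === 0) return 0
  let intersection = 0
  for (const token of a) if (b.has(token)) intersection++
  // |A ∩ B| / |A ∪ B|
  return intersection / (a.size + b.size - intersection)
}

const score = jaccard(
  new Set(toTokens('customerInfo')),
  new Set(toTokens('customer_info')),
)
// identical token sets, so the score is 1
```

With the default 0.8 threshold described in the PR, `customerInfo` and `customer_info` match, while unrelated names with disjoint tokens score 0 and fall through.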

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/index.ts: Exports the new dictionaryMatch module
  • src/evaluate/evaluate.ts: Integrates dictionary matching into table and column mapping pipelines
  • src/dictionaryMatch/index.ts: Module export file for dictionary matching functionality
  • src/dictionaryMatch/dictionaryMatch.ts: Core implementation of dictionary-based concept matching
  • src/dictionaryMatch/dictionaryMatch.test.ts: Test suite covering concept matching scenarios
  • src/dictionaries/global.concepts.json: Default concept dictionary with common schema patterns


Comment on lines +167 to +211
// biome-ignore lint/complexity/noExcessiveCognitiveComplexity: TODO: Refactor to reduce complexity
export function dictionaryMatch(
referenceNames: string[],
predictNames: string[],
mapping: Mapping,
): void {
const localIndex = buildConceptIndex()

// Track used candidates to avoid duplicates
const usedPredictNames = new Set(Object.values(mapping))

// Precompute candidate concepts
const predictConceptIdByName = new Map<string, string | null>()
for (const predictName of predictNames) {
predictConceptIdByName.set(
predictName,
resolveConceptId(predictName, localIndex),
)
}

for (const referenceName of referenceNames) {
if (mapping[referenceName] !== undefined) continue
const referenceConceptId = resolveConceptId(referenceName, localIndex)
if (!referenceConceptId) continue

// Choose the first unused candidate with the same concept; if multiple, prefer higher token overlap
let bestPredict: { name: string; score: number } | null = null
for (const predictName of predictNames) {
if (usedPredictNames.has(predictName)) continue
const predictConceptId = predictConceptIdByName.get(predictName)
if (!predictConceptId || predictConceptId !== referenceConceptId) continue
// Score by token Jaccard
const similarityScore = jaccard(
new Set(toTokens(referenceName)),
new Set(toTokens(predictName)),
)
if (!bestPredict || similarityScore > bestPredict.score)
bestPredict = { name: predictName, score: similarityScore }
}
if (bestPredict) {
mapping[referenceName] = bestPredict.name
usedPredictNames.add(bestPredict.name)
}
}
}
Copilot AI Sep 30, 2025

The TODO comment indicates acknowledged technical debt. Consider breaking down this function into smaller, more focused functions to improve maintainability and readability.

Suggested change
// Refactored to reduce complexity by extracting helper functions
export function dictionaryMatch(
referenceNames: string[],
predictNames: string[],
mapping: Mapping,
): void {
const localIndex = buildConceptIndex()
const usedPredictNames = new Set(Object.values(mapping))
const predictConceptIdByName = precomputePredictConceptIds(predictNames, localIndex)
for (const referenceName of referenceNames) {
if (mapping[referenceName] !== undefined) continue
const referenceConceptId = resolveConceptId(referenceName, localIndex)
if (!referenceConceptId) continue
const bestPredict = findBestPredict(
referenceName,
referenceConceptId,
predictNames,
predictConceptIdByName,
usedPredictNames,
)
if (bestPredict) {
mapping[referenceName] = bestPredict
usedPredictNames.add(bestPredict)
}
}
}
/**
* Precompute concept IDs for each predicted name.
*/
function precomputePredictConceptIds(
predictNames: string[],
index: ConceptIndex,
): Map<string, string | null> {
const predictConceptIdByName = new Map<string, string | null>()
for (const predictName of predictNames) {
predictConceptIdByName.set(
predictName,
resolveConceptId(predictName, index),
)
}
return predictConceptIdByName
}
/**
* Find the best unused predicted name for a reference, matching the concept ID and maximizing token overlap.
*/
function findBestPredict(
referenceName: string,
referenceConceptId: string,
predictNames: string[],
predictConceptIdByName: Map<string, string | null>,
usedPredictNames: Set<string>,
): string | null {
let bestPredict: { name: string; score: number } | null = null
for (const predictName of predictNames) {
if (usedPredictNames.has(predictName)) continue
const predictConceptId = predictConceptIdByName.get(predictName)
if (!predictConceptId || predictConceptId !== referenceConceptId) continue
const similarityScore = jaccard(
new Set(toTokens(referenceName)),
new Set(toTokens(predictName)),
)
if (!bestPredict || similarityScore > bestPredict.score)
bestPredict = { name: predictName, score: similarityScore }
}
return bestPredict ? bestPredict.name : null
}


Comment on lines +101 to +103
console.warn(
'[schema-bench] No concept dictionaries loaded; dictionaryMatch will be a no-op.',
)
Copilot AI Sep 30, 2025

[nitpick] Using console.warn for library warnings may not be appropriate in all environments. Consider using a proper logging framework or making the warning behavior configurable.
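One way to make the warning behavior configurable, as this nitpick suggests, is an injectable logger; this is a minimal sketch, and the `Logger` type and `setLogger` helper are assumed names rather than anything in the PR:

```typescript
// Hypothetical injectable-logger sketch; Logger and setLogger are
// illustration-only names, not part of the PR's API.
type Logger = (message: string) => void

// Default behavior matches the current code: write to console.warn.
let warn: Logger = (message) => console.warn(message)

// Consumers can silence or redirect warnings in their environment.
const setLogger = (logger: Logger): void => {
  warn = logger
}

// Library code then calls warn(...) instead of console.warn directly.
const captured: string[] = []
setLogger((message) => captured.push(message))
warn('[schema-bench] No concept dictionaries loaded; dictionaryMatch will be a no-op.')
```

Here a test harness (or a server environment without a useful console) captures the message instead of printing it, without the library taking a dependency on a logging framework.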


Comment on lines +116 to +118
console.warn(
`[schema-bench] Duplicate alias "${normalizedAlias}" for concepts "${existingConceptId}" and "${concept.id}". Keeping "${existingConceptId}".`,
)
Copilot AI Sep 30, 2025

[nitpick] Using console.warn for duplicate alias warnings may not be appropriate in all environments. Consider using a proper logging framework or making the warning behavior configurable.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (3)

68-76: Consider logging errors in readJsonFile for debuggability.

The function silently returns null on any error, which may hide IO issues, permission problems, or malformed JSON. While the calling code handles the null case, logging the error would help troubleshoot deployment or configuration issues.

 function readJsonFile(filePath: string): ConceptDictionaryFile | null {
   try {
     const raw = fs.readFileSync(filePath, 'utf8')
     const parsed = JSON.parse(raw)
     return parsed
-  } catch {
+  } catch (error) {
+    console.warn(`[schema-bench] Failed to read dictionary from ${filePath}:`, error)
     return null
   }
 }

129-145: Consider making the Jaccard threshold configurable.

The 0.8 threshold is hardcoded on line 143. Different use cases may benefit from tuning this parameter. Consider accepting it as an optional parameter to dictionaryMatch with 0.8 as the default.

 function inferConceptByTokens(
   name: string,
   index: ConceptIndex,
+  threshold = 0.8,
 ): string | null {
   const tokens = toTokens(name)
   const tokenSet = new Set(tokens)
   let best: { id: string; score: number } | null = null
   for (const [id, aliases] of index.conceptToAliases) {
     for (const alias of aliases) {
       const aliasTokens = toTokens(alias)
       const score = jaccard(tokenSet, new Set(aliasTokens))
       if (!best || score > best.score) best = { id, score }
     }
   }
-  const threshold = 0.8
   return best && best.score >= threshold ? best.id : null
 }

193-205: Optional: Extract candidate selection logic to reduce complexity.

The nested loop at lines 193-205 could be extracted into a helper function to address the cognitive complexity warning and improve testability.

function findBestMatchingPredict(
  referenceName: string,
  referenceConceptId: string,
  predictNames: string[],
  usedPredictNames: Set<string>,
  predictConceptIdByName: Map<string, string | null>,
): string | null {
  let bestPredict: { name: string; score: number } | null = null
  for (const predictName of predictNames) {
    if (usedPredictNames.has(predictName)) continue
    const predictConceptId = predictConceptIdByName.get(predictName)
    if (!predictConceptId || predictConceptId !== referenceConceptId) continue
    const similarityScore = jaccard(
      new Set(toTokens(referenceName)),
      new Set(toTokens(predictName)),
    )
    if (!bestPredict || similarityScore > bestPredict.score)
      bestPredict = { name: predictName, score: similarityScore }
  }
  return bestPredict?.name ?? null
}

Then simplify the main loop:

     // Choose the first unused candidate with the same concept; if multiple, prefer higher token overlap
-    let bestPredict: { name: string; score: number } | null = null
-    for (const predictName of predictNames) {
-      if (usedPredictNames.has(predictName)) continue
-      const predictConceptId = predictConceptIdByName.get(predictName)
-      if (!predictConceptId || predictConceptId !== referenceConceptId) continue
-      // Score by token Jaccard
-      const similarityScore = jaccard(
-        new Set(toTokens(referenceName)),
-        new Set(toTokens(predictName)),
-      )
-      if (!bestPredict || similarityScore > bestPredict.score)
-        bestPredict = { name: predictName, score: similarityScore }
-    }
-    if (bestPredict) {
-      mapping[referenceName] = bestPredict.name
-      usedPredictNames.add(bestPredict.name)
+    const bestPredictName = findBestMatchingPredict(
+      referenceName,
+      referenceConceptId,
+      predictNames,
+      usedPredictNames,
+      predictConceptIdByName,
+    )
+    if (bestPredictName) {
+      mapping[referenceName] = bestPredictName
+      usedPredictNames.add(bestPredictName)
     }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed15b73 and 553aa58.

📒 Files selected for processing (2)
  • frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json (1 hunks)
  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json
🧰 Additional context used
📓 Path-based instructions (4)
**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

Name utility files in camelCase (e.g., mergeSchema.ts)

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use TypeScript/TSX across the codebase

**/*.{ts,tsx}: Prefer early returns for readability
Use named exports only (no default exports)
Prefer const arrow functions over function declarations for simple utilities (e.g., const toggle = () => {})

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/internal-packages/**

📄 CodeRabbit inference engine (AGENTS.md)

Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Follow existing import patterns and tsconfig path aliases

Files:

  • frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Supabase Preview
  • GitHub Check: frontend-ci
  • GitHub Check: frontend-lint
  • GitHub Check: codeql / languages (javascript) / Perform CodeQL for javascript
  • GitHub Check: Supabase Preview
🔇 Additional comments (2)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (2)

13-48: LGTM! Type definitions are clear and well-documented.

The type hierarchy is appropriate for the dictionary-based matching system. The bidirectional index structure in ConceptIndex enables efficient lookups in both directions.


106-124: Merge aliases per concept ID instead of overwriting.

Line 111 still overwrites previously registered aliases for the same concept ID, despite the past review comment being marked as addressed. When multiple dictionaries extend the same concept with additional aliases, earlier aliases are lost, breaking token-based inference.

Apply this diff to merge aliases:

       const normalizedAliases = concept.aliases.map((alias) =>
         normalizeAlias(alias),
       )
-      conceptToAliases.set(concept.id, normalizedAliases)
+      const existingAliases = conceptToAliases.get(concept.id)
+      const mergedAliases = existingAliases
+        ? Array.from(new Set([...existingAliases, ...normalizedAliases]))
+        : normalizedAliases
+      conceptToAliases.set(concept.id, mergedAliases)
       for (const normalizedAlias of normalizedAliases) {

Comment on lines +129 to +145
function inferConceptByTokens(
name: string,
index: ConceptIndex,
): string | null {
const tokens = toTokens(name)
const tokenSet = new Set(tokens)
let best: { id: string; score: number } | null = null
for (const [id, aliases] of index.conceptToAliases) {
for (const alias of aliases) {
const aliasTokens = toTokens(alias)
const score = jaccard(tokenSet, new Set(aliasTokens))
if (!best || score > best.score) best = { id, score }
}
}
const threshold = 0.8
return best && best.score >= threshold ? best.id : null
}
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify: Is generic token filtering intentionally omitted?

The PR objectives mention "Generic token filtering to remove common tokens (info, data, table, etc.)" as a core feature, but inferConceptByTokens uses all tokens from the input without filtering. Was this feature deferred, or should common tokens be filtered before computing Jaccard similarity?

Run this script to check if generic token filtering exists elsewhere:


🏁 Script executed:

#!/bin/bash
# Description: Search for generic token filtering implementation

# Search for filter/blacklist patterns related to common tokens
rg -n -C3 'info|data|table' --type ts -g 'schema-bench/**' | rg -i 'filter|blacklist|stopword|generic.*token'

# Search for any token filtering logic
ast-grep --pattern 'const $FILTER_TOKENS = [$$$]'

# Check if filtering is mentioned in tests
rg -n 'generic.*token|common.*token|filter.*token' --type ts -g '**/dictionaryMatch.test.ts'

Length of output: 69155


Implement generic token filtering or update PR objectives

Function inferConceptByTokens (frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts:129-145) currently includes all tokens when computing Jaccard similarity, without removing common tokens (e.g., info, data, table) as described in the PR objectives. Add filtering of these generic tokens prior to scoring, or remove “Generic token filtering” from the feature list.

🤖 Prompt for AI Agents
In
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
around lines 129 to 145, inferConceptByTokens currently uses all tokens for
Jaccard scoring but must filter out generic tokens (e.g., "info", "data",
"table") as stated in the PR objectives; implement this by defining a small
set/array of lowercased generic tokens, filter the tokenSet and each aliasTokens
array to remove any tokens in that set (normalize to lower case and strip
empties) before computing the jaccard score, then continue the existing
scoring/threshold logic (or, if you prefer not to implement filtering, remove
"Generic token filtering" from the feature list in the PR instead).

Comment on lines +168 to +211
export function dictionaryMatch(
referenceNames: string[],
predictNames: string[],
mapping: Mapping,
): void {
const localIndex = buildConceptIndex()

// Track used candidates to avoid duplicates
const usedPredictNames = new Set(Object.values(mapping))

// Precompute candidate concepts
const predictConceptIdByName = new Map<string, string | null>()
for (const predictName of predictNames) {
predictConceptIdByName.set(
predictName,
resolveConceptId(predictName, localIndex),
)
}

for (const referenceName of referenceNames) {
if (mapping[referenceName] !== undefined) continue
const referenceConceptId = resolveConceptId(referenceName, localIndex)
if (!referenceConceptId) continue

// Choose the first unused candidate with the same concept; if multiple, prefer higher token overlap
let bestPredict: { name: string; score: number } | null = null
for (const predictName of predictNames) {
if (usedPredictNames.has(predictName)) continue
const predictConceptId = predictConceptIdByName.get(predictName)
if (!predictConceptId || predictConceptId !== referenceConceptId) continue
// Score by token Jaccard
const similarityScore = jaccard(
new Set(toTokens(referenceName)),
new Set(toTokens(predictName)),
)
if (!bestPredict || similarityScore > bestPredict.score)
bestPredict = { name: predictName, score: similarityScore }
}
if (bestPredict) {
mapping[referenceName] = bestPredict.name
usedPredictNames.add(bestPredict.name)
}
}
}
Contributor

⚠️ Potential issue | 🟠 Major

Optimize: Cache or inject the concept index to avoid rebuilding.

Line 173 calls buildConceptIndex() on every invocation of dictionaryMatch, which rereads and reparses the dictionary file and rebuilds the maps. If this function is called multiple times during evaluation (e.g., for tables and columns separately), this becomes a significant performance bottleneck.

Consider one of these approaches:

Option 1: Module-level cache (simplest)

+let cachedIndex: ConceptIndex | null = null
+
 export function dictionaryMatch(
   referenceNames: string[],
   predictNames: string[],
   mapping: Mapping,
 ): void {
-  const localIndex = buildConceptIndex()
+  if (!cachedIndex) {
+    cachedIndex = buildConceptIndex()
+  }
+  const localIndex = cachedIndex

Option 2: Accept index as parameter (more flexible)

 export function dictionaryMatch(
   referenceNames: string[],
   predictNames: string[],
   mapping: Mapping,
+  index?: ConceptIndex,
 ): void {
-  const localIndex = buildConceptIndex()
+  const localIndex = index ?? buildConceptIndex()

Suggested change
// Add a one-time cache for the concept index
let cachedIndex: ConceptIndex | null = null
export function dictionaryMatch(
referenceNames: string[],
predictNames: string[],
mapping: Mapping,
): void {
// Reuse the cached index if available
if (!cachedIndex) {
cachedIndex = buildConceptIndex()
}
const localIndex = cachedIndex
// Track used candidates to avoid duplicates
const usedPredictNames = new Set(Object.values(mapping))
// Precompute candidate concepts
const predictConceptIdByName = new Map<string, string | null>()
for (const predictName of predictNames) {
predictConceptIdByName.set(
predictName,
resolveConceptId(predictName, localIndex),
)
}
for (const referenceName of referenceNames) {
if (mapping[referenceName] !== undefined) continue
const referenceConceptId = resolveConceptId(referenceName, localIndex)
if (!referenceConceptId) continue
// Choose the first unused candidate with the same concept; if multiple, prefer higher token overlap
let bestPredict: { name: string; score: number } | null = null
for (const predictName of predictNames) {
if (usedPredictNames.has(predictName)) continue
const predictConceptId = predictConceptIdByName.get(predictName)
if (!predictConceptId || predictConceptId !== referenceConceptId) continue
// Score by token Jaccard
const similarityScore = jaccard(
new Set(toTokens(referenceName)),
new Set(toTokens(predictName)),
)
if (!bestPredict || similarityScore > bestPredict.score)
bestPredict = { name: predictName, score: similarityScore }
}
if (bestPredict) {
mapping[referenceName] = bestPredict.name
usedPredictNames.add(bestPredict.name)
}
}
}

import { describe, expect, it } from 'vitest'
import { dictionaryMatch } from './dictionaryMatch'

describe('dictionaryMatch (concept-based)', () => {
Member

👍

@NoritakaIkeda
Member Author

NoritakaIkeda commented Oct 1, 2025

@MH4GF @junkisai @sasamuku
Someone started reviewing this PR, but it hasn’t been finished yet. I’d appreciate it if someone else could take over the review.🙏

@MH4GF
Member

MH4GF commented Oct 1, 2025

@NoritakaIkeda Sorry for the delay 🙏🏻 I'm reviewing now!

Member

@MH4GF MH4GF left a comment

Thanks!

@@ -1,3 +1,4 @@
export * from './dictionaryMatch'
Member

Is this export necessary?

): Promise<Mapping> => {
const columnMapping: Mapping = {}

// NOTE: Implement synonym matching if needed
Member

The synonym matching this comment refers to has now been implemented in this PR. Shouldn't the comment be deleted?

): Promise<Mapping> => {
const tableMapping: Mapping = {}

// NOTE: Implement synonym matching if needed
Member

The synonym matching this comment refers to has now been implemented in this PR. Shouldn't the comment be deleted?

* - References with no resolvable concept remain unmapped.
*/

// biome-ignore lint/complexity/noExcessiveCognitiveComplexity: TODO: Refactor to reduce complexity
Member

Ideally, I'd prefer not to have any new code created that violates the rules, but this time it's acceptable as is.

@NoritakaIkeda
Member Author

#3637 (comment)
#3637 (comment)
#3637 (comment)

I want to fix these three immediately, but since I need to merge this PR urgently, I'll merge it now. I'll address them in the next PR.

@NoritakaIkeda NoritakaIkeda added this pull request to the merge queue Oct 2, 2025
Merged via the queue into main with commit 9f36bde Oct 2, 2025
40 checks passed
@NoritakaIkeda NoritakaIkeda deleted the feature/schema-bench-dictionary-matching branch October 2, 2025 00:48