✨(schema-bench): add dictionary-based concept matching for schema evaluation #3637
Conversation
Implement semantic concept matching using a pre-registered dictionary of alias groups to improve schema benchmarking accuracy.

Features:
- Dictionary-based matching with concept aliases (e.g., "customer_info" → ["customer_info", "client_data"])
- Token-based fallback matching using Jaccard similarity with a configurable threshold (default 0.8)
- Generic-token filtering to improve matching quality (filters out "info", "data", "table", etc.)
- Global concept dictionary with common database schema patterns
- Seamless integration with the existing table and column matching pipeline
- Comprehensive test coverage for all matching scenarios

Technical details:
- Adds a dictionaryMatch() function as the first matching strategy, before name similarity and word overlap
- Supports concept normalization with CamelCase splitting and canonicalization
- Uses Set-based Jaccard similarity for token overlap scoring
- Modular design allows custom dictionaries and configurable thresholds

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
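The token-based fallback named in the feature list can be sketched as follows. This is a minimal illustration, not the package's actual API: the names `GENERIC_TOKENS`, `tokenize`, and `tokenSimilarity` are assumptions, and the generic-token list is abbreviated.

```typescript
// Sketch of the fallback scorer: drop generic tokens, then compare the
// remaining token sets with Jaccard similarity against a threshold
// (the PR's default is 0.8). Names here are illustrative only.
const GENERIC_TOKENS = new Set(['info', 'data', 'table'])

const tokenize = (name: string): Set<string> =>
  new Set(
    name
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((t) => t.length > 0 && !GENERIC_TOKENS.has(t)),
  )

const tokenSimilarity = (a: string, b: string): number => {
  const ta = tokenize(a)
  const tb = tokenize(b)
  const intersection = [...ta].filter((t) => tb.has(t)).length
  const union = new Set([...ta, ...tb]).size
  return union === 0 ? 0 : intersection / union
}

// "customer_info" vs "customer_data": both reduce to {customer} → 1.0
console.log(tokenSimilarity('customer_info', 'customer_data'))
console.log(tokenSimilarity('customer_info', 'order_items')) // disjoint → 0
```

Filtering generic tokens before scoring is what lets "customer_info" and "customer_data" match at full similarity despite differing suffixes.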
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in its review comments. This may lead to a less comprehensive review.

Walkthrough: Adds a concept dictionary JSON, a dictionary-based concept-matching module with tests, re-exports it from the package index, and integrates dictionary-driven matching into evaluate's table/column mapping steps before the existing similarity logic.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Eval as evaluate.ts
    participant Dict as dictionaryMatch
    participant IDX as ConceptIndex
    participant File as global.concepts.json
    Eval->>Dict: dictionaryMatch(references, predictions, mapping)
    alt index absent
        Dict->>File: load JSON
        Dict->>IDX: buildConceptIndex(dict)
    else index provided
        Note over Dict,IDX: use provided index
    end
    loop for each reference
        Dict->>IDX: conceptOf(reference)
        alt concept found
            loop over predictions
                Dict->>IDX: conceptOf(prediction)
                alt same concept & prediction unused
                    Dict-->>Eval: mapping[reference]=prediction (mutate)
                end
            end
        end
    end
    Eval->>Eval: continue with name-similarity & overlap matching
```

```mermaid
sequenceDiagram
    autonumber
    participant Caller as conceptOf(name)
    participant IDX as ConceptIndex
    participant Infer as TokenInference
    Caller->>IDX: aliasToConcept lookup
    alt direct alias hit
        IDX-->>Caller: return conceptId
    else no direct hit
        Caller->>Infer: tokenize & Jaccard inference
        Infer-->>Caller: inferred conceptId or null
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (1)
48-53: Guard against existing mappings in nameSimilarity

In frontend/internal-packages/schema-bench/src/nameSimilarity/nameSimilarity.ts, wrap each reference match in a check (e.g. `if (!(referenceName in mapping))` or via `shouldSkipReference`) before assigning to `mapping`, so pre-populated entries aren't overwritten.
🧹 Nitpick comments (15)
frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json (2)
8-14: Add singular aliases to avoid missed matches

Include "order_detail" and "user_profile" in their respective alias lists so singular table/column names resolve to the same concept.

```diff
 {
   "id": "order_detail",
-  "aliases": ["order_details", "order_items", "purchase_items"]
+  "aliases": ["order_detail", "order_details", "order_items", "purchase_items"]
 },
 {
   "id": "user_profile",
-  "aliases": ["user_profiles", "account_info"]
+  "aliases": ["user_profile", "user_profiles", "account_info"]
 }
```
4-6: Consider broader customer synonyms (optional)

Optionally add "client_info" as an alias for customer_info to catch that common variant. Keep an eye on false positives.
frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts (1)
1-1: Align import path style with package convention (optional)

For consistency with nearby files that include .ts extensions, consider:

```diff
-export * from './dictionaryMatch'
+export * from './dictionaryMatch.ts'
```

Based on learnings (CommonJS mode is used in this package).
frontend/internal-packages/schema-bench/src/index.ts (1)
1-1: Public re-export looks good; consider consistent path style (optional)

LGTM functionally. If you standardize on extensioned paths in this package, mirror it here.

```diff
-export * from './dictionaryMatch'
+export * from './dictionaryMatch/index.ts'
```

frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (3)
17-17: Import buildConceptIndex and reuse a shared index

Build the concept index once to avoid reloading per call and pass it to both dictionaryMatch invocations.

```diff
-import { dictionaryMatch } from '../dictionaryMatch/dictionaryMatch.ts'
+import { dictionaryMatch, buildConceptIndex } from '../dictionaryMatch/dictionaryMatch.ts'
```

Add near the top of the module (outside functions):

```ts
// Build once per process; cheap and deterministic
const DEFAULT_CONCEPT_INDEX = buildConceptIndex()
```
46-47: Pass the shared concept index to table matching

Prevents redundant dictionary loads and ensures identical token filters across phases.

```diff
-  dictionaryMatch(referenceTableNames, predictTableNames, tableMapping)
+  dictionaryMatch(
+    referenceTableNames,
+    predictTableNames,
+    tableMapping,
+    undefined,
+    DEFAULT_CONCEPT_INDEX,
+  )
```
87-88: Pass the shared concept index to column matching

Same rationale as tables.

```diff
-  dictionaryMatch(referenceColumnNames, predictColumnNames, columnMapping)
+  dictionaryMatch(
+    referenceColumnNames,
+    predictColumnNames,
+    columnMapping,
+    undefined,
+    DEFAULT_CONCEPT_INDEX,
+  )
```

frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts (2)
2-2: Align import style (optional)

Match evaluate.ts' style by adding the .ts extension (or switch evaluate.ts to use the barrel). Either is fine; pick one and stick to it.

```diff
-import { buildConceptIndex, dictionaryMatch } from './dictionaryMatch'
+import { buildConceptIndex, dictionaryMatch } from './dictionaryMatch.ts'
```
14-21: Add a tie-breaker test for Jaccard preference (optional)

Add a case with multiple candidates sharing the same concept to prove the "prefer higher token overlap" rule. Example:

```ts
it('prefers higher token overlap among same-concept candidates', () => {
  const reference = ['order_details'] // tokens: order, details
  const predict = ['purchase_items', 'order_items'] // both same concept; 'order_items' overlaps 'order'
  const mapping: Record<string, string> = {}
  const index = buildConceptIndex()
  dictionaryMatch(reference, predict, mapping, undefined, index)
  expect(mapping).toEqual({ order_details: 'order_items' })
})
```

Also applies to: 23-30
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (6)
21-24: Export types used in the public API (avoid inlined anonymous types in .d.ts).

buildConceptIndex(dicts?: ConceptDictFile[]) exposes ConceptDictFile, but Concept and ConceptDictFile aren't exported. Export them to keep consumer typings clean and discoverable. Based on coding guidelines.

```diff
-type Concept = {
+export type Concept = {
   id: string
   aliases: string[]
   scope?: string[]
 }
-type ConceptDictFile = {
+export type ConceptDictFile = {
   concepts: Concept[]
   generic_tokens?: string[]
 }
```

Also applies to: 74-99
63-71: Cache the default dictionary to avoid repeated sync disk IO.

buildConceptIndex() may be called multiple times (tables, then columns). Memoize the default dict to prevent repeated fs.readFileSync.

```diff
-function loadDefaultConceptDict(): ConceptDictFile | null {
+let _cachedDefaultDict: ConceptDictFile | null | undefined
+function loadDefaultConceptDict(): ConceptDictFile | null {
+  if (_cachedDefaultDict !== undefined) return _cachedDefaultDict
   // Resolve relative to this module file so tests/CLI can locate it reliably
   const __dirnameLocal = fileURLToPath(new URL('.', import.meta.url))
   const dictPath = path.resolve(
     __dirnameLocal,
     '../dictionaries/global.concepts.json',
   )
-  return readJson(dictPath)
+  _cachedDefaultDict = readJson(dictPath)
+  return _cachedDefaultDict
 }
```

Also applies to: 79-86
32-38: Improve CamelCase/acronym splitting in normalizeAlias.

The current regex doesn't split acronyms well (e.g., HTTPServer → httpserver). Consider a more robust split or reuse a well-known utility (e.g., lodash.snakeCase) to reduce false negatives in token matching.

```diff
-const camelSplit = s.replace(/([a-z0-9])([A-Z])/g, '$1_$2')
+const camelSplit = s
+  // split lower→Upper and digit→Upper
+  .replace(/([a-z0-9])([A-Z])/g, '$1_$2')
+  // split acronym→Word boundary (HTTPServer → HTTP_Server)
+  .replace(/([A-Z]+)([A-Z][a-z])/g, '$1_$2')
```
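A standalone sketch of the two-pass split this comment suggests; the helper name `splitWords` is illustrative, not the module's real API.

```typescript
// Two regex passes: first break lower/digit→Upper boundaries, then break
// acronym→Word boundaries, so HTTPServerLog becomes http_server_log
// instead of httpserverlog.
const splitWords = (s: string): string[] =>
  s
    .replace(/([a-z0-9])([A-Z])/g, '$1_$2')    // lower/digit → Upper boundary
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1_$2') // acronym → Word boundary
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean)

console.log(splitWords('HTTPServerLog')) // acronym survives as one token
console.log(splitWords('userProfileV2'))
```

With only the first pass, "HTTPServerLog" would collapse to a single `httpserverlog` token; the second pass is what separates the acronym from the following word.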
106-124: Precompute alias token sets in the index to speed up inference.

inferConceptByTokens recomputes toTokens for every alias on every call. For larger dictionaries this becomes hot. Store pre-tokenized alias sets per concept in ConceptIndex at build time:

- Extend ConceptIndex with conceptToAliasTokenSets: Map<string, Set<string>[]>.
- Populate it in buildConceptIndex.
- Use the precomputed sets in inferConceptByTokens.
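The three steps above can be sketched as follows. The field and helper names (`conceptToAliasTokenSets`, `buildIndex`, `inferConcept`) are hypothetical stand-ins for the module's real identifiers.

```typescript
// Tokenize each alias once at build time; inference then only tokenizes
// the query name and reuses the precomputed sets.
const toTokens = (s: string): string[] =>
  s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)

const jaccard = (a: Set<string>, b: Set<string>): number => {
  const inter = [...a].filter((t) => b.has(t)).length
  const union = new Set([...a, ...b]).size
  return union === 0 ? 0 : inter / union
}

type ConceptIndex = {
  conceptToAliasTokenSets: Map<string, Set<string>[]>
}

const buildIndex = (
  concepts: { id: string; aliases: string[] }[],
): ConceptIndex => ({
  conceptToAliasTokenSets: new Map(
    concepts.map((c) => [c.id, c.aliases.map((a) => new Set(toTokens(a)))]),
  ),
})

const inferConcept = (
  name: string,
  idx: ConceptIndex,
  threshold = 0.8,
): string | null => {
  const query = new Set(toTokens(name))
  let best: { id: string; score: number } | null = null
  for (const [id, tokenSets] of idx.conceptToAliasTokenSets) {
    for (const set of tokenSets) {
      const score = jaccard(query, set)
      if (score >= threshold && (!best || score > best.score)) best = { id, score }
    }
  }
  return best?.id ?? null
}

const idx = buildIndex([
  { id: 'order_detail', aliases: ['order_details', 'order_items'] },
])
console.log(inferConcept('order_details', idx)) // exact token match → 'order_detail'
```

Tokenization cost then scales with the number of lookups, not lookups times dictionary size.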
140-182: Matching strategy could be upgraded to per-concept maximum matching.

Within a concept group you greedily pick the best candidate per reference, so the order of references can affect outcomes. If mismatches are observed, consider a per-concept maximum-weight matching (Hungarian/assignment) using Jaccard scores as weights.
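To make the order-dependence concrete, here is a brute-force maximum-weight assignment for the small per-concept groups the comment describes. This is a sketch only; a real implementation would use the Hungarian algorithm for larger groups, and `assign` is a hypothetical helper name.

```typescript
// Exhaustively try every reference→candidate assignment and keep the one
// with the highest total score. Feasible only for small groups.
const assign = (scores: number[][]): number[] => {
  // scores[i][j] = weight of matching reference i to candidate j
  const n = scores.length
  const m = scores[0]?.length ?? 0
  let best: number[] = []
  let bestTotal = -Infinity
  const recurse = (i: number, used: Set<number>, picks: number[], total: number) => {
    if (i === n) {
      if (total > bestTotal) {
        bestTotal = total
        best = [...picks]
      }
      return
    }
    for (let j = 0; j < m; j++) {
      if (used.has(j)) continue
      used.add(j)
      recurse(i + 1, used, [...picks, j], total + scores[i][j])
      used.delete(j)
    }
  }
  recurse(0, new Set(), [], 0)
  return best
}

// Greedy order-dependence: reference 0 would grab candidate 1 (0.9),
// forcing reference 1 onto candidate 0 (0.1) for a total of 1.0; the
// assignment instead finds 0.8 + 0.8 = 1.6.
console.log(assign([[0.8, 0.9], [0.1, 0.8]])) // → [0, 1]
```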
32-58: Style: prefer const arrow functions for small utilities.

For TS utilities, our guideline prefers const arrows. Optional, but keeps style consistent in this package. Based on coding guidelines.

```diff
-function normalizeAlias(s: string): string {
+const normalizeAlias = (s: string): string => {
   ...
 }
-function toTokens(s: string): string[] {
+const toTokens = (s: string): string[] => {
   ...
 }
-function jaccard(a: Set<string>, b: Set<string>): number {
+const jaccard = (a: Set<string>, b: Set<string>): number => {
   ...
 }
-function readJson(filePath: string): ConceptDictFile | null {
+const readJson = (filePath: string): ConceptDictFile | null => {
   ...
 }
-function loadDefaultConceptDict(): ConceptDictFile | null {
+const loadDefaultConceptDict = (): ConceptDictFile | null => {
   ...
 }
-function inferConceptByTokens(...): string | null {
+const inferConceptByTokens = (...): string | null => {
   ...
 }
```

Also applies to: 60-71, 106-134
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)

- frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json (1 hunk)
- frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts (1 hunk)
- frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1 hunk)
- frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts (1 hunk)
- frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (3 hunks)
- frontend/internal-packages/schema-bench/src/index.ts (1 hunk)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
Name utility files in camelCase (e.g., mergeSchema.ts)
Files:
frontend/internal-packages/schema-bench/src/index.ts
frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use TypeScript/TSX across the codebase
**/*.{ts,tsx}
: Prefer early returns for readability
Use named exports only (no default exports)
Prefer const arrow functions over function declarations for simple utilities (e.g., const toggle = () => {})
Files:
frontend/internal-packages/schema-bench/src/index.ts
frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/internal-packages/**
📄 CodeRabbit inference engine (AGENTS.md)
Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages
Files:
frontend/internal-packages/schema-bench/src/index.ts
frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts
frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Follow existing import patterns and tsconfig path aliases
Files:
frontend/internal-packages/schema-bench/src/index.ts
frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
**/*.test.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Write unit tests with filenames ending in .test.ts or .test.tsx colocated near source
Files:
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts
🧠 Learnings (1)
📚 Learning: 2025-07-30T05:52:56.270Z
Learnt from: hoshinotsuyoshi
PR: liam-hq/liam#2771
File: frontend/internal-packages/schema-bench/src/cli/executeLiamDb.ts:22-22
Timestamp: 2025-07-30T05:52:56.270Z
Learning: The schema-bench package (frontend/internal-packages/schema-bench) has been converted from ESM to CommonJS mode by removing "type": "module" from package.json, making __dirname available and correct to use in TypeScript files within this package.
Applied to files:
frontend/internal-packages/schema-bench/src/index.ts
frontend/internal-packages/schema-bench/src/dictionaryMatch/index.ts
🧬 Code graph analysis (2)
frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (1)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1)
dictionaryMatch
(141-182)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.test.ts (1)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (2)
buildConceptIndex
(74-99)dictionaryMatch
(141-182)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Supabase Preview
- GitHub Check: Supabase Preview
- GitHub Check: frontend-ci
- GitHub Check: Supabase Preview
🔇 Additional comments (1)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1)
9-12: Ensure the JSON asset is bundled and importable

- No resolveJsonModule in tsconfig, so you can't import the JSON directly in a bundler-friendly way.
- No dist folder or asset-copy config found, so the JSON won't sit next to the compiled code at runtime.

Add one of:
- Enable resolveJsonModule and switch to `import globalConcepts from '../dictionaries/global.concepts.json'`
- Update your build config (or the npm pack "files" list) to copy global.concepts.json into dist (and the published package)
@codex review

Codex Review: Didn't find any major issues. Keep them coming!
…siness entities

Add new concept mappings for customer account management and data privacy domains:

- customer_account: customer/account entity variations
- customer_pii: personal information handling concepts
- closure_reason: account closure and withdrawal reason classifications
- account_closure: account closure and withdrawal event tracking
- data_erasure_request: GDPR/privacy-compliance data deletion workflows
- audit_log: general audit logging concepts

This enhances dictionary-based matching for schema evaluation across different business contexts.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json (1 hunk)
🧰 Additional context used
📓 Path-based instructions (1)
frontend/internal-packages/**
📄 CodeRabbit inference engine (AGENTS.md)
Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages
Files:
frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: CodeQL
- GitHub Check: frontend-ci
- GitHub Check: frontend-lint
- GitHub Check: codeql / languages (javascript) / Perform CodeQL for javascript
…ng and warnings

- Add a warning when no concept dictionaries are loaded to improve debugging visibility
- Handle duplicate aliases gracefully by keeping the first mapping and logging conflicts
- Prevent silent failures when dictionary-based matching is disabled
- Improve developer experience with clear diagnostic messages

This ensures dictionary matching behavior is transparent and debuggable when issues occur.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1 hunk)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
Name utility files in camelCase (e.g., mergeSchema.ts)
Files:
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use TypeScript/TSX across the codebase
**/*.{ts,tsx}
: Prefer early returns for readability
Use named exports only (no default exports)
Prefer const arrow functions over function declarations for simple utilities (e.g., const toggle = () => {})
Files:
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/internal-packages/**
📄 CodeRabbit inference engine (AGENTS.md)
Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages
Files:
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Follow existing import patterns and tsconfig path aliases
Files:
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: frontend-lint
- GitHub Check: frontend-ci
```ts
const normAliases = c.aliases.map((a) => normalizeAlias(a))
conceptToAliases.set(c.id, normAliases)
for (const a of normAliases) {
```
Merge aliases instead of overwriting per concept.

Line 100 replaces any previously registered aliases for the same concept ID. When consumers supply an additional dictionary to extend an existing concept (e.g., adding more synonyms on top of the shipped global file), the earlier alias tokens vanish from conceptToAliases. As soon as that happens, token-based inference (and any heuristics depending on those tokens) can no longer recover the original matches, so valid reference/candidate pairs start failing silently. Please merge with the existing list instead of overwriting.

```diff
 const normAliases = c.aliases.map((a) => normalizeAlias(a))
-conceptToAliases.set(c.id, normAliases)
+const existing = conceptToAliases.get(c.id)
+const merged = existing
+  ? Array.from(new Set([...existing, ...normAliases]))
+  : normAliases
+conceptToAliases.set(c.id, merged)
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```ts
const normAliases = c.aliases.map((a) => normalizeAlias(a))
const existing = conceptToAliases.get(c.id)
const merged = existing
  ? Array.from(new Set([...existing, ...normAliases]))
  : normAliases
conceptToAliases.set(c.id, merged)
for (const a of normAliases) {
```
🤖 Prompt for AI Agents

In frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts around lines 99 to 101, the code currently assigns conceptToAliases.set(c.id, normAliases), which overwrites any previously registered aliases for the same concept ID; change this to merge with existing aliases instead of replacing them by fetching the current alias array (if any), concatenating the new normalized aliases, deduplicating (preserve order or use a Set), and then storing the merged list back into conceptToAliases so previously supplied dictionaries that extend a concept do not lose earlier tokens.
…chitecture

- Simplify buildConceptIndex to use only default dictionary loading
- Update function signatures and variable names for better clarity
- Add comprehensive error handling and warnings for dictionary operations
- Update the test file to match the new function signatures
- Clean up code structure for better maintainability

This refactoring improves the dictionary matching system's reliability and makes the codebase more maintainable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>

- Rename the function from conceptOf to resolveConceptId for better clarity
- Update parameter names: references → referenceNames, candidates → predictNames for precision
- Add a closure_reason alias to improve concept matching coverage
- Improve variable naming throughout dictionaryMatch for better readability

This refactoring enhances code clarity while maintaining the same functionality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Pull Request Overview
This PR adds dictionary-based concept matching to improve schema evaluation accuracy by recognizing semantically equivalent table/column names that use different naming conventions but represent the same concepts.
- Implements dictionary-based matching using pre-registered concept aliases as the first matching strategy
- Adds token-based fallback using Jaccard similarity when direct aliases aren't found
- Includes a global concept dictionary with common database schema patterns
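The two strategies listed above compose in a fixed order: direct alias lookup first, token-based inference only on a miss. A minimal sketch of that ordering, with an illustrative `aliasToConcept` table and a hypothetical `resolveConcept` helper (not the module's real signatures):

```typescript
// Stage 1: direct alias lookup; Stage 2: Jaccard token fallback.
const aliasToConcept = new Map<string, string>([
  ['customer_info', 'customer_info'],
  ['client_data', 'customer_info'],
  ['order_details', 'order_detail'],
])

const tokens = (s: string): Set<string> =>
  new Set(s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean))

const jaccard = (a: Set<string>, b: Set<string>): number => {
  const inter = [...a].filter((t) => b.has(t)).length
  const union = new Set([...a, ...b]).size
  return union === 0 ? 0 : inter / union
}

const resolveConcept = (name: string, threshold = 0.8): string | null => {
  const direct = aliasToConcept.get(name.toLowerCase())
  if (direct) return direct // stage 1: exact alias hit, no scoring needed
  // stage 2: best alias by token overlap, accepted only above threshold
  let best: { id: string; score: number } | null = null
  for (const [alias, id] of aliasToConcept) {
    const score = jaccard(tokens(name), tokens(alias))
    if (score >= threshold && (!best || score > best.score)) best = { id, score }
  }
  return best?.id ?? null
}

console.log(resolveConcept('client_data'))   // direct alias → 'customer_info'
console.log(resolveConcept('unrelated_xyz')) // no concept → null
```

Keeping the exact-alias path first makes dictionary hits cheap and deterministic; the fallback only pays the scoring cost for names the dictionary has never seen.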
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/index.ts | Exports the new dictionaryMatch module |
| src/evaluate/evaluate.ts | Integrates dictionary matching into table and column mapping pipelines |
| src/dictionaryMatch/index.ts | Module export file for dictionary matching functionality |
| src/dictionaryMatch/dictionaryMatch.ts | Core implementation of dictionary-based concept matching |
| src/dictionaryMatch/dictionaryMatch.test.ts | Test suite covering concept matching scenarios |
| src/dictionaries/global.concepts.json | Default concept dictionary with common schema patterns |
```ts
// biome-ignore lint/complexity/noExcessiveCognitiveComplexity: TODO: Refactor to reduce complexity
export function dictionaryMatch(
  referenceNames: string[],
  predictNames: string[],
  mapping: Mapping,
): void {
  const localIndex = buildConceptIndex()

  // Track used candidates to avoid duplicates
  const usedPredictNames = new Set(Object.values(mapping))

  // Precompute candidate concepts
  const predictConceptIdByName = new Map<string, string | null>()
  for (const predictName of predictNames) {
    predictConceptIdByName.set(
      predictName,
      resolveConceptId(predictName, localIndex),
    )
  }

  for (const referenceName of referenceNames) {
    if (mapping[referenceName] !== undefined) continue
    const referenceConceptId = resolveConceptId(referenceName, localIndex)
    if (!referenceConceptId) continue

    // Choose the first unused candidate with the same concept; if multiple, prefer higher token overlap
    let bestPredict: { name: string; score: number } | null = null
    for (const predictName of predictNames) {
      if (usedPredictNames.has(predictName)) continue
      const predictConceptId = predictConceptIdByName.get(predictName)
      if (!predictConceptId || predictConceptId !== referenceConceptId) continue
      // Score by token Jaccard
      const similarityScore = jaccard(
        new Set(toTokens(referenceName)),
        new Set(toTokens(predictName)),
      )
      if (!bestPredict || similarityScore > bestPredict.score)
        bestPredict = { name: predictName, score: similarityScore }
    }
    if (bestPredict) {
      mapping[referenceName] = bestPredict.name
      usedPredictNames.add(bestPredict.name)
    }
  }
}
```
Copilot AI, Sep 30, 2025
The TODO comment indicates acknowledged technical debt. Consider breaking down this function into smaller, more focused functions to improve maintainability and readability.
Suggested refactor:

```ts
// Refactored to reduce complexity by extracting helper functions
export function dictionaryMatch(
  referenceNames: string[],
  predictNames: string[],
  mapping: Mapping,
): void {
  const localIndex = buildConceptIndex()
  const usedPredictNames = new Set(Object.values(mapping))
  const predictConceptIdByName = precomputePredictConceptIds(predictNames, localIndex)
  for (const referenceName of referenceNames) {
    if (mapping[referenceName] !== undefined) continue
    const referenceConceptId = resolveConceptId(referenceName, localIndex)
    if (!referenceConceptId) continue
    const bestPredict = findBestPredict(
      referenceName,
      referenceConceptId,
      predictNames,
      predictConceptIdByName,
      usedPredictNames,
    )
    if (bestPredict) {
      mapping[referenceName] = bestPredict
      usedPredictNames.add(bestPredict)
    }
  }
}

/**
 * Precompute concept IDs for each predicted name.
 */
function precomputePredictConceptIds(
  predictNames: string[],
  index: ConceptIndex,
): Map<string, string | null> {
  const predictConceptIdByName = new Map<string, string | null>()
  for (const predictName of predictNames) {
    predictConceptIdByName.set(
      predictName,
      resolveConceptId(predictName, index),
    )
  }
  return predictConceptIdByName
}

/**
 * Find the best unused predicted name for a reference, matching the concept ID
 * and maximizing token overlap.
 */
function findBestPredict(
  referenceName: string,
  referenceConceptId: string,
  predictNames: string[],
  predictConceptIdByName: Map<string, string | null>,
  usedPredictNames: Set<string>,
): string | null {
  let bestPredict: { name: string; score: number } | null = null
  for (const predictName of predictNames) {
    if (usedPredictNames.has(predictName)) continue
    const predictConceptId = predictConceptIdByName.get(predictName)
    if (!predictConceptId || predictConceptId !== referenceConceptId) continue
    const similarityScore = jaccard(
      new Set(toTokens(referenceName)),
      new Set(toTokens(predictName)),
    )
    if (!bestPredict || similarityScore > bestPredict.score)
      bestPredict = { name: predictName, score: similarityScore }
  }
  return bestPredict ? bestPredict.name : null
}
```
Copilot uses AI. Check for mistakes.
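For context, the hunks above call `toTokens` and `jaccard` helpers that are not shown in this diff. The following is a minimal sketch of what such helpers could look like — the names come from the hunk, but the implementations are assumptions, not the PR's actual code:

```typescript
// Assumed implementations of the helpers referenced in the diff above;
// names come from the hunk, bodies are illustrative only.

/** Split a name into lowercase tokens, handling snake_case and CamelCase. */
const toTokens = (name: string): string[] =>
  name
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2') // break CamelCase boundaries
    .split(/[^A-Za-z0-9]+/) // split on underscores, hyphens, spaces, etc.
    .filter((token) => token.length > 0)
    .map((token) => token.toLowerCase())

/** Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|. */
const jaccard = (a: Set<string>, b: Set<string>): number => {
  const union = new Set([...a, ...b])
  if (union.size === 0) return 1 // two empty token sets count as identical
  let intersectionSize = 0
  for (const token of a) {
    if (b.has(token)) intersectionSize += 1
  }
  return intersectionSize / union.size
}
```

With these definitions, `jaccard(new Set(toTokens('customer_info')), new Set(toTokens('CustomerData')))` compares the token sets {customer, info} and {customer, data}, which share one of three distinct tokens.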
console.warn(
  '[schema-bench] No concept dictionaries loaded; dictionaryMatch will be a no-op.',
)
Copilot
AI
Sep 30, 2025
[nitpick] Using console.warn for library warnings may not be appropriate in all environments. Consider using a proper logging framework or making the warning behavior configurable.
console.warn(
  `[schema-bench] Duplicate alias "${normalizedAlias}" for concepts "${existingConceptId}" and "${concept.id}". Keeping "${existingConceptId}".`,
)
Copilot
AI
Sep 30, 2025
[nitpick] Using console.warn for duplicate alias warnings may not be appropriate in all environments. Consider using a proper logging framework or making the warning behavior configurable.
Actionable comments posted: 2
🧹 Nitpick comments (3)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (3)
68-76: Consider logging errors in readJsonFile for debuggability.

The function silently returns null on any error, which may hide IO issues, permission problems, or malformed JSON. While the calling code handles the null case, logging the error would help troubleshoot deployment or configuration issues.

 function readJsonFile(filePath: string): ConceptDictionaryFile | null {
   try {
     const raw = fs.readFileSync(filePath, 'utf8')
     const parsed = JSON.parse(raw)
     return parsed
-  } catch {
+  } catch (error) {
+    console.warn(`[schema-bench] Failed to read dictionary from ${filePath}:`, error)
     return null
   }
 }
129-145: Consider making the Jaccard threshold configurable.

The 0.8 threshold is hardcoded on line 143. Different use cases may benefit from tuning this parameter. Consider accepting it as an optional parameter to dictionaryMatch with 0.8 as the default.

 function inferConceptByTokens(
   name: string,
   index: ConceptIndex,
+  threshold = 0.8,
 ): string | null {
   const tokens = toTokens(name)
   const tokenSet = new Set(tokens)
   let best: { id: string; score: number } | null = null
   for (const [id, aliases] of index.conceptToAliases) {
     for (const alias of aliases) {
       const aliasTokens = toTokens(alias)
       const score = jaccard(tokenSet, new Set(aliasTokens))
       if (!best || score > best.score) best = { id, score }
     }
   }
-  const threshold = 0.8
   return best && best.score >= threshold ? best.id : null
 }
193-205: Optional: Extract candidate selection logic to reduce complexity.

The nested loop at lines 193-205 could be extracted into a helper function to address the cognitive complexity warning and improve testability.

function findBestMatchingPredict(
  referenceName: string,
  referenceConceptId: string,
  predictNames: string[],
  usedPredictNames: Set<string>,
  predictConceptIdByName: Map<string, string | null>,
): string | null {
  let bestPredict: { name: string; score: number } | null = null
  for (const predictName of predictNames) {
    if (usedPredictNames.has(predictName)) continue
    const predictConceptId = predictConceptIdByName.get(predictName)
    if (!predictConceptId || predictConceptId !== referenceConceptId) continue
    const similarityScore = jaccard(
      new Set(toTokens(referenceName)),
      new Set(toTokens(predictName)),
    )
    if (!bestPredict || similarityScore > bestPredict.score)
      bestPredict = { name: predictName, score: similarityScore }
  }
  return bestPredict?.name ?? null
}

Then simplify the main loop:

   // Choose the first unused candidate with the same concept; if multiple, prefer higher token overlap
-  let bestPredict: { name: string; score: number } | null = null
-  for (const predictName of predictNames) {
-    if (usedPredictNames.has(predictName)) continue
-    const predictConceptId = predictConceptIdByName.get(predictName)
-    if (!predictConceptId || predictConceptId !== referenceConceptId) continue
-    // Score by token Jaccard
-    const similarityScore = jaccard(
-      new Set(toTokens(referenceName)),
-      new Set(toTokens(predictName)),
-    )
-    if (!bestPredict || similarityScore > bestPredict.score)
-      bestPredict = { name: predictName, score: similarityScore }
-  }
-  if (bestPredict) {
-    mapping[referenceName] = bestPredict.name
-    usedPredictNames.add(bestPredict.name)
+  const bestPredictName = findBestMatchingPredict(
+    referenceName,
+    referenceConceptId,
+    predictNames,
+    usedPredictNames,
+    predictConceptIdByName,
+  )
+  if (bestPredictName) {
+    mapping[referenceName] = bestPredictName
+    usedPredictNames.add(bestPredictName)
   }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json (1 hunks)
- frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- frontend/internal-packages/schema-bench/src/dictionaries/global.concepts.json
🧰 Additional context used
📓 Path-based instructions (4)
**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
Name utility files in camelCase (e.g., mergeSchema.ts)
Files:
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use TypeScript/TSX across the codebase
**/*.{ts,tsx}
: Prefer early returns for readability
Use named exports only (no default exports)
Prefer const arrow functions over function declarations for simple utilities (e.g., const toggle = () => {})
Files:
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/internal-packages/**
📄 CodeRabbit inference engine (AGENTS.md)
Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages
Files:
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
frontend/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Follow existing import patterns and tsconfig path aliases
Files:
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Supabase Preview
- GitHub Check: frontend-ci
- GitHub Check: frontend-lint
- GitHub Check: codeql / languages (javascript) / Perform CodeQL for javascript
- GitHub Check: Supabase Preview
🔇 Additional comments (2)
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts (2)
13-48: LGTM! Type definitions are clear and well-documented.

The type hierarchy is appropriate for the dictionary-based matching system. The bidirectional index structure in ConceptIndex enables efficient lookups in both directions.
106-124: Merge aliases per concept ID instead of overwriting.

Line 111 still overwrites previously registered aliases for the same concept ID, despite the past review comment being marked as addressed. When multiple dictionaries extend the same concept with additional aliases, earlier aliases are lost, breaking token-based inference.

Apply this diff to merge aliases:

   const normalizedAliases = concept.aliases.map((alias) =>
     normalizeAlias(alias),
   )
-  conceptToAliases.set(concept.id, normalizedAliases)
+  const existingAliases = conceptToAliases.get(concept.id)
+  const mergedAliases = existingAliases
+    ? Array.from(new Set([...existingAliases, ...normalizedAliases]))
+    : normalizedAliases
+  conceptToAliases.set(concept.id, mergedAliases)
   for (const normalizedAlias of normalizedAliases) {
function inferConceptByTokens(
  name: string,
  index: ConceptIndex,
): string | null {
  const tokens = toTokens(name)
  const tokenSet = new Set(tokens)
  let best: { id: string; score: number } | null = null
  for (const [id, aliases] of index.conceptToAliases) {
    for (const alias of aliases) {
      const aliasTokens = toTokens(alias)
      const score = jaccard(tokenSet, new Set(aliasTokens))
      if (!best || score > best.score) best = { id, score }
    }
  }
  const threshold = 0.8
  return best && best.score >= threshold ? best.id : null
}
🧩 Analysis chain
Verify: Is generic token filtering intentionally omitted?
The PR objectives mention "Generic token filtering to remove common tokens (info, data, table, etc.)" as a core feature, but inferConceptByTokens
uses all tokens from the input without filtering. Was this feature deferred, or should common tokens be filtered before computing Jaccard similarity?
Run this script to check if generic token filtering exists elsewhere:
🏁 Script executed:
#!/bin/bash
# Description: Search for generic token filtering implementation
# Search for filter/blacklist patterns related to common tokens
rg -n -C3 'info|data|table' --type ts -g 'schema-bench/**' | rg -i 'filter|blacklist|stopword|generic.*token'
# Search for any token filtering logic
ast-grep --pattern 'const $FILTER_TOKENS = [$$$]'
# Check if filtering is mentioned in tests
rg -n 'generic.*token|common.*token|filter.*token' --type ts -g '**/dictionaryMatch.test.ts'
Length of output: 69155
Implement generic token filtering or update PR objectives
Function inferConceptByTokens (frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts:129-145) currently includes all tokens when computing Jaccard similarity, without removing common tokens (e.g., info, data, table) as described in the PR objectives. Add filtering of these generic tokens prior to scoring, or remove “Generic token filtering” from the feature list.
🤖 Prompt for AI Agents
In
frontend/internal-packages/schema-bench/src/dictionaryMatch/dictionaryMatch.ts
around lines 129 to 145, inferConceptByTokens currently uses all tokens for
Jaccard scoring but must filter out generic tokens (e.g., "info", "data",
"table") as stated in the PR objectives; implement this by defining a small
set/array of lowercased generic tokens, filter the tokenSet and each aliasTokens
array to remove any tokens in that set (normalize to lower case and strip
empties) before computing the jaccard score, then continue the existing
scoring/threshold logic (or, if you prefer not to implement filtering, remove
"Generic token filtering" from the feature list in the PR instead).
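A minimal sketch of the filtering this prompt asks for — the token list and helper name here are illustrative assumptions based on the PR description, not code from the PR:

```typescript
// Illustrative generic-token filter; the token list is an assumption drawn from
// the PR description ("info", "data", "table", etc.), not the actual dictionary.
const GENERIC_TOKENS = new Set(['info', 'data', 'table', 'record', 'list'])

/** Remove generic tokens, but keep the originals if nothing would remain. */
const filterGenericTokens = (tokens: string[]): string[] => {
  const filtered = tokens.filter((token) => !GENERIC_TOKENS.has(token))
  // If every token was generic (e.g. "data_info"), fall back to the full set
  // so the name can still participate in Jaccard scoring.
  return filtered.length > 0 ? filtered : tokens
}
```

Inside inferConceptByTokens, both the input tokens and each alias's tokens would pass through such a filter before the sets are built for jaccard, so that a shared "info" or "data" token no longer inflates the score.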
export function dictionaryMatch(
  referenceNames: string[],
  predictNames: string[],
  mapping: Mapping,
): void {
  const localIndex = buildConceptIndex()

  // Track used candidates to avoid duplicates
  const usedPredictNames = new Set(Object.values(mapping))

  // Precompute candidate concepts
  const predictConceptIdByName = new Map<string, string | null>()
  for (const predictName of predictNames) {
    predictConceptIdByName.set(
      predictName,
      resolveConceptId(predictName, localIndex),
    )
  }

  for (const referenceName of referenceNames) {
    if (mapping[referenceName] !== undefined) continue
    const referenceConceptId = resolveConceptId(referenceName, localIndex)
    if (!referenceConceptId) continue

    // Choose the first unused candidate with the same concept; if multiple, prefer higher token overlap
    let bestPredict: { name: string; score: number } | null = null
    for (const predictName of predictNames) {
      if (usedPredictNames.has(predictName)) continue
      const predictConceptId = predictConceptIdByName.get(predictName)
      if (!predictConceptId || predictConceptId !== referenceConceptId) continue
      // Score by token Jaccard
      const similarityScore = jaccard(
        new Set(toTokens(referenceName)),
        new Set(toTokens(predictName)),
      )
      if (!bestPredict || similarityScore > bestPredict.score)
        bestPredict = { name: predictName, score: similarityScore }
    }
    if (bestPredict) {
      mapping[referenceName] = bestPredict.name
      usedPredictNames.add(bestPredict.name)
    }
  }
}
Optimize: Cache or inject the concept index to avoid rebuilding.

Line 173 calls buildConceptIndex() on every invocation of dictionaryMatch, which rereads and reparses the dictionary file and rebuilds the maps. If this function is called multiple times during evaluation (e.g., for tables and columns separately), this becomes a significant performance bottleneck.
Consider one of these approaches:
Option 1: Module-level cache (simplest)
+let cachedIndex: ConceptIndex | null = null
+
export function dictionaryMatch(
referenceNames: string[],
predictNames: string[],
mapping: Mapping,
): void {
- const localIndex = buildConceptIndex()
+ if (!cachedIndex) {
+ cachedIndex = buildConceptIndex()
+ }
+ const localIndex = cachedIndex
Option 2: Accept index as parameter (more flexible)
export function dictionaryMatch(
referenceNames: string[],
predictNames: string[],
mapping: Mapping,
+ index?: ConceptIndex,
): void {
- const localIndex = buildConceptIndex()
+ const localIndex = index ?? buildConceptIndex()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Add a one-time cache for the concept index
let cachedIndex: ConceptIndex | null = null

export function dictionaryMatch(
  referenceNames: string[],
  predictNames: string[],
  mapping: Mapping,
): void {
  // Reuse the cached index if available
  if (!cachedIndex) {
    cachedIndex = buildConceptIndex()
  }
  const localIndex = cachedIndex
  // Track used candidates to avoid duplicates
  const usedPredictNames = new Set(Object.values(mapping))
  // Precompute candidate concepts
  const predictConceptIdByName = new Map<string, string | null>()
  for (const predictName of predictNames) {
    predictConceptIdByName.set(
      predictName,
      resolveConceptId(predictName, localIndex),
    )
  }
  for (const referenceName of referenceNames) {
    if (mapping[referenceName] !== undefined) continue
    const referenceConceptId = resolveConceptId(referenceName, localIndex)
    if (!referenceConceptId) continue
    // Choose the first unused candidate with the same concept; if multiple, prefer higher token overlap
    let bestPredict: { name: string; score: number } | null = null
    for (const predictName of predictNames) {
      if (usedPredictNames.has(predictName)) continue
      const predictConceptId = predictConceptIdByName.get(predictName)
      if (!predictConceptId || predictConceptId !== referenceConceptId) continue
      // Score by token Jaccard
      const similarityScore = jaccard(
        new Set(toTokens(referenceName)),
        new Set(toTokens(predictName)),
      )
      if (!bestPredict || similarityScore > bestPredict.score)
        bestPredict = { name: predictName, score: similarityScore }
    }
    if (bestPredict) {
      mapping[referenceName] = bestPredict.name
      usedPredictNames.add(bestPredict.name)
    }
  }
}
import { describe, expect, it } from 'vitest'
import { dictionaryMatch } from './dictionaryMatch'

describe('dictionaryMatch (concept-based)', () => {
👍

@NoritakaIkeda Sorry for late 🙏🏻 I'm reviewing now!
Thanks!
@@ -1,3 +1,4 @@
+export * from './dictionaryMatch'
Is this export necessary?
): Promise<Mapping> => {
  const columnMapping: Mapping = {}

  // NOTE: Implement synonym matching if needed
This comment should have been implemented this time. Shouldn't you delete it?
): Promise<Mapping> => {
  const tableMapping: Mapping = {}

  // NOTE: Implement synonym matching if needed
This comment should have been implemented this time. Shouldn't you delete it?
 * - References with no resolvable concept remain unmapped.
 */

// biome-ignore lint/complexity/noExcessiveCognitiveComplexity: TODO: Refactor to reduce complexity
Ideally, I'd prefer not to have any new code created that violates the rules, but this time it's acceptable as is.
#3637 (comment) I want to fix these three immediately, but since I need to merge this PR urgently, I'll merge it now. I'll address them in the next PR. |
Issue
Why is this change needed?
This PR adds dictionary-based concept matching to the schema benchmarking evaluation system to improve accuracy when comparing predicted and reference schemas.
Without Dictionary
With Dictionary
What this PR includes:
Core Features
Technical Implementation
- Adds dictionaryMatch() function as the first matching strategy in the evaluation pipeline

Files Added
- src/dictionaryMatch/dictionaryMatch.ts - Core dictionary matching implementation
- src/dictionaryMatch/dictionaryMatch.test.ts - Comprehensive test coverage
- src/dictionaries/global.concepts.json - Default concept dictionary with common patterns

Test Coverage
This enhancement improves schema evaluation accuracy by recognizing semantically equivalent table/column names that may use different naming conventions but represent the same concepts.
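For illustration, an alias-group entry in such a dictionary might look like the following — the shape mirrors the ConceptDictionaryFile type referenced in the diffs, but the field names and concrete entries beyond the customer_info/client_data pair mentioned above are hypothetical:

```typescript
// Hypothetical dictionary content; the shape follows the ConceptDictionaryFile
// type referenced in the diffs, and the entries are examples only.
type Concept = { id: string; aliases: string[] }
type ConceptDictionaryFile = { concepts: Concept[] }

const exampleDictionary: ConceptDictionaryFile = {
  concepts: [
    // Names in the same alias group resolve to one concept id, so
    // "customer_info" and "client_data" are treated as a match.
    { id: 'customer', aliases: ['customer_info', 'client_data', 'customers'] },
    { id: 'order', aliases: ['orders', 'purchase_history', 'sales_orders'] },
  ],
}

// A reverse index (alias -> concept id) of the kind buildConceptIndex constructs;
// first registration wins, matching the duplicate-alias warning shown above.
const aliasToConceptId = new Map<string, string>()
for (const concept of exampleDictionary.concepts) {
  for (const alias of concept.aliases) {
    if (!aliasToConceptId.has(alias)) aliasToConceptId.set(alias, concept.id)
  }
}
```

Two reference/predict names then match whenever their aliases map to the same concept id, regardless of surface spelling.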
Summary by CodeRabbit
New Features
Tests