Reproducible Hindi-English sound-matching benchmark and retrieval baselines.
Find English words that sound similar to Hindi words written in Devanagari. The package includes:
- A curated benchmark of 2,959 Hindi-English pairs from Xlit-Crowd
- Three retrieval methods: orthographic, phonetic, and hybrid
- CLI tools and Python API for evaluation and querying

Install with pip:

    pip install bolingual

Or with uv:

    uv add bolingual

To use the interactive web app:

    pip install "bolingual[app]"

Then run:

    bolingual-app

Or directly with streamlit:

    streamlit run bolingual/app.py

Evaluation on held-out test set (740 items, top_k=200, alpha=0.5):

| Method | Top-1 | Top-5 | Top-10 | MRR |
|---|---|---|---|---|
| Orthographic | 53.1% | 70.1% | 74.3% | 0.609 |
| Phonetic | 61.9% | 79.1% | 82.3% | 0.694 |
| Hybrid | 70.3% | 82.0% | 84.2% | 0.757 |

On "hard" examples where orthographic keys don't match (216 items):

| Method | Top-1 | Top-5 | MRR |
|---|---|---|---|
| Orthographic | 21.8% | 42.1% | 0.318 |
| Phonetic | 52.3% | 74.1% | 0.615 |
| Hybrid | 66.7% | 80.1% | 0.725 |
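
Examples where the hybrid method ranks the gold word first while the orthographic method ranks it much lower:

| Hindi | Gold | Orthographic Rank | Hybrid Rank |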
|---|---|---|---|
| सर्किट्स | circuits | 164 | 1 |
| न्यू | new | 122 | 1 |
| थैचर | thatcher | 116 | 1 |
| गेट | gate | 90 | 1 |
| फ्रॉस्ट | frost | 87 | 1 |
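
The Top-k and MRR columns in the tables above can be reproduced from per-query gold ranks. The following is a minimal sketch, assuming the usual definitions (Top-k is the fraction of queries whose gold word appears in the first k results; MRR is the mean of 1/rank) and assuming a gold word that is never retrieved counts as a miss. The function names are illustrative and are not part of the bolingual API; the package's own evaluation code may differ.

```python
from statistics import mean


def top_k_accuracy(ranks, k):
    """Fraction of queries whose gold word is ranked within the top k.

    Ranks are 1-based; None means the gold word was not retrieved (assumed a miss).
    """
    return mean(1.0 if r is not None and r <= k else 0.0 for r in ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank, with 0 contributed by queries whose gold word is missing."""
    return mean(1.0 / r if r is not None else 0.0 for r in ranks)


# Gold-word ranks taken from the five example rows in the table above.
orthographic_ranks = [164, 122, 116, 90, 87]
hybrid_ranks = [1, 1, 1, 1, 1]

for name, ranks in [("orthographic", orthographic_ranks), ("hybrid", hybrid_ranks)]:
    print(name, top_k_accuracy(ranks, 10), round(mean_reciprocal_rank(ranks), 4))
```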

Query from Python:

    import pandas as pd

    from bolingual import CandidateIndex

    # Build index from benchmark vocabulary
    benchmark = pd.read_csv("data/processed/benchmark.csv")
    index = CandidateIndex.from_words(benchmark["gold"].unique())

    # Query with Hindi word
    rankings = index.hybrid_ranking("वॉटसन", top_k=200)
    for score, word in rankings["hybrid"][:5]:
        print(f"{word}: {score:.3f}")

Build benchmark:

    bolingual-build-benchmark --raw data/raw/crowd_transliterations.hi-en.txt --output benchmark.csv

Run experiments:

    bolingual-run-experiment --benchmark benchmark.csv --output-dir results/

Query interactively:

    bolingual-query --hindi "वॉटसन" --show 10

Methods:

- Orthographic: Romanize Hindi → normalize to coarse keys → rank with Jaro-Winkler similarity
- Phonetic: Rerank orthographic shortlist using weighted edit distance over Hindi pronunciation variants and CMUdict pronunciations
- Hybrid: Combine orthographic and phonetic scores (α=0.5 weighting)
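
The hybrid step can be pictured as a weighted blend of the two scores followed by a re-sort. The sketch below assumes both scores are similarities in [0, 1] where higher is better, and that α weights the orthographic score while 1 − α weights the phonetic score; the source does not say which side α applies to, and at α=0.5 the two readings coincide. The function names and the scores are made up for illustration and are not the package's internals.

```python
def hybrid_scores(orth, phon, alpha=0.5):
    """Blend two {word: score} maps; a word missing from one map scores 0 there.

    Assumption: alpha weights the orthographic score, (1 - alpha) the phonetic one.
    """
    words = set(orth) | set(phon)
    return {
        w: alpha * orth.get(w, 0.0) + (1 - alpha) * phon.get(w, 0.0)
        for w in words
    }


def rank(scores, top_k=5):
    """Return (score, word) pairs, best first; ties are broken alphabetically."""
    pairs = [(s, w) for w, s in scores.items()]
    return sorted(pairs, key=lambda p: (-p[0], p[1]))[:top_k]


# Made-up scores for a single query: orthography alone would prefer "weston".
orth = {"weston": 0.85, "watson": 0.80, "wasting": 0.70}
phon = {"watson": 0.95, "weston": 0.70, "wasting": 0.40}

ranked = rank(hybrid_scores(orth, phon, alpha=0.5))
for score, word in ranked:
    print(f"{word}: {score:.3f}")
```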

Key features:

- Hindi schwa deletion handling (multiple pronunciation variants)
- Phonetic similarity groups for soft consonant/vowel matching
- CMUdict-backed English pronunciations
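
To illustrate how similarity groups and multiple pronunciation variants might interact, here is a minimal sketch of a weighted edit distance over phoneme sequences. Everything specific in it is an assumption: the groups, the costs (0.3 for a within-group substitution, 1.0 otherwise), the use of stress-stripped ARPAbet symbols on both sides, and the two Hindi variants for वॉटसन. The package's actual groups and weights may differ.

```python
# Hypothetical similarity groups over stress-stripped ARPAbet symbols.
SIMILARITY_GROUPS = [
    {"V", "W"},
    {"T", "TH"},
    {"D", "DH"},
    {"S", "SH", "Z"},
    {"AA", "AO", "AH"},
    {"IY", "IH"},
]


def substitution_cost(a, b, soft_cost=0.3):
    """0 for identical phonemes, a reduced cost within a group, 1 otherwise."""
    if a == b:
        return 0.0
    if any(a in group and b in group for group in SIMILARITY_GROUPS):
        return soft_cost
    return 1.0


def weighted_edit_distance(src, tgt, indel_cost=1.0):
    """Levenshtein-style dynamic program with phoneme-aware substitution costs."""
    prev = [j * indel_cost for j in range(len(tgt) + 1)]
    for i, a in enumerate(src, start=1):
        curr = [i * indel_cost]
        for j, b in enumerate(tgt, start=1):
            curr.append(min(
                prev[j] + indel_cost,                  # delete a from src
                curr[j - 1] + indel_cost,              # insert b into src
                prev[j - 1] + substitution_cost(a, b), # substitute or match
            ))
        prev = curr
    return prev[-1]


def best_variant_distance(variants, tgt):
    """Score a word by its closest pronunciation variant."""
    return min(weighted_edit_distance(v, tgt) for v in variants)


# Illustrative variants for वॉटसन: medial schwa kept vs. deleted.
hindi_variants = [
    ["V", "AO", "T", "AH", "S", "AH", "N"],
    ["V", "AO", "T", "S", "AH", "N"],
]
watson = ["W", "AA", "T", "S", "AH", "N"]  # CMUdict "watson", stress removed

print(best_variant_distance(hindi_variants, watson))
```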

Benchmark details:

- Source: Xlit-Crowd Hindi-English transliteration data
- Size: 2,959 items (2,219 dev / 740 test)
- Vocabulary: 2,673 unique English words
- Construction: Deterministic pipeline with CMUdict filtering
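
The source does not spell out the construction steps, so the following is only a sketch of what a deterministic pipeline with CMUdict filtering could look like: keep pairs whose English side has a known pronunciation, deduplicate, and sort so that repeated runs produce identical output. The function name, the lowercasing step, and the tiny stand-in vocabulary are assumptions, not the behavior of `bolingual-build-benchmark`.

```python
def build_benchmark(pairs, cmu_words):
    """Filter (hindi, english) pairs to those with a CMUdict entry.

    Lowercasing, deduplication, and sorting are assumed steps that make
    the output independent of input order.
    """
    kept = {(hi, en.lower()) for hi, en in pairs if en.lower() in cmu_words}
    return sorted(kept)


# Stand-in for the set of words that have CMUdict pronunciations.
cmu_words = {"gate", "new", "frost"}

raw_pairs = [
    ("न्यू", "New"),
    ("गेट", "gate"),
    ("न्यू", "new"),   # duplicate after lowercasing
    ("गेट", "gaet"),   # no pronunciation available, so it is dropped
]

benchmark = build_benchmark(raw_pairs, cmu_words)
for hindi, english in benchmark:
    print(hindi, english)
```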

Development:

    git clone https://github.com/soodoku/bolingual
    cd bolingual
    uv sync --all-extras
    uv run pytest
    uv run ruff check src tests

License: MIT