@soodoku soodoku commented Nov 12, 2025

No description provided.

This is a planning document outlining potential novel sources for
transliteration data collection. These are PROPOSALS only - not
implemented or validated.

Proposed sources include:
- Wikidata entity labels
- Wikipedia interwiki links
- OpenStreetMap place names
- Indian Railway stations
- Electoral rolls (with privacy considerations)

Each section states its implementation requirements and the validation
still needed, and includes an honest assessment of complexity along with
recommendations.

Status: Planning only - requires implementation and testing

Removing planning document. Repository now contains only validated
existing data sources (affidavits, ESPN, IIT Bombay).

Implements a working scraper that extracts railway station name
transliterations from Wikipedia via interwiki links.

Features:
- Scrapes 2,985+ Indian railway stations from Wikipedia
- Fetches multilingual names via Wikipedia API
- Supports 11 South Asian languages (bn, hi, ta, te, kn, ml, mr, gu, pa, or, ur)
- Rate-limited, respectful scraping
- TSV output format
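
The commit's script is not shown in this conversation, so the following is only a minimal sketch of how the "multilingual names via Wikipedia API" and "TSV output" steps could fit together. It assumes the MediaWiki `prop=langlinks` query module with the default JSON format, where each language link carries a `lang` code and the foreign title under `*`. The function names, the TSV column names, and the sample response are hypothetical, not taken from the repository.

```python
import csv
import io

# Language codes listed in the feature list above.
TARGET_LANGS = {"bn", "hi", "ta", "te", "kn", "ml", "mr", "gu", "pa", "or", "ur"}


def langlinks_params(title):
    """Query parameters for the MediaWiki API's prop=langlinks module."""
    return {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
    }


def extract_pairs(api_response, target_langs=TARGET_LANGS):
    """Turn one langlinks response into (english, lang, native) tuples."""
    pairs = []
    pages = api_response.get("query", {}).get("pages", {})
    for page in pages.values():
        english = page.get("title", "")
        # Pages without interwiki links simply have no "langlinks" key.
        for link in page.get("langlinks", []):
            if link.get("lang") in target_langs:
                pairs.append((english, link["lang"], link["*"]))
    return pairs


def to_tsv(pairs):
    """Serialize pairs as TSV text with a header row (column names assumed)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["english", "lang", "native"])
    writer.writerows(pairs)
    return out.getvalue()


# Hand-written sample shaped like an API response; not real scraper output.
sample = {
    "query": {
        "pages": {
            "1": {
                "title": "Howrah",
                "langlinks": [
                    {"lang": "hi", "*": "हावड़ा"},
                    {"lang": "fr", "*": "Howrah"},
                ],
            }
        }
    }
}
print(to_tsv(extract_pairs(sample)), end="")
```

Keeping the parsing separate from the HTTP call means it can be checked against saved responses without touching the network. In a real run, the dictionary from `langlinks_params` would presumably be passed to something like `requests.get("https://en.wikipedia.org/w/api.php", params=...)`.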

Validation:
✅ Tested with 10 stations
✅ Successfully extracted 23 transliteration pairs
✅ Real output verified (see README)
✅ 7 languages validated in test run

Expected output: 5K-10K pairs when run on all stations (estimated runtime 10-15 min)

Requirements: requests, beautifulsoup4

Implements a generalized scraper for extracting transliteration pairs from
ANY Wikipedia category, not just railway stations. It extracts proper nouns
(people, places, organizations) via interwiki links.

Features:
- Category-based article fetching (any Wikipedia category)
- Random article sampling for broad coverage
- Wikipedia API for interwiki link extraction
- Supports 15 South Asian languages
- Rate-limited, respectful scraping
- TSV output format
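
As a rough illustration of the category-based fetching and rate limiting listed above, the sketch below pages through the MediaWiki `list=categorymembers` module and follows the API's `continue` object. It is not the merged implementation. The `fetch` callable, the `delay` default, and the category name in the usage note are assumptions made for the example.

```python
import time


def category_members(category, fetch, delay=1.0, sleep=time.sleep):
    """Yield article titles in a category, following API continuation.

    `fetch` is any callable that takes a dict of query parameters and
    returns the decoded JSON response (hypothetical helper, injected so
    the paging logic can run without network access).
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmnamespace": "0",  # main namespace: articles only
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for member in data.get("query", {}).get("categorymembers", []):
            yield member["title"]
        cont = data.get("continue")
        if not cont:
            break
        # Merge the continuation tokens into the next request.
        params = {**params, **cont}
        # Pause between requests to keep the scraping respectful.
        sleep(delay)
```

With `requests` installed, `fetch` could be as simple as `lambda p: requests.get("https://en.wikipedia.org/w/api.php", params=p, timeout=10).json()`, and `category_members("Indian actors", fetch)` would then stream titles to pass on to the interwiki lookup. The category name here is only a placeholder.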

Validation:
✅ Tested with 10 Indian actors
✅ Successfully extracted 6 transliteration pairs
✅ Real output: Telugu and Marathi actor names
✅ 60% success rate (articles with interwiki links)

This generalizes the railway station approach to work on any proper nouns.

Requirements: requests, beautifulsoup4
@soodoku soodoku merged commit af17afa into master Nov 12, 2025
11 checks passed
@soodoku soodoku deleted the claude/transliteration-south-asian-languages-011CV3TZDskEX2bFTTJExGEx branch November 12, 2025 06:41
3 participants