1- # Indicate: Transliterate Indic Languages to English
1+ # Indicate: Transliterate Indic Languages with TensorFlow and LLMs
22
33[![Notary Badge](https://notarypy.soodoku.workers.dev/badge/indicate/0.2.1/indicate-0.2.1-py3-none-any.whl)](https://pypi.org/integrity/indicate/0.2.1/indicate-0.2.1-py3-none-any.whl/provenance)
44[![PyPI Version](https://img.shields.io/pypi/v/indicate.svg)](https://pypi.python.org/pypi/indicate)
55[![Downloads](https://static.pepy.tech/badge/indicate)](https://pepy.tech/project/indicate)
66[![Tests](https://github.com/in-rolls/indicate/workflows/test/badge.svg)](https://github.com/in-rolls/indicate/actions?query=workflow%3Atest)
77[![Documentation](https://img.shields.io/badge/docs-github.io-blue)](https://in-rolls.github.io/indicate/)
88
9- Transliterations to/from Indian languages are still generally low quality. One problem is access to data. Another is that there is no standard transliteration.
9+ **Indicate** provides high-quality transliteration between Indic languages and English using both traditional TensorFlow models and state-of-the-art LLMs (Large Language Models).
1010
11- For Hindi--English, we build novel dataset for names using the ESPNcricinfo. For instance, see [here](https://www.espncricinfo.com/hindi/series/pakistan-tour-of-england-2021-1239529/england-vs-pakistan-1st-odi-1239537/full-scorecard) for hindi version of the [english scorecard](https://www.espncricinfo.com/series/pakistan-tour-of-england-2021-1239529/england-vs-pakistan-1st-odi-1239537/full-scorecard).
11+ ## 🚀 Features
1212
13- We also create a dataset from [election affidavits](https://affidavit.eci.gov.in/CandidateCustomFilter) and exploit the [Google Dakshina dataset](https://github.com/google-research-datasets/dakshina).
13+ - **🧠 Dual Backend Support**: Choose between TensorFlow models or LLM-based transliteration
14+ - **🌍 Multi-Language**: 12+ Indic languages (Hindi, Tamil, Telugu, Bengali, etc.)
15+ - **🔄 Bidirectional**: Supports both Indic→English and English→Indic transliteration
16+ - **🛡️ Production Ready**: Safe file handling, atomic writes, backup support
17+ - **📊 Structured Output**: Rich JSON format with metadata and error handling
18+ - **⚡ Batch Processing**: Efficient processing of large files with progress tracking
1419
15- To overcome the fact that there isn't one standard way of transliteration, we provide k-best transliterations.
20+ ## 🎯 Supported Languages
21+
22+ Hindi • Tamil • Telugu • Bengali • Gujarati • Kannada • Malayalam • Punjabi • Marathi • Odia • Urdu • Sanskrit ↔ English
1623
1724## Install
1825
@@ -24,99 +31,179 @@ We strongly recommend installing `indicate` inside a Python virtual environment
2431pip install indicate
2532```
2633
27- ## Usage
34+ ## 🔧 Quick Setup
2835
29- ### Python API
36+ ### For LLM-based transliteration (recommended):
37+ ```bash
38+ pip install indicate
3039
31- ```python
32- from indicate import transliterate
33- english_translated = transliterate.hindi2english("हिंदी")
34- print(english_translated)
35- # Output: hindi
40+ # Set your API key (choose one):
41+ export OPENAI_API_KEY=your-key
42+ export ANTHROPIC_API_KEY=your-key
43+ export GOOGLE_API_KEY=your-key
44+ ```
45+
46+ ### For TensorFlow-only usage:
47+ ```bash
48+ pip install indicate
49+ # No API key needed - uses pre-trained models
3650```
3751
38- ### Command Line Interface
52+ ## 🎯 Usage
3953
40- The package provides both modern and legacy CLI interfaces:
54+ ### 🧠 LLM-Based Transliteration (New!)
4155
42- #### Modern CLI (Recommended)
56+ The LLM backend provides higher accuracy and supports all Indic languages:
4357
4458```bash
45- # Basic usage
59+ # Simple transliteration (auto-detects Hindi)
60+ indicate llm "राजशेखर चिंतालपति"
61+ # Output: Rajashekar Chintalapati
62+
63+ # Specify languages explicitly
64+ indicate llm "முருகன்" --source tamil --target english
65+ # Output: Murugan
66+
67+ # Between Indic languages
68+ indicate llm "नमस्ते" --source hindi --target tamil
69+ # Output: நமஸ்தே
70+
71+ # Safe batch processing with structured JSON output
72+ indicate llm --input names.txt --output results.json --format json --batch --backup
73+
74+ # Dry run to preview changes
75+ indicate llm --input large_file.txt --dry-run
76+ ```
77+
78+ **Python API:**
79+ ```python
80+ from indicate import IndicLLMTransliterator
81+
82+ # Initialize for any language pair
83+ transliterator = IndicLLMTransliterator('hindi', 'english')
84+ result = transliterator.transliterate('राजशेखर चिंतालपति')
85+ print(result)  # Output: Rajashekar Chintalapati
86+
87+ # Batch processing
88+ texts = ["राजेश", "गौरव", "प्रिया"]
89+ results = transliterator.transliterate_batch(texts)
90+ print(results)  # ['Rajesh', 'Gaurav', 'Priya']
91+ ```
92+
93+ ### 🤖 TensorFlow Backend (Traditional)
94+
95+ ```bash
96+ # Hindi to English using TensorFlow model
4697indicate hindi2english "राजशेखर चिंतालपति"
98+ # Output: rajashekar chintalapati
4799
48100# From file
49101indicate hindi2english --input hindi.txt --output english.txt
50102
51- # From stdin
52- echo "गौरव सूद" | indicate hindi2english
53-
54- # Batch processing for large files
55- indicate hindi2english --input large_file.txt --batch --quiet
103+ # Batch processing
104+ indicate hindi2english --input large_file.txt --batch
105+ ```
56106
57- # Get help
58- indicate hindi2english --help
107+ **Python API:**
108+ ```python
109+ from indicate import hindi2english
110+ result = hindi2english("हिंदी")
111+ print(result)  # Output: hindi
112+ ```
59113
60- # Package information
61- indicate info
114+ ## 📊 JSON Output Format
115+
116+ The LLM backend provides rich, structured output perfect for data processing:
117+
118+ ```json
119+ {
120+   "metadata": {
121+     "source_language": "hindi",
122+     "target_language": "english",
123+     "timestamp": "2024-12-09T12:00:00Z",
124+     "total_lines": 3,
125+     "successful_lines": 3,
126+     "failed_lines": 0,
127+     "encoding": "utf-8"
128+   },
129+   "results": [
130+     {
131+       "line_number": 1,
132+       "input_text": "राजेश कुमार",
133+       "output_text": "Rajesh Kumar",
134+       "source_lang": "hindi",
135+       "target_lang": "english",
136+       "confidence": "high",
137+       "processing_time": 1.2,
138+       "timestamp": "2024-12-09T12:00:01Z"
139+     }
140+   ]
141+ }
62142```
63143
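As a rough illustration of consuming the structure shown above, the following sketch summarizes a parsed results document using only the field names from the example. The helper names `load_results` and `summarize_results` are hypothetical and not part of `indicate`, and treating any `confidence` value other than `"high"` as needing review is an assumption, since the README shows only the `"high"` value.

```python
import json
from pathlib import Path


def load_results(path):
    """Read a results file written with --format json (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_results(payload):
    """Summarize a parsed results document with the layout shown above."""
    metadata = payload.get("metadata", {})
    results = payload.get("results", [])
    # Map each line number to its (input, output) pair.
    by_line = {
        entry["line_number"]: (entry["input_text"], entry["output_text"])
        for entry in results
    }
    # Assumption: anything not marked "high" deserves a manual look.
    needs_review = [
        entry["line_number"]
        for entry in results
        if entry.get("confidence") != "high"
    ]
    return {
        "total": metadata.get("total_lines", len(results)),
        "failed": metadata.get("failed_lines", 0),
        "by_line": by_line,
        "needs_review": needs_review,
    }
```

A call such as `summarize_results(load_results("results.json"))` would then give a compact view of a batch run without re-reading every entry by hand.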
64- #### Legacy CLI (Backward Compatibility)
144+ ## 🛡️ Safety Features
145+
146+ - **🔒 Input/Output Validation**: Prevents accidental file overwrites
147+ - **⚛️ Atomic Writing**: Safe file operations using temporary files
148+ - **💾 Automatic Backups**: Optional timestamped backups of existing files
149+ - **🔄 Resume Support**: Resume interrupted batch operations
150+ - **👁️ Dry Run Mode**: Preview operations before execution
151+
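To make the atomic-write and backup bullets concrete, here is a minimal sketch of the general technique: write to a temporary file in the same directory, then swap it into place with `os.replace`. This is not taken from the `indicate` source; the function name `atomic_write_text` and the `.bak` naming scheme are assumptions made for illustration.

```python
import os
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def atomic_write_text(path, text, backup=False):
    """Write text so readers never see a half-written file (illustrative only)."""
    target = Path(path)
    backup_path = None
    if backup and target.exists():
        # Hypothetical naming scheme: <name>.<timestamp>.bak next to the original.
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        backup_path = target.with_name(f"{target.name}.{stamp}.bak")
        shutil.copy2(target, backup_path)
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites the destination in a single step.
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return backup_path
```

If the process dies before `os.replace` runs, the existing output file is left untouched, which is the property the "Atomic Writing" bullet describes.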
152+ ## 🎛️ Advanced Usage
65153
66154```bash
67- # Still supported for backward compatibility
68- hindi2english --type hin2eng --input "हिंदी"
69- ```
155+ # Show few-shot examples being used
156+ indicate llm --show-examples --source bengali --target english
70157
71- ## Functions
158+ # Resume interrupted batch job
159+ indicate llm --input large_file.txt --output results.txt --resume
72160
73- We expose 1 function, which will take Hindi text and transliterate it to English.
161+ # Use specific LLM provider/model
162+ indicate llm "text" --provider anthropic --model claude-3-opus
74163
75- - **transliterate.hindi2english(input)**
76-   - What it does: Converts given hindi text into English alphabet
77-   - Output: Returns text in English
164+ # Process JSON from previous results
165+ indicate llm --input results.json --source english --target hindi
166+ ```
167+
168+ ## 🔄 Backend Comparison
78169
79- ## Testing Locally
170+ | Feature | TensorFlow Backend | LLM Backend |
171+ |---------|------------------|-------------|
172+ | **Languages** | Hindi ↔ English only | 12+ Indic languages ↔ English + Inter-Indic |
173+ | **Setup** | No API key needed | Requires LLM API key |
174+ | **Speed** | Very fast (local) | Moderate (API calls) |
175+ | **Accuracy** | Good for common words | Excellent for all types |
176+ | **Cost** | Free | Pay per API call |
177+ | **Offline** | ✅ Works offline | ❌ Requires internet |
178+ | **Batch Processing** | ✅ | ✅ with safety features |
81- To test the package locally, follow these steps:
180+ ## 🧪 Testing Locally
82181
83- 1. **Clone the repository**:
182+ 1. **Clone and install**:
84183 ```bash
85184 git clone https://github.com/in-rolls/indicate.git
86185 cd indicate
186+ uv sync # or pip install -e .
87187 ```
88188
89- 2. **Install with uv (recommended)**:
189+ 2. **Run tests**:
90190 ```bash
91- uv sync
92- ```
191+ # All tests
192+ python -m pytest
93193
94- Or with pip:
95- ```bash
96- python -m venv venv
97- source venv/bin/activate # On Windows: venv\Scripts\activate
98- pip install -e .
194+ # Specific tests
195+ python -m pytest tests/test_llm_indic.py
196+ python -m pytest tests/test_file_safety.py
99197 ```
100198
101- 3. **Run tests**:
199+ 3. **Test both backends**:
102200 ```bash
103- # Run all tests
104- python -m unittest discover tests/
105-
106- # Run specific test
107- python -m unittest tests.test_010_hindi_translate
108- ```
109-
110- 4. **Test the transliteration**:
111- ```bash
112- # Modern CLI
201+ # TensorFlow backend
113202 indicate hindi2english "हिंदी"
114203
115- # Legacy CLI
116- hindi2english --type hin2eng --input "हिंदी"
117-
118- # Python usage
119- python -c "from indicate import transliterate; print(transliterate.hindi2english('हिंदी'))"
204+ # LLM backend (set API key first)
205+ export OPENAI_API_KEY=your-key
206+ indicate llm "हिंदी"
120207 ```
121208
122209## Data