Commit 24ecd7a

let's add llm based transliteration + care in schema, overwriting
1 parent 4fba09b commit 24ecd7a

25 files changed

+5111
-367
lines changed

README.md

Lines changed: 148 additions & 61 deletions
Original file line number | Diff line number | Diff line change
@@ -1,18 +1,25 @@
1-
# Indicate: Transliterate Indic Languages to English
1+
# Indicate: Transliterate Indic Languages with TensorFlow and LLMs
22

33
[![Notary Badge](https://notarypy.soodoku.workers.dev/badge/indicate/0.2.1/indicate-0.2.1-py3-none-any.whl)](https://pypi.org/integrity/indicate/0.2.1/indicate-0.2.1-py3-none-any.whl/provenance)
44
[![PyPI Version](https://img.shields.io/pypi/v/indicate.svg)](https://pypi.python.org/pypi/indicate)
55
[![Downloads](https://static.pepy.tech/badge/indicate)](https://pepy.tech/project/indicate)
66
[![Tests](https://github.com/in-rolls/indicate/workflows/test/badge.svg)](https://github.com/in-rolls/indicate/actions?query=workflow%3Atest)
77
[![Documentation](https://img.shields.io/badge/docs-github.io-blue)](https://in-rolls.github.io/indicate/)
88

9-
Transliterations to/from Indian languages are still generally low quality. One problem is access to data. Another is that there is no standard transliteration.
9+
**Indicate** provides high-quality transliteration between Indic languages and English using both traditional TensorFlow models and state-of-the-art LLMs (Large Language Models).
1010

11-
For Hindi--English, we build a novel dataset of names using ESPNcricinfo. For instance, see [here](https://www.espncricinfo.com/hindi/series/pakistan-tour-of-england-2021-1239529/england-vs-pakistan-1st-odi-1239537/full-scorecard) for the Hindi version of the [English scorecard](https://www.espncricinfo.com/series/pakistan-tour-of-england-2021-1239529/england-vs-pakistan-1st-odi-1239537/full-scorecard).
11+
## 🚀 Features
1212

13-
We also create a dataset from [election affidavits](https://affidavit.eci.gov.in/CandidateCustomFilter) and exploit the [Google Dakshina dataset](https://github.com/google-research-datasets/dakshina).
13+
- **🧠 Dual Backend Support**: Choose between TensorFlow models or LLM-based transliteration
14+
- **🌍 Multi-Language**: 12+ Indic languages (Hindi, Tamil, Telugu, Bengali, etc.)
15+
- **🔄 Bidirectional**: Supports both Indic→English and English→Indic transliteration
16+
- **🛡️ Production Ready**: Safe file handling, atomic writes, backup support
17+
- **📊 Structured Output**: Rich JSON format with metadata and error handling
18+
- **⚡ Batch Processing**: Efficient processing of large files with progress tracking
1419

15-
To overcome the fact that there isn't one standard way of transliteration, we provide k-best transliterations.
20+
## 🎯 Supported Languages
21+
22+
Hindi • Tamil • Telugu • Bengali • Gujarati • Kannada • Malayalam • Punjabi • Marathi • Odia • Urdu • Sanskrit ↔ English
1623

1724
## Install
1825

@@ -24,99 +31,179 @@ We strongly recommend installing `indicate` inside a Python virtual environment
2431
pip install indicate
2532
```
2633

27-
## Usage
34+
## 🔧 Quick Setup
2835

29-
### Python API
36+
### For LLM-based transliteration (recommended):
37+
```bash
38+
pip install indicate
3039

31-
```python
32-
from indicate import transliterate
33-
english_translated = transliterate.hindi2english("हिंदी")
34-
print(english_translated)
35-
# Output: hindi
40+
# Set your API key (choose one):
41+
export OPENAI_API_KEY=your-key
42+
export ANTHROPIC_API_KEY=your-key
43+
export GOOGLE_API_KEY=your-key
44+
```
45+
46+
### For TensorFlow-only usage:
47+
```bash
48+
pip install indicate
49+
# No API key needed - uses pre-trained models
3650
```
3751

38-
### Command Line Interface
52+
## 🎯 Usage
3953

40-
The package provides both modern and legacy CLI interfaces:
54+
### 🧠 LLM-Based Transliteration (New!)
4155

42-
#### Modern CLI (Recommended)
56+
The LLM backend aims for higher accuracy and covers every language listed under Supported Languages:
4357

4458
```bash
45-
# Basic usage
59+
# Simple transliteration (auto-detects Hindi)
60+
indicate llm "राजशेखर चिंतालपति"
61+
# Output: Rajashekar Chintalapati
62+
63+
# Specify languages explicitly
64+
indicate llm "முருகன்" --source tamil --target english
65+
# Output: Murugan
66+
67+
# Between Indic languages
68+
indicate llm "नमस्ते" --source hindi --target tamil
69+
# Output: நமஸ்தே
70+
71+
# Safe batch processing with structured JSON output
72+
indicate llm --input names.txt --output results.json --format json --batch --backup
73+
74+
# Dry run to preview changes
75+
indicate llm --input large_file.txt --dry-run
76+
```
77+
78+
**Python API:**
79+
```python
80+
from indicate import IndicLLMTransliterator
81+
82+
# Initialize for any language pair
83+
transliterator = IndicLLMTransliterator('hindi', 'english')
84+
result = transliterator.transliterate('राजशेखर चिंतालपति')
85+
print(result) # Output: Rajashekar Chintalapati
86+
87+
# Batch processing
88+
texts = ["राजेश", "गौरव", "प्रिया"]
89+
results = transliterator.transliterate_batch(texts)
90+
print(results) # ['Rajesh', 'Gaurav', 'Priya']
91+
```
92+
93+
### 🤖 TensorFlow Backend (Traditional)
94+
95+
```bash
96+
# Hindi to English using TensorFlow model
4697
indicate hindi2english "राजशेखर चिंतालपति"
98+
# Output: rajashekar chintalapati
4799

48100
# From file
49101
indicate hindi2english --input hindi.txt --output english.txt
50102

51-
# From stdin
52-
echo "गौरव सूद" | indicate hindi2english
53-
54-
# Batch processing for large files
55-
indicate hindi2english --input large_file.txt --batch --quiet
103+
# Batch processing
104+
indicate hindi2english --input large_file.txt --batch
105+
```
56106

57-
# Get help
58-
indicate hindi2english --help
107+
**Python API:**
108+
```python
109+
from indicate import hindi2english
110+
result = hindi2english("हिंदी")
111+
print(result) # Output: hindi
112+
```
59113

60-
# Package information
61-
indicate info
114+
## 📊 JSON Output Format
115+
116+
The LLM backend provides rich, structured output perfect for data processing:
117+
118+
```json
119+
{
120+
"metadata": {
121+
"source_language": "hindi",
122+
"target_language": "english",
123+
"timestamp": "2024-12-09T12:00:00Z",
124+
"total_lines": 3,
125+
"successful_lines": 3,
126+
"failed_lines": 0,
127+
"encoding": "utf-8"
128+
},
129+
"results": [
130+
{
131+
"line_number": 1,
132+
"input_text": "राजेश कुमार",
133+
"output_text": "Rajesh Kumar",
134+
"source_lang": "hindi",
135+
"target_lang": "english",
136+
"confidence": "high",
137+
"processing_time": 1.2,
138+
"timestamp": "2024-12-09T12:00:01Z"
139+
}
140+
]
141+
}
62142
```
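A results file in this shape can be consumed with the standard library alone. The sketch below is an illustration, not part of `indicate`: the helper name `summarize` is hypothetical, the sample document copies only a subset of the fields shown above, and the check that `successful_lines + failed_lines == total_lines` is an assumption about how the counters relate.

```python
import json

# Sample document mirroring a subset of the schema shown in the README diff.
SAMPLE = """
{
  "metadata": {
    "source_language": "hindi",
    "target_language": "english",
    "total_lines": 1,
    "successful_lines": 1,
    "failed_lines": 0
  },
  "results": [
    {
      "line_number": 1,
      "input_text": "राजेश कुमार",
      "output_text": "Rajesh Kumar",
      "confidence": "high",
      "processing_time": 1.2
    }
  ]
}
"""


def summarize(document: str) -> dict:
    """Hypothetical helper: pull the useful parts out of a results document."""
    data = json.loads(document)
    meta = data["metadata"]
    results = data["results"]
    return {
        # Map each input string to its transliteration.
        "pairs": {r["input_text"]: r["output_text"] for r in results},
        # Line numbers whose confidence is anything other than "high".
        "needs_review": [
            r["line_number"] for r in results if r.get("confidence") != "high"
        ],
        # Assumption: the success and failure counters add up to the total.
        "counts_consistent": (
            meta["successful_lines"] + meta["failed_lines"] == meta["total_lines"]
        ),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Keying the review list on `line_number` makes it easy to go back to the original input file for the entries that need a second look.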
63143

64-
#### Legacy CLI (Backward Compatibility)
144+
## 🛡️ Safety Features
145+
146+
- **🔒 Input/Output Validation**: Prevents accidental file overwrites
147+
- **⚛️ Atomic Writing**: Safe file operations using temporary files
148+
- **💾 Automatic Backups**: Optional timestamped backups of existing files
149+
- **🔄 Resume Support**: Resume interrupted batch operations
150+
- **👁️ Dry Run Mode**: Preview operations before execution
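The diff does not show how these features are implemented. As a minimal sketch of the general technique behind "atomic writing" with an optional timestamped backup, assuming a hypothetical helper named `atomic_write` and a `.bak` naming scheme that are not taken from `indicate`:

```python
import os
import shutil
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def atomic_write(path: Path, text: str, backup: bool = False) -> Optional[Path]:
    """Hypothetical helper: replace `path` with `text` in one step.

    Returns the backup path if a backup was made, otherwise None.
    """
    path = Path(path)
    backup_path = None
    if backup and path.exists():
        # Assumed naming scheme: <name>.<timestamp>.bak next to the original.
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        backup_path = path.with_name(f"{path.name}.{stamp}.bak")
        shutil.copy2(path, backup_path)

    # Create the temp file in the target directory so the final rename
    # stays on one filesystem, which is what makes os.replace atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Never leave a half-written temp file behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return backup_path
```

Readers of the target file see either the old contents or the new contents, never a partial write, because the only operation that touches the final path is the rename.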
151+
152+
## 🎛️ Advanced Usage
65153

66154
```bash
67-
# Still supported for backward compatibility
68-
hindi2english --type hin2eng --input "हिंदी"
69-
```
155+
# Show few-shot examples being used
156+
indicate llm --show-examples --source bengali --target english
70157

71-
## Functions
158+
# Resume interrupted batch job
159+
indicate llm --input large_file.txt --output results.txt --resume
72160

73-
We expose 1 function, which will take Hindi text and transliterate it to English.
161+
# Use specific LLM provider/model
162+
indicate llm "text" --provider anthropic --model claude-3-opus
74163

75-
- **transliterate.hindi2english(input)**
76-
- What it does: Converts the given Hindi text into the English alphabet
77-
- Output: Returns text in English
164+
# Process JSON from previous results
165+
indicate llm --input results.json --source english --target hindi
166+
```
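How `--resume` tracks progress is not shown in this diff. One plausible approach, sketched here under that assumption, is to key finished work by input line number and skip anything already recorded. The function `process_with_resume` and its arguments are hypothetical; the `transliterate` callable stands in for whichever backend is used.

```python
from typing import Callable, Dict, Iterable


def process_with_resume(
    lines: Iterable[str],
    completed: Dict[int, str],
    transliterate: Callable[[str], str],
) -> Dict[int, str]:
    """Hypothetical helper: transliterate only the lines not yet completed.

    `completed` maps 1-based line numbers to outputs from an earlier run.
    """
    results = dict(completed)  # copy so the caller's checkpoint is untouched
    for number, text in enumerate(lines, start=1):
        if number in results:
            continue  # already done in a previous run
        results[number] = transliterate(text)
    return results


if __name__ == "__main__":
    # str.upper is a stand-in for a real transliteration call.
    earlier = {1: "RAJESH"}
    print(process_with_resume(["rajesh", "gaurav", "priya"], earlier, str.upper))
```

Keying on line numbers rather than on the text itself keeps duplicate input lines distinct, at the cost of requiring that the input file is unchanged between runs.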
167+
168+
## 🔄 Backend Comparison
78169

79-
## Testing Locally
170+
| Feature | TensorFlow Backend | LLM Backend |
171+
|---------|------------------|-------------|
172+
| **Languages** | Hindi ↔ English only | 12+ Indic languages ↔ English + Inter-Indic |
173+
| **Setup** | No API key needed | Requires LLM API key |
174+
| **Speed** | Very fast (local) | Moderate (API calls) |
175+
| **Accuracy** | Good for common words | Excellent for all types |
176+
| **Cost** | Free | Pay per API call |
177+
| **Offline** | ✅ Works offline | ❌ Requires internet |
178+
| **Batch Processing** | ✅ | ✅ with safety features |
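The trade-offs in this table suggest a simple selection rule: use the LLM backend when an API key is configured, and otherwise use the local TensorFlow model where it applies. The sketch below is an assumption-laden illustration, not `indicate` behavior. The environment variable names come from the Quick Setup section; the function `choose_backend`, the returned labels, and the provider names other than `anthropic` are hypothetical. It treats only Hindi to English as covered by TensorFlow, since that is the only direction the README demonstrates.

```python
import os
from typing import Mapping, Optional

# Environment variables listed in Quick Setup; provider labels are illustrative.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def choose_backend(
    source: str, target: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Hypothetical helper: pick a backend label for a language pair."""
    env = os.environ if env is None else env
    # First provider whose API key is set and non-empty, if any.
    provider = next(
        (name for name, var in PROVIDER_KEYS.items() if env.get(var)), None
    )
    if provider is not None:
        return f"llm:{provider}"
    # Without a key, only the local Hindi-to-English model is assumed usable.
    if (source.lower(), target.lower()) == ("hindi", "english"):
        return "tensorflow"
    raise RuntimeError(
        f"No API key set and no local model assumed for {source} -> {target}"
    )


if __name__ == "__main__":
    print(choose_backend("hindi", "english", env={}))
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.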
80179

81-
To test the package locally, follow these steps:
180+
## 🧪 Testing Locally
82181

83-
1. **Clone the repository**:
182+
1. **Clone and install**:
84183
```bash
85184
git clone https://github.com/in-rolls/indicate.git
86185
cd indicate
186+
uv sync # or pip install -e .
87187
```
88188

89-
2. **Install with uv (recommended)**:
189+
2. **Run tests**:
90190
```bash
91-
uv sync
92-
```
191+
# All tests
192+
python -m pytest
93193

94-
Or with pip:
95-
```bash
96-
python -m venv venv
97-
source venv/bin/activate # On Windows: venv\Scripts\activate
98-
pip install -e .
194+
# Specific tests
195+
python -m pytest tests/test_llm_indic.py
196+
python -m pytest tests/test_file_safety.py
99197
```
100198

101-
3. **Run tests**:
199+
3. **Test both backends**:
102200
```bash
103-
# Run all tests
104-
python -m unittest discover tests/
105-
106-
# Run specific test
107-
python -m unittest tests.test_010_hindi_translate
108-
```
109-
110-
4. **Test the transliteration**:
111-
```bash
112-
# Modern CLI
201+
# TensorFlow backend
113202
indicate hindi2english "हिंदी"
114203

115-
# Legacy CLI
116-
hindi2english --type hin2eng --input "हिंदी"
117-
118-
# Python usage
119-
python -c "from indicate import transliterate; print(transliterate.hindi2english('हिंदी'))"
204+
# LLM backend (set API key first)
205+
export OPENAI_API_KEY=your-key
206+
indicate llm "हिंदी"
120207
```
121208

122209
## Data
