| 1 | +# Indicate: Transliterate Indic Languages to English |
| 2 | + |
| 8 | + |
| 9 | +Transliterations to/from Indian languages are still generally low quality. One problem is access to data. Another is that there is no standard transliteration. |
| 10 | + |
| 11 | +For Hindi-English, we build a novel dataset of names from ESPNcricinfo. For instance, compare the [Hindi version](https://www.espncricinfo.com/hindi/series/pakistan-tour-of-england-2021-1239529/england-vs-pakistan-1st-odi-1239537/full-scorecard) of a scorecard with the [English scorecard](https://www.espncricinfo.com/series/pakistan-tour-of-england-2021-1239529/england-vs-pakistan-1st-odi-1239537/full-scorecard). |
| 12 | + |
| 13 | +We also create a dataset from [election affidavits](https://affidavit.eci.gov.in/CandidateCustomFilter) and use the [Google Dakshina dataset](https://github.com/google-research-datasets/dakshina). |
| 14 | + |
| 15 | +Because there is no single standard way to transliterate, we provide k-best transliterations. |
| 16 | + |
| 17 | +## Install |
| 18 | + |
| 19 | +We strongly recommend installing `indicate` inside a Python virtual environment (see the [venv documentation](https://docs.python.org/3/library/venv.html#creating-virtual-environments)). |
| 20 | + |
| 21 | +**Requirements:** Python 3.10 or higher |
| 22 | + |
| 23 | +```bash |
| 24 | +pip install indicate |
| 25 | +``` |
| 26 | + |
| 27 | +## Usage |
| 28 | + |
| 29 | +### Python API |
| 30 | + |
| 31 | +```python |
| 32 | +from indicate import transliterate |
| 33 | +english_transliterated = transliterate.hindi2english("हिंदी") |
| 34 | +print(english_transliterated) |
| 35 | +# Output: hindi |
| 36 | +``` |
| 37 | + |
| 38 | +### Command Line Interface |
| 39 | + |
| 40 | +The package provides both a modern and a legacy command-line interface: |
| 41 | + |
| 42 | +#### Modern CLI (Recommended) |
| 43 | + |
| 44 | +```bash |
| 45 | +# Basic usage |
| 46 | +indicate hindi2english "राजशेखर चिंतालपति" |
| 47 | + |
| 48 | +# From file |
| 49 | +indicate hindi2english --input hindi.txt --output english.txt |
| 50 | + |
| 51 | +# From stdin |
| 52 | +echo "गौरव सूद" | indicate hindi2english |
| 53 | + |
| 54 | +# Batch processing for large files |
| 55 | +indicate hindi2english --input large_file.txt --batch --quiet |
| 56 | + |
| 57 | +# Get help |
| 58 | +indicate hindi2english --help |
| 59 | + |
| 60 | +# Package information |
| 61 | +indicate info |
| 62 | +``` |
| 63 | + |
| 64 | +#### Legacy CLI (Backward Compatibility) |
| 65 | + |
| 66 | +```bash |
| 67 | +# Still supported for backward compatibility |
| 68 | +hindi2english --type hin2eng --input "हिंदी" |
| 69 | +``` |
| 70 | + |
| 71 | +## Functions |
| 72 | + |
| 73 | +We expose one function, which takes Hindi text and transliterates it into English. |
| 74 | + |
| 75 | +- **transliterate.hindi2english(input)** |
| 76 | + - What it does: Converts the given Hindi text into the English alphabet |
| 77 | + - Output: Returns the text in English |
| 78 | + |
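For files with many repeated names, it can help to call the function once per distinct line. The sketch below is illustrative and not part of the package: `transliterate_lines` is a hypothetical helper, and it assumes only what the usage example shows, namely that the transliteration function takes one string and returns one string.

```python
from typing import Callable, Dict, Iterable, List


def transliterate_lines(
    lines: Iterable[str],
    transliterate_fn: Callable[[str], str],
) -> List[str]:
    """Apply a string-to-string transliteration function to each non-empty line.

    Hypothetical helper: results are cached so that a repeated line is
    transliterated only once.
    """
    cache: Dict[str, str] = {}
    results: List[str] = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip blank lines
        if text not in cache:
            cache[text] = transliterate_fn(text)
        results.append(cache[text])
    return results
```

With the package installed, one would pass `transliterate.hindi2english` as `transliterate_fn` and an open text file (or any list of strings) as `lines`.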
| 79 | +## Testing Locally |
| 80 | + |
| 81 | +To test the package locally, follow these steps: |
| 82 | + |
| 83 | +1. **Clone the repository**: |
| 84 | + ```bash |
| 85 | + git clone https://github.com/in-rolls/indicate.git |
| 86 | + cd indicate |
| 87 | + ``` |
| 88 | + |
| 89 | +2. **Install with uv (recommended)**: |
| 90 | + ```bash |
| 91 | + uv sync |
| 92 | + ``` |
| 93 | + |
| 94 | + Or with pip: |
| 95 | + ```bash |
| 96 | + python -m venv venv |
| 97 | + source venv/bin/activate # On Windows: venv\Scripts\activate |
| 98 | + pip install -e . |
| 99 | + ``` |
| 100 | + |
| 101 | +3. **Run tests**: |
| 102 | + ```bash |
| 103 | + # Run all tests |
| 104 | + python -m unittest discover tests/ |
| 105 | + |
| 106 | + # Run specific test |
| 107 | + python -m unittest tests.test_010_hindi_translate |
| 108 | + ``` |
| 109 | + |
| 110 | +4. **Test the transliteration**: |
| 111 | + ```bash |
| 112 | + # Modern CLI |
| 113 | + indicate hindi2english "हिंदी" |
| 114 | + |
| 115 | + # Legacy CLI |
| 116 | + hindi2english --type hin2eng --input "हिंदी" |
| 117 | + |
| 118 | + # Python usage |
| 119 | + python -c "from indicate import transliterate; print(transliterate.hindi2english('हिंदी'))" |
| 120 | + ``` |
| 121 | + |
| 122 | +## Data |
| 123 | + |
| 124 | +The datasets used to train the model: |
| 125 | + |
| 126 | +- [Indian Election affidavits](https://affidavit.eci.gov.in/CandidateCustomFilter) |
| 127 | +- [Google Dakshina dataset](https://github.com/google-research-datasets/dakshina) |
| 128 | +- [ESPNcricinfo](https://www.espncricinfo.com/hindi/series/pakistan-tour-of-england-2021-1239529/england-vs-pakistan-1st-odi-1239537/full-scorecard), for example the Hindi version of the [English scorecard](https://www.espncricinfo.com/series/pakistan-tour-of-england-2021-1239529/england-vs-pakistan-1st-odi-1239537/full-scorecard) |
| 129 | +- [IIT Bombay English-Hindi Corpus](https://www.cfilt.iitb.ac.in/iitb_parallel/) |
| 130 | + |
| 131 | +## Evaluation |
| 132 | + |
| 133 | +The model was evaluated on the test set of the Google Dakshina dataset, where it predicted 73.64% exact matches. |
| 134 | +[Indic-trans](https://github.com/libindic/indic-trans) predicted 63.12% exact matches on the Google Dakshina dataset. |
| 135 | + |
| 136 | +Below are the edit distance metrics on the test dataset (0.0 means an exact match; the farther a value is from 0.0, the more the predicted text differs from the actual text): |
| 137 | + |
| 138 | + |
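To make the two metrics concrete, here is a minimal sketch of how an exact-match percentage and an edit distance can be computed for predicted and reference strings. It is not the evaluation script used for the numbers above: the function names are hypothetical, and it uses plain (unnormalized) Levenshtein distance, which may differ from the variant behind the reported metrics.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def exact_match_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one prediction/reference pair is required")
    matches = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)
```

An edit distance of 0 corresponds to an exact match, so the exact-match percentage is the share of pairs whose distance is 0.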
| 139 | + |
| 140 | +## Authors |
| 141 | + |
| 142 | +Rajashekar Chintalapati and Gaurav Sood |
| 143 | + |
| 144 | +## Contributor Code of Conduct |
| 145 | + |
| 146 | +The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the [Contributor Code of Conduct](http://contributor-covenant.org/version/1/0/0/). |
| 147 | + |
| 148 | +## License |
| 149 | + |
| 150 | +The package is released under the [MIT License](https://opensource.org/licenses/MIT). |