# Indicate Examples

This directory contains practical examples for using the `indicate` package.

### `file_processing.py`
Shows production-ready file handling.

**Run it:**
```bash
python examples/file_processing.py
```

### 📊 `pandas_usage.py`
**DataFrame processing with pandas** (reference implementation):
- Process entire DataFrame columns
- Batch processing for efficiency
- Multiple column transliteration
- Error handling and progress tracking
- Optimal settings for large datasets

**Note**: This is a reference implementation. Copy and adapt the `transliterate_dataframe()` function to your needs.
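
For orientation, here is a minimal sketch of what such a function might look like. The `Transliterator` import and constructor are assumptions (only `transliterate_batch()` appears in the Quick Examples below); the real `pandas_usage.py` adds error handling and progress tracking on top of this:

```python
import pandas as pd

# Assumed import for illustration; pandas_usage.py shows the real client setup.
from indicate import Transliterator

def transliterate_dataframe(df, source_column, target_column, batch_size=100):
    """Sketch only: transliterate one DataFrame column in batches."""
    trans = Transliterator()  # assumption; adapt to the actual constructor
    texts = df[source_column].astype(str).tolist()
    results = []
    for start in range(0, len(texts), batch_size):
        # Batched calls keep API usage efficient (see Performance Guidelines below)
        results.extend(trans.transliterate_batch(texts[start:start + batch_size]))
    out = df.copy()
    out[target_column] = results
    return out
```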

**Run it:**
```bash
# Install dependencies
pip install pandas tqdm

# Set API key and run
export OPENAI_API_KEY=your-key
python examples/pandas_usage.py
```

### 💾 `large_dataset_with_checkpoints.py`
**Processing large datasets with reliability features**:
- Automatic checkpointing to Parquet files
- Resume from interruptions
- Progress saving and recovery
- Emergency saves on Ctrl+C
- Parallel processing strategies

Perfect for datasets with hundreds of thousands of entries where processing might take hours.
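
The core pattern behind these features is simple: write completed work to Parquet every N batches and, on restart, load the checkpoint and skip what is already done. A rough sketch of that loop (the function name, columns, and placeholder transform are illustrative, not the script's actual API):

```python
import os
import pandas as pd  # Parquet I/O requires pyarrow

def process_with_checkpoints(df, source_column, checkpoint_path,
                             save_every=50, batch_size=100):
    """Sketch of checkpoint/resume; not the real transliterate_with_checkpoints()."""
    if os.path.exists(checkpoint_path):
        done = pd.read_parquet(checkpoint_path)  # resume from saved progress
    else:
        done = pd.DataFrame({source_column: [], "result": []})
    start = len(done)  # assumes results were appended in input order
    try:
        for i, begin in enumerate(range(start, len(df), batch_size)):
            batch = df[source_column].iloc[begin:begin + batch_size]
            results = [text.upper() for text in batch]  # placeholder for the real API call
            done = pd.concat(
                [done, pd.DataFrame({source_column: batch.values, "result": results})],
                ignore_index=True,
            )
            if (i + 1) % save_every == 0:
                done.to_parquet(checkpoint_path)  # periodic checkpoint
    except KeyboardInterrupt:
        done.to_parquet(checkpoint_path)  # emergency save on Ctrl+C
        raise
    done.to_parquet(checkpoint_path)  # final save
    return done
```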

**Run it:**
```bash
# Install dependencies
pip install pandas tqdm pyarrow

# Process large dataset
python examples/large_dataset_with_checkpoints.py
```

## Quick Examples

### Simple CLI Usage

```python
results = trans.transliterate_batch(texts)
print(results)  # ['Rajesh', 'Gaurav', 'Priya']
```

### DataFrame Processing (from pandas_usage.py)

```python
import pandas as pd
# Copy the transliterate_dataframe() function from pandas_usage.py

df = pd.read_csv('your_data.csv')
result = transliterate_dataframe(
    df,
    source_column='hindi_text',
    target_column='english_text',
    batch_size=100
)
```

### Large Dataset with Checkpoints

```python
# Use for datasets that take a long time to process
from large_dataset_with_checkpoints import transliterate_with_checkpoints

result = transliterate_with_checkpoints(
    df,
    source_column='text',
    checkpoint_path='progress.parquet',
    save_every=50,  # Save every 50 batches
    resume=True     # Resume if interrupted
)
```

## Prerequisites

### For LLM Examples

```bash
export OPENAI_API_KEY=your-key
export GOOGLE_API_KEY=your-google-key
```

### For TensorFlow Examples
No setup needed - uses pre-trained models.

### For DataFrame Examples
```bash
pip install pandas tqdm          # Basic DataFrame processing
pip install pandas tqdm pyarrow  # With checkpointing support
```

## Supported Language Pairs

### LLM Backend (Full Support)

### TensorFlow Backend
- **Hindi** → English only

## Performance Guidelines

### For Large Datasets (300k+ words)

1. **Batch Size**: Use 100-200 for optimal API efficiency
2. **Delay**: Set to 0.05-0.1 seconds to avoid rate limits
3. **Checkpointing**: Save every 50-100 batches for safety
4. **Model Selection**:
   - `gpt-4o`: Best quality
   - `gpt-3.5-turbo`: Cost-effective
   - `claude-3-haiku`: Fast and cheap
5. **Parallel Processing**: Split large datasets and process in parallel (see the sketch after this list)
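
As a sketch of point 5, one simple approach is a thread pool over pre-split batches. This assumes the batch call is safe to run concurrently (verify for your backend), and `transliterate_batch` here stands in for whatever callable you use:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_transliterate(texts, transliterate_batch, n_workers=4, batch_size=200):
    """Illustrative only: split input into batches and run them concurrently."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(transliterate_batch, batches))  # order preserved
    return [item for batch in results for item in batch]  # flatten to one list
```

Watch rate limits: more workers means more concurrent requests, so you may need a larger per-call delay to compensate.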

### Expected Performance

For 300,000 words:
- **Batch size 200**: ~1,500 API calls
- **Time**: 5-10 minutes (depends on API)
- **Cost**: $15-30 (depends on model)
- **With checkpointing**: Safe to interrupt and resume
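
The call count is simple arithmetic, and the time estimate implies roughly 0.2-0.4 seconds per call; a quick back-of-envelope check (figures taken from the estimates above, not measurements):

```python
total_words = 300_000
batch_size = 200
api_calls = total_words // batch_size  # 1,500 API calls
minutes = api_calls * 0.3 / 60         # ~7.5 minutes at ~0.3 s per call
print(api_calls, minutes)              # 1500 7.5
```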

## File Formats

### Input
- **Text files**: Plain UTF-8 text, one item per line
- **JSON files**: Structured format from previous results
- **CSV files**: For DataFrame processing

### Output
- **Text format**: Plain transliterated text
- **JSON format**: Rich metadata with error handling
- **Parquet format**: For checkpointing and large datasets

```json
{
  ...
}
```

All examples demonstrate:
- ✅ **Atomic writing** - no partial/corrupted files
- ✅ **Automatic backups** - protects existing data
- ✅ **Error recovery** - handles API failures gracefully
- ✅ **Progress tracking** - resume interrupted operations
- ✅ **Checkpointing** - save progress for long-running tasks

## Important Notes

1. The DataFrame functions (`pandas_usage.py`) are **reference implementations** - not part of the core package
2. Always test with a small sample first
3. Monitor your API usage and costs
4. Use checkpointing for any dataset that takes >5 minutes
5. These examples are designed to be copied and adapted to your needs

## Support

For issues or questions:
- GitHub: https://github.com/in-rolls/indicate
- Documentation: See main README.md