
Commit 17eb9e9: add examples
1 parent 75d8fb1 commit 17eb9e9

File tree: 6 files changed (+1291, -2 lines)


examples/README.md
Lines changed: 115 additions & 2 deletions
@@ -1,4 +1,4 @@
-# Examples
+# Indicate Examples

This directory contains practical examples for using the `indicate` package.

@@ -36,6 +36,45 @@ Shows production-ready file handling:
```bash
python examples/file_processing.py
```

### 📊 `pandas_usage.py`

**DataFrame processing with pandas** (reference implementation):
- Process entire DataFrame columns
- Batch processing for efficiency
- Multiple column transliteration
- Error handling and progress tracking
- Optimal settings for large datasets

**Note**: This is a reference implementation. Copy and adapt the `transliterate_dataframe()` function to your needs.

**Run it:**
```bash
# Install dependencies
pip install pandas tqdm

# Set API key and run
export OPENAI_API_KEY=your-key
python examples/pandas_usage.py
```
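For orientation, here is a rough sketch of what a `transliterate_dataframe()` helper along these lines might look like. This is a hypothetical outline, not the shipped `pandas_usage.py` code: it assumes a transliterator object exposing `transliterate_batch(list of str) -> list of str` (as used in the Quick Examples below) and passes it in explicitly, since the constructor API is not shown in this diff.

```python
# Hypothetical sketch of a DataFrame helper; not the shipped pandas_usage.py code.
# Assumes `trans` exposes transliterate_batch(list[str]) -> list[str].
import time

import pandas as pd
from tqdm import tqdm

def transliterate_dataframe(df, source_column, target_column, trans,
                            batch_size=100, delay=0.05):
    texts = df[source_column].fillna("").astype(str).tolist()
    results = []
    for start in tqdm(range(0, len(texts), batch_size), desc="Transliterating"):
        batch = texts[start:start + batch_size]
        try:
            results.extend(trans.transliterate_batch(batch))
        except Exception:
            results.extend(batch)  # keep the originals so row alignment is preserved
        time.sleep(delay)  # small pause to stay under API rate limits
    out = df.copy()
    out[target_column] = results
    return out
```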
### 💾 `large_dataset_with_checkpoints.py`

**Processing large datasets with reliability features**:
- Automatic checkpointing to Parquet files
- Resume from interruptions
- Progress saving and recovery
- Emergency saves on Ctrl+C
- Parallel processing strategies

Perfect for datasets with hundreds of thousands of entries where processing might take hours.

**Run it:**
```bash
# Install dependencies
pip install pandas tqdm pyarrow

# Process large dataset
python examples/large_dataset_with_checkpoints.py
```
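The checkpoint/resume pattern the example relies on can be sketched roughly as follows. This is an illustrative outline rather than the shipped code: it assumes a `process_batch(list of str) -> list of str` callable and writes partial results to a Parquet file (hence the pyarrow dependency) every `save_every` batches, so an interrupted run can pick up where it left off.

```python
# Illustrative checkpointing loop (an assumed sketch, not the shipped example):
# persist partial results to Parquet every `save_every` batches and resume from them.
import os

import pandas as pd

def transliterate_with_checkpoints(df, source_column, process_batch,
                                   checkpoint_path="progress.parquet",
                                   batch_size=100, save_every=50, resume=True):
    texts = df[source_column].astype(str).tolist()
    results = []
    if resume and os.path.exists(checkpoint_path):
        results = pd.read_parquet(checkpoint_path)["result"].tolist()  # resume partial progress
    batches_done = 0
    for start in range(len(results), len(texts), batch_size):
        results.extend(process_batch(texts[start:start + batch_size]))
        batches_done += 1
        if batches_done % save_every == 0:
            pd.DataFrame({"result": results}).to_parquet(checkpoint_path)  # checkpoint
    pd.DataFrame({"result": results}).to_parquet(checkpoint_path)  # final save
    out = df.copy()
    out["transliterated"] = results
    return out
```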
## Quick Examples

### Simple CLI Usage
@@ -71,6 +110,36 @@ results = trans.transliterate_batch(texts)
```python
print(results)  # ['Rajesh', 'Gaurav', 'Priya']
```

### DataFrame Processing (from pandas_usage.py)

```python
import pandas as pd
# Copy the transliterate_dataframe function from pandas_usage.py

df = pd.read_csv('your_data.csv')
result = transliterate_dataframe(
    df,
    source_column='hindi_text',
    target_column='english_text',
    batch_size=100
)
```

### Large Dataset with Checkpoints

```python
# Use for datasets that take a long time to process
from large_dataset_with_checkpoints import transliterate_with_checkpoints

result = transliterate_with_checkpoints(
    df,
    source_column='text',
    checkpoint_path='progress.parquet',
    save_every=50,  # Save every 50 batches
    resume=True     # Resume if interrupted
)
```

## Prerequisites

### For LLM Examples
@@ -84,6 +153,12 @@ export GOOGLE_API_KEY=your-google-key
### For TensorFlow Examples
No setup needed - uses pre-trained models.

### For DataFrame Examples
```bash
pip install pandas tqdm          # Basic DataFrame processing
pip install pandas tqdm pyarrow  # With checkpointing support
```

## Supported Language Pairs

### LLM Backend (Full Support)
@@ -104,15 +179,38 @@ No setup needed - uses pre-trained models.
### TensorFlow Backend
- **Hindi** → English only

## Performance Guidelines

### For Large Datasets (300k+ words)

1. **Batch Size**: Use 100-200 for optimal API efficiency
2. **Delay**: Set to 0.05-0.1 seconds to avoid rate limits
3. **Checkpointing**: Save every 50-100 batches for safety
4. **Model Selection**:
   - `gpt-4o`: Best quality
   - `gpt-3.5-turbo`: Cost-effective
   - `claude-3-haiku`: Fast and cheap
5. **Parallel Processing**: Split large datasets and process in parallel (see the sketch below)

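A minimal sketch of the parallel strategy in item 5, assuming each chunk can be transliterated independently; the `process_chunk()` helper here is a placeholder, not part of the package:

```python
# Hypothetical parallel-processing sketch: split the DataFrame into chunks and
# transliterate them concurrently, then reassemble in the original order.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Placeholder: call transliterate_dataframe() (or similar) on this chunk.
    return chunk

def transliterate_in_parallel(df: pd.DataFrame, n_workers: int = 4) -> pd.DataFrame:
    chunks = np.array_split(df, n_workers)        # roughly equal-sized pieces
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        processed = list(pool.map(process_chunk, chunks))
    return pd.concat(processed).sort_index()      # restore original row order
```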
### Expected Performance

For 300,000 words:
- **Batch size 200**: ~1,500 API calls
- **Time**: 5-10 minutes (depends on API)
- **Cost**: $15-30 (depends on model)
- **With checkpointing**: Safe to interrupt and resume

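A quick back-of-the-envelope check of those figures (the per-call latency here is an assumption, not a measurement):

```python
# Rough estimate of API-call count and wall-clock time for a batched run.
words = 300_000
batch_size = 200
seconds_per_call = 0.3  # assumed average round trip, including the 0.05-0.1 s delay

api_calls = words // batch_size              # 1,500 calls
minutes = api_calls * seconds_per_call / 60  # ~7.5 minutes, within the 5-10 minute range
print(f"{api_calls} calls, ~{minutes:.1f} minutes")
```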
## File Formats

### Input
- **Text files**: Plain UTF-8 text, one item per line
- **JSON files**: Structured format from previous results
- **CSV files**: For DataFrame processing

### Output
- **Text format**: Plain transliterated text
- **JSON format**: Rich metadata with error handling
- **Parquet format**: For checkpointing and large datasets

```json
{
```
@@ -142,4 +240,19 @@ All examples demonstrate:
- **Atomic writing** - no partial/corrupted files (sketched below)
- **Automatic backups** - protects existing data
- **Error recovery** - handles API failures gracefully
- **Progress tracking** - resume interrupted operations
- **Checkpointing** - save progress for long-running tasks

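One common way to combine atomic writes with automatic backups (a generic pattern, not necessarily the exact code these examples use): write to a temporary file in the same directory, then atomically replace the target.

```python
# Generic atomic-write-with-backup pattern (illustrative, not the examples' exact code).
import os
import shutil
import tempfile

def atomic_write(path: str, data: str) -> None:
    if os.path.exists(path):
        shutil.copy2(path, path + ".bak")          # keep a backup of the existing file
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)  # temp file on the same filesystem
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(data)
        os.replace(tmp_path, path)                 # atomic swap: no partial files
    except BaseException:
        os.remove(tmp_path)
        raise
```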
## Important Notes

1. The DataFrame functions (`pandas_usage.py`) are **reference implementations** - not part of the core package
2. Always test with a small sample first
3. Monitor your API usage and costs
4. Use checkpointing for any dataset that takes >5 minutes
5. These examples are designed to be copied and adapted to your needs

## Support

For issues or questions:
- GitHub: https://github.com/in-rolls/indicate
- Documentation: See main README.md
