Download the raw dataset from Google Drive: https://drive.google.com/file/d/1GgI-ebyLE1J6rkE6KzpB3e5TkRn2UYty/view?usp=drive_link
Extract raw_dataset.zip in the repository directory.
Install the dependencies:
pip install -r requirements.txt
Run the dataset script:
python make_dataset.py --clean
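The extraction step above can also be scripted instead of done by hand. This is a minimal sketch, assuming raw_dataset.zip has already been downloaded into the current working directory. The helper name `extract_archive` and the default destination `.` are illustrative and not part of this repository.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir and return the member names.

    Hypothetical helper for illustration; it is not shipped with this repository.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


if __name__ == "__main__":
    # Assumed location of the downloaded archive.
    archive_path = Path("raw_dataset.zip")
    if archive_path.exists():
        names = extract_archive(archive_path)
        print(f"Extracted {len(names)} entries")
    else:
        print("raw_dataset.zip not found; download it first")
```

Run it from the directory that contains raw_dataset.zip, then continue with the install and run commands above.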
If you use my code or ideas from my paper in your work, please cite it:
@article{arnob2024indicdialogue,
title={IndicDialogue: A dataset of subtitles in 10 Indic languages for Indic language modeling},
author={Arnob, Noor Mairukh Khan and Faiyaz, A and Fuad, Md Mubtasim and Al Masud, Shah Murtaza Rashid and Das, Baivab and Mridha, MF},
journal={Data in Brief},
volume={55},
pages={110690},
year={2024},
publisher={Elsevier}
}