We provide the exact scripts to reproduce our results for SkyRL-SQL-7B. The implementation is based on the SearchR1 implementation of the agent loop.
Make sure to have followed the installation commands in INSTALL.md.
Start Ray on your cluster following the guide: https://docs.ray.io/en/latest/ray-core/starting-ray.html.
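As a rough sketch of what the linked guide describes, a manually started cluster has one head node and any number of workers that join it. The function names below are hypothetical wrappers added for illustration; `ray start --head`, `--port`, and `--address` are standard Ray CLI options, and 6379 is Ray's default head port. Your cluster may need a different setup (for example a cluster launcher or a job scheduler).

```shell
# Hypothetical wrapper: start the Ray head node on this machine.
# The optional first argument overrides the default port (6379).
start_ray_head() {
    ray start --head --port="${1:-6379}"
}

# Hypothetical wrapper: join this machine to an existing cluster.
# The first argument is the head node address in host:port form.
join_ray_worker() {
    ray start --address="$1"
}
```

Run `start_ray_head` once on the head node, then `join_ray_worker <head_ip>:6379` on each worker. `ray status` reports the nodes that have joined.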
We provide the dataset we used on HuggingFace: https://huggingface.co/datasets/NovaSky-AI/SkyRL-SQL-653-data
Make sure to download the dataset and update the path in DATA_PATH in the script.
huggingface-cli download NovaSky-AI/SkyRL-SQL-653-data --local-dir <path_to_dir> --repo-type dataset
We use a .env.sql file to pass environment variables to all the processes created by Ray. Make sure to set WANDB_API_KEY (other env vars like ALLHANDS_API_KEY are only used for the Swe-Bench task).
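A missing key in .env.sql typically only surfaces once the Ray workers are already running, so a quick check before launching can save time. The sketch below assumes the file uses plain `KEY=value` lines (optionally prefixed with `export`); that format is an assumption, and `env_file_has_var` is a hypothetical helper, not part of this repository.

```shell
# Hypothetical helper: succeed only if FILE exists and contains a
# non-empty assignment of the form VAR=value (optionally "export VAR=value").
env_file_has_var() {
    local file="$1" var="$2"
    [ -f "$file" ] || return 1
    grep -Eq "^(export[[:space:]]+)?${var}=.+" "$file"
}

# Example usage (file and variable names are taken from this guide):
# env_file_has_var .env.sql WANDB_API_KEY || echo "WANDB_API_KEY is not set in .env.sql" >&2
```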
Make sure to set up the database files needed for training. We use the databases from OmniSQL.
You can download the datasets from the seeklhy/OmniSQL-datasets repository on HuggingFace.
The datasets include BIRD, Spider, ScienceBenchmark, EHRSQL, Spider2-SQLite, Spider-DK, Spider-Realistic, Spider-Syn, and SynSQL-2.5M. In our training pipeline, we only need to access databases from SynSQL-2.5M and Spider.
Unzip data.zip in this folder, and set the corresponding DB_PATH in the training script below. For example, you can download and unzip the data by running:
huggingface-cli download seeklhy/OmniSQL-datasets data.zip --repo-type dataset --local-dir <path_to_dir>
unzip <path_to_dir>/data.zip
We provide a script run_skyrl_sql.sh for reproducing our results for SkyRL-SQL-7B.
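Because the script depends on both DATA_PATH and DB_PATH pointing to existing directories, it can help to verify them before launching. The following is a minimal sketch; `require_dir` is a hypothetical helper and is not part of run_skyrl_sql.sh.

```shell
# Hypothetical pre-flight helper: fail early with a readable message
# if the directory assigned to a named setting does not exist.
require_dir() {
    local name="$1" path="$2"
    if [ ! -d "$path" ]; then
        echo "error: $name does not point to a directory: $path" >&2
        return 1
    fi
}

# Example usage before launching the training script
# (replace the placeholders with the paths you set in the script):
# require_dir DATA_PATH "<path_to_dir>" && require_dir DB_PATH "<path_to_dir>"
```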
To evaluate the model, refer to our fork of the OmniSQL repo.