Official repository associated with the MSR24 submission "CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code".
The first part of this README describes how to reproduce the data mining procedure of the paper; the second part explains how to perform the code change analysis.
The repository is still under construction as we are preparing the release of a second version of the dataset.
You can download the initial version of CodeLL using the following link: https://zenodo.org/records/10248423.
Each .tar.gz file contains a .jsonl file for a given repository. One line of a .jsonl file contains the following information:
{
  "repository": str,
  "branch": str,
  "prev_branch": str,
  "date": str,
  "id": integer,
  "imports": List[str],
  "methods": List[{
    "id": integer,
    "class": str,
    "name": str,
    "params": List[str],
    "content": str,
    "function_calls": List[{
      "id": integer,
      "expression": str,
      "start_offset": integer,
      "end_offset": integer,
      "mapping": integer | "added" | "removed"
    }],
    "mapping": integer | "added" | "removed"
  }],
  "mapping": integer | "added" | "removed"
}

It is possible to leverage our dataset for diverse downstream tasks and applications. We will release a script to generate training and test samples for selected downstream tasks in the future.
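For a quick look at the data, the following minimal sketch extracts the .jsonl file from one of the downloaded .tar.gz archives and prints the methods recorded for each snapshot. The archive name is a placeholder for any repository archive from the Zenodo record:

```python
import json
import tarfile

# Path to one of the released archives; the file name is a placeholder.
archive_path = "some_repository.tar.gz"

with tarfile.open(archive_path, "r:gz") as archive:
    for member in archive.getmembers():
        if not member.name.endswith(".jsonl"):
            continue
        with archive.extractfile(member) as jsonl_file:
            # Each line of the .jsonl file is one snapshot (release) of the repository.
            for line in jsonl_file:
                snapshot = json.loads(line)
                print(snapshot["repository"], snapshot["branch"], snapshot["date"])
                for method in snapshot["methods"]:
                    # "mapping" links an entry to its counterpart in the previous
                    # snapshot, or marks it as "added" / "removed".
                    print("  ", method["class"], method["name"], method["mapping"])
```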
You can replicate the data mining procedure on your own set of repositories archived on Software Heritage by cloning this repository.
Use the swh_miner.py script to retrieve all the branches associated with the repositories.
We use a regex to retrieve only branches whose names contain a version number (see the version_regex variable in swh_miner.py).
The input of the script is a .txt file, where a line is a repository URL (e.g., https://pypi.org/project/gensim/).
python swh_miner.py --input_file ./dataset/v1/3k_python_dataset_filtered.txt

The script outputs a download_data.csv file in the input file's directory. Each line of that .csv file contains the information needed to download a branch's content.
We recommend manually filtering the branches to avoid duplicates and to exclude non-release branches such as pull requests; a possible starting point is sketched below. We will provide a script to automatically perform this filtering in the future.
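If you prefer to script part of this filtering, the minimal sketch below keeps only release-like branch names and drops exact duplicate rows. The regex and the "branch" column name are assumptions and should be adapted to your download_data.csv and to the version_regex used in swh_miner.py:

```python
import csv
import re

# Assumed pattern for release-like branch names (e.g., "1.2.3"); adapt as needed.
release_pattern = re.compile(r"\d+\.\d+(\.\d+)?$")

with open("download_data.csv", newline="") as src, \
        open("download_data_filtered.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    seen = set()
    for row in reader:
        key = tuple(row.values())
        # Keep release-like branches and skip rows we have already written.
        if release_pattern.search(row["branch"]) and key not in seen:
            seen.add(key)
            writer.writerow(row)
```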
Use the download_repos.py script to download the content of each release.
It takes the download_data.csv file and an output directory as input.
python download_repos.py \
--download_data_fp ./dataset/v1/download_data.csv \
--output_dir ./dataset/v1/pypi/data

The script creates one folder per repository and one subfolder per release.
Each release folder contains a data.tar.gz file containing the release content.
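To sanity-check the download, the short sketch below walks the output directory and counts, for each repository folder, how many release subfolders contain a data.tar.gz file. The path mirrors the command above and is a placeholder:

```python
from pathlib import Path

data_dir = Path("./dataset/v1/pypi/data")

# One folder per repository, one subfolder per release, each holding a data.tar.gz.
for repo_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
    releases = sorted(p for p in repo_dir.iterdir() if p.is_dir())
    downloaded = sum((release / "data.tar.gz").exists() for release in releases)
    print(f"{repo_dir.name}: {downloaded}/{len(releases)} releases downloaded")
```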
Use the extract_data.sh shell script to extract the data.tar.gz archives.
bash extract_data.sh ./dataset/v1/data ./dataset/v1/data

The first program argument is the path to the data, and the second one is the output directory.
The script only extracts .py and requirements.txt files for efficiency reasons.
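If you prefer not to use the shell script, the following sketch performs an equivalent selective extraction for a single data.tar.gz archive, keeping only .py and requirements.txt files. The paths are placeholders:

```python
import tarfile
from pathlib import Path

# Placeholder paths; point them at one of the downloaded release folders.
archive_path = Path("./dataset/v1/data/some_repo/some_release/data.tar.gz")
output_dir = archive_path.parent

def keep(member: tarfile.TarInfo) -> bool:
    # Mirror the shell script: keep only Python sources and requirements files.
    return member.name.endswith(".py") or member.name.endswith("requirements.txt")

with tarfile.open(archive_path, "r:gz") as archive:
    members = [m for m in archive.getmembers() if keep(m)]
    archive.extractall(path=output_dir, members=members)
```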
Use the data_generator.py script to perform the code change analysis and generate the .jsonl files:
python data_generator.py \
--data_dir ./dataset/v1/data \
--download_data_fp ./dataset/v1/download_data.csv \
--output_dir ./dataset/v1/data

Currently, the code change analysis script compares two contiguous releases of a repository.
However, it is possible to edit data_generator.py to compare two specific releases instead, for instance. We will also release a more configurable version of data_generator.py in the future.
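As an illustration of the default pairing, the sketch below enumerates contiguous release pairs of a repository. It is not the implementation used in data_generator.py, and ordering the releases by folder name is an assumption:

```python
from pathlib import Path

repo_dir = Path("./dataset/v1/data/some_repo")  # placeholder repository folder

# Sorting by folder name is an assumption; the actual release order may come
# from download_data.csv instead.
releases = sorted(p for p in repo_dir.iterdir() if p.is_dir())

# Pair each release with its immediate successor (contiguous releases).
for prev_release, next_release in zip(releases, releases[1:]):
    print(f"compare {prev_release.name} -> {next_release.name}")
    # data_generator.py would diff the methods of these two snapshots here.
```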