This repository contains the code and replication scripts for IntraTyper, a modified version of DeepTyper (Hellendoorn et al., 2018), a deep learning tool for type inference. DeepTyper is trained and evaluated across a set of different projects, a so-called inter-project setting. In contrast, IntraTyper is trained and evaluated in an intra-project setting, i.e., it is specifically tailored to a single project. As the experimental results show, this setting lets the tool excel at predicting relatively uncommon, project-specific types.
IntraTyper uses the CNTK library, so an environment that supports CNTK is required.
- Execute the bash script `data/cloner.sh`. This will clone all repositories listed in the `data/repo-SHAs.txt` file and reset them to the commit SHAs of 28 February 2018.
- Copy the created `data/Repos` directory and name the copy `data/Repos-cleaned`.
- Run `node CleanRepos.js`. This creates the corresponding tokenized data and type (`*.ttokens`) files in `Repos-cleaned`. Furthermore, it scrapes all user-added type annotations from the source code and stores them in `*.ttokens.pure` files.
- Run `node GetTypes.js`. This creates three directories. In each directory, every line corresponds to one TypeScript file and contains space-separated TypeScript tokens followed by the corresponding space-separated types; a tab separates the source tokens from the type tokens.
  - `outputs-all` contains data in which every identifier is annotated with its inferred type. This is used as training data.
  - `outputs-pure` contains only the real user-added type annotations for the TypeScript code (and `no-type` elsewhere); this is used for evaluation (GOLD data).
  - `outputs-checkjs` contains the TSc+CheckJS inferred types for every identifier. This can be used for comparing performance with TSc+CheckJS.
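The line format above (aligned source and type tokens separated by a tab) can be parsed with a few lines of Python. This is a sketch under the assumption that the two token sequences are aligned one-to-one, which the description implies; the example line and its types are invented for illustration.

```python
def parse_line(line):
    # Split one output line into source tokens and their aligned types.
    # Format assumed: "<tok> <tok> ...\t<type> <type> ..." (one type per token).
    source, types = line.rstrip("\n").split("\t")
    tokens = source.split(" ")
    type_tokens = types.split(" ")
    assert len(tokens) == len(type_tokens), "tokens and types must align"
    return list(zip(tokens, type_tokens))

# Hypothetical example line (not taken from the real data):
example = "var x = 5 ;\tno-type number no-type no-type no-type"
pairs = parse_line(example)
```

Here `pairs` associates each token with its type annotation, e.g. `("x", "number")`.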
- In the following, choose between the `intra-xyz.py` and `inter-xyz.py` scripts, depending on which setting you want to build. Hereafter, for simplicity, the `intra` scripts are used, but they can always be replaced with the `inter` scripts.
- Run `intra_data_split.py` to create an 80% train, 10% valid, and 10% test split, as well as source and target vocabularies. This will also create a txt file listing all the projects/source files chosen for the test split in the inter-project/intra-project setting, respectively.
- Convert the train/valid/test data to CNTK-compatible `.ctf` input files using CNTK's `txt2ctf` script:
```
python txt2ctf.py --map data/source_wl data/target_wl --input data/train.txt --output data/train.ctf
python txt2ctf.py --map data/source_wl data/target_wl --input data/valid.txt --output data/valid.ctf
python txt2ctf.py --map data/source_wl data/target_wl --input data/test.txt --output data/test.ctf
```
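The 80/10/10 split performed by `intra_data_split.py` can be sketched as follows. This is a minimal illustration of the proportions only; the real script's shuffling, file handling, and vocabulary construction may differ.

```python
import random

def split_lines(lines, seed=0):
    # Shuffle the data, then cut it into 80% train / 10% valid / 10% test,
    # mirroring the proportions described above.
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.8)
    n_valid = int(n * 0.1)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_lines(range(100))
```

For 100 input lines this yields 80/10/10 lines, with every input appearing in exactly one of the three sets.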
- Adjust the epoch size in `intra_infer.py` and `intra_evaluation.py` according to the output of `intra_data_split.py` in the line "Overall tokens: [xyz] train".
- Run `intra_infer.py` to train the neural net over 10 epochs.
- Choose the model with the best evaluation error and provide its path to the `model_file` variable in `intra_evaluation.py`.
- Run `intra_evaluation.py` to let the model predict the corresponding types in the test data set. The results are written to a txt file in the `results` directory. The txt file contains four columns, defined as follows:
  `true type | prediction | confidence of prediction | rank of prediction`
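From the four columns above, the overall top-1 accuracy can be computed directly. This sketch assumes `|` is the column separator in the results file, which may not match the actual delimiter; the example rows are invented.

```python
def parse_result_line(line):
    # Split one results line into (true_type, prediction, confidence, rank).
    # Assumes "|" as the column separator (an assumption, see above).
    true, pred, conf, rank = [field.strip() for field in line.split("|")]
    return true, pred, float(conf), int(rank)

def top1_accuracy(rows):
    # A prediction counts as correct when it exactly matches the true type.
    correct = sum(1 for true, pred, _, _ in rows if true == pred)
    return correct / len(rows) if rows else 0.0

# Hypothetical result lines for illustration:
rows = [parse_result_line("number | number | 0.93 | 1"),
        parse_result_line("Promise<void> | any | 0.41 | 3")]
acc = top1_accuracy(rows)
```

The rank column also allows top-k accuracy: count a row as correct when its rank is at most k.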
- To create a plot of the resulting prediction accuracies, run the script `analyze_result.py`.
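A per-type accuracy breakdown, in the spirit of what `analyze_result.py` visualizes, can be computed from (true type, prediction) pairs as below. This is a sketch only; the actual script's aggregation and plotting may differ.

```python
from collections import defaultdict

def accuracy_by_type(rows):
    # rows: iterable of (true_type, prediction) pairs.
    # Returns the prediction accuracy per true type.
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true, pred in rows:
        totals[true] += 1
        hits[true] += (true == pred)
    return {t: hits[t] / totals[t] for t in totals}

# Hypothetical evaluation rows for illustration:
rows = [("number", "number"), ("number", "string"), ("string", "string")]
acc = accuracy_by_type(rows)
```

Such a breakdown makes it easy to see whether rare, project-specific types are predicted better in the intra-project setting, which is the paper's central claim.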