This document introduces basic usage of GLP package. The main components of this package are frequent subgraph pattern miner; gSpan and (multiple) graph PLS regression.
- Cmake 2.6 or higher.
- Eigen 3.1.2 or higher.
- Boost 1.4.2 or higher.
- (optional) Open-babel
$ make
$ make install$ make
$ make install
$ brew link glp$ cd data
$ python ../tools/sdf2gsp.py -f mutagen.sdf -l mutagen.label -o mutagen.gspIn this example, we use 80% data for training, and the rest 20% data for test.
Generate train/test gsp data first,
$ python ../tools/mk_train_test.py -f mutagen.gsp -b mutagen -r 0.8
[INFO] total: 684
[INFO] Generating train data now: 547
[>>>>>>>>>>>>>>> 100% >>>>>>>>>>>>>>]
[INFO] Generating test data now: 137
[>>>>>>>>>>>>>>> 100% >>>>>>>>>>>>>>]
[INFO] All Done!Then generate responses,
# single
$ head -547 mutagen.label > mutagen_train.label
$ tail -137 mutagen.label > mutagen_test.label
# multiple
$ head -547 Ymatrix.txt > mutagen_train_multiple.label
$ tail -137 Ymatrix.txt > mutagen_test_multiple.labelAn exemplar usage of mining frequent subgraph patterns from graph data
$ gspan -m 10 -L 10 mutagen.gspwhere -m specifies minimum support, and -L specifies maximum pattern size.
It produces three output files, and they are
gspan_m10_nINF_L10_DFS.txt, which contains subgraph patterns in DFS format. For details about DFS format, please see APPENDIX.gspan_m10_nINF_L10_Freq.txt, which contains the frequencies of the enumerated subgraph patterns.gspan_m10_nINF_L10_Features.txtwhich contains binary occurrence matrix of the obtained subgraph patterns.
Type gspan to see more options.
For performing gPLS with a single response label
$ gspls-train --cla -o 0 -n 10 -y mutagen_train.label mutagen_train.gspFor performing gPLS with multiple response labels
$ gspls-train --cla -o 0 -n 10 \
-y mutagen_train_multiple.label mutagen_train.gspwhere --cla speficies classfication mode. Default is regression mode --reg, and -o specifies the number of training data to be separated for validation (in order to choose the number of latent vectors), and -n specifies the maximum number of iterations,
and -y specifies the file with response values.
In multiple-PLS mode, the corresponding file should contain response matrix.
It produces two output files, and they are
gspls_m2_L10_n10_k1_t3_DFS.txt, which contains subgraph patterns in DFS format.gspls_m2_L10_n10_k1_t3_Beta.txt, which contains regression coefficients.
Type gspls-train to see more options.
$ gspls-classify --cla -d gspls_m2_L10_n10_k1_t3_DFS.txt \
-b gspls_m2_L10_n10_k1_t3_Beta.txt -y mutagen_test.label mutagen_test.gspwhere -b specifies the file with learned regression weights, and
-y specifies the file with response values.
In multiple-PLS mode, the corresponding file contains response matrix.
Type gspls-classify to see more options.
The scripts used here only works with Open-babel installed and need OS X environment
One way to visualize obtained subgraph patterns is to first convert the obtained raw DFS files into SDF/MOL files. Then one can use chemical conversion software openbabel to generate PNG files.
$ mkdir -p sdfs
$ ../tools/dfs2sdf.pl gspls_m2_L10_n10_k1_t3_DFS.txt
$ ../tools/sdf2png.sh
$ open sdfs/1_10.svg.png- Shao, Z., Hirayama, Y., Yamanishi, Y., Saigo, H.: Mining discriminative patterns from graph data with multiple labels and its application to QSAR, to appear in Journal of Chemical Information and Modeling, 2015.
- Saigo, H., Kashima, H., Tsuda, K.: Fast iterative mining using sparsity-inducing loss functions , IEICE Transaction on Information and Systems, Vol.E96-D No.8 pp.1766-1773, 2013.
- T. Kudo, E. Maeda and Y. Matsumoto: An application of boosting to graph classification. In Advances in Neural Information Processing Systems 17, 729-736, MIT Press, 2005.
- H. Saigo, T. Kadowaki and K. Tsuda: A linear programming approach for molecular QSAR analysis. International Workshop on Mining and Learning with Graphs 2006, 85-96, 2006.
- H. Saigo, S. Nowozin, T. Kudo and K. Tsuda: gBoost: A mathematical programming approach to graph classification and regression, Machine Learning, 75(1) 69-89, 2009.
- H. Saigo, N. Kraemer and K. Tsuda: Partial Least Squares Regression for Graph Mining, In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2008), 578-586, 2008.
- H. Saigo and K. Tsuda: Iterative Subgraph Mining for Principal Component Analysis In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM2008), 1007-1012, 2008.
Zheng Shao, Taku Kudo, Hiroto Saigo, Koji Tsuda
- github issue
- axot[at]msn.com
- hiroto.saigo[at]gmail.com
Let’s begin with the following simple example, which stands C-C.
(6) 1 (0f6)
The first (6) stands for the atom label (6 means Carbon). 1 (0f6) means that there exists a forward edge from the atom index 0 to the next atom labeled 6, with bond label 1 (single bond). It then from a single chemical structure C-C. The atom indices are automatically given by the order of their appearance, and starts with 0.
The next example stands for C-C=C-C=C-C=, where the last bond is connected to the first atom, and compose a ring structure.
(6) 1 (0f6) 2 (1f6) 1 (2f6) 2 (3f6) 1 (4f6) 2 (b0)
2 (b0)at the end of the string means that there exists a double bond from the current atom to the 1st atom (index 0). This bond is named as backward edge. Only backward edge makes circular structure, or cyclic graphs.
Code is under the GPL v2.0.