[ENH] Update AptaNet notebook to use AptaTrans data and a pdb for prediction #153
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| __author__ = "satvshr" | ||
| __all__ = ["load_train_li2014", "load_test_li2014"] | ||
| import os | ||
|
|
||
| import pandas as pd | ||
|
|
||
|
|
||
| def load_train_li2014(): | ||
| """ | ||
| Load the Li 2014 training dataset. | ||
|
Contributor
Please add some description of the dataset here: what the columns are, what they mean, what values they can take, etc. |
||
| Returns | ||
| ------- | ||
| X : pandas.DataFrame | ||
| Feature matrix. | ||
| y : pandas.Series | ||
|
Contributor
I would make both |
||
| Labels/target. | ||
| """ | ||
| # Path relative to this file | ||
| path = os.path.abspath( | ||
| os.path.join(os.path.dirname(__file__), "..", "data", "train_li2014.csv") | ||
| ) | ||
|
|
||
| df = pd.read_csv(path) | ||
|
|
||
| # Basic assumption: last column is the label | ||
| X = df.iloc[:, :-1] | ||
| y = df.iloc[:, -1] | ||
|
|
||
| return X, y | ||
|
|
||
|
|
||
| def load_test_li2014(): | ||
| """ | ||
| Load the Li 2014 test dataset. | ||
| Returns | ||
| ------- | ||
| X : pandas.DataFrame | ||
| Feature matrix. | ||
| y : pandas.Series | ||
| Labels/target. | ||
| """ | ||
| # Path relative to this file | ||
| path = os.path.abspath( | ||
| os.path.join(os.path.dirname(__file__), "..", "data", "test_li2014.csv") | ||
| ) | ||
|
|
||
| df = pd.read_csv(path) | ||
|
|
||
| # Basic assumption: last column is the label | ||
| X = df.iloc[:, :-1] | ||
| y = df.iloc[:, -1] | ||
|
|
||
| return X, y | ||
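
Both loaders above split the frame on the assumption that the last column is the label. As a minimal sketch of a name-based alternative: the helper `split_features_target` below is hypothetical and not part of this PR, and the default column name `"label"` is taken from `TARGET_COL` in the CSV loader test in this diff. The toy frame stands in for the real CSV and its values are made up.

```python
import pandas as pd


def split_features_target(df, target_col="label"):
    """Split a DataFrame into (X, y) by target column name.

    Hypothetical helper, not part of pyaptamer. ``target_col`` defaults to
    "label" on the assumption that the CSV names its target column that way.
    """
    if target_col not in df.columns:
        raise KeyError(f"target column {target_col!r} not in {list(df.columns)}")
    X = df.drop(columns=[target_col])  # every column except the target
    y = df[target_col]  # the target as a pandas Series
    return X, y


# Toy frame standing in for the real CSV; the values are made up.
toy = pd.DataFrame(
    {"label": [1, 0], "aptamer": ["ACGU", "GGCA"], "protein": ["MKV", "LLS"]}
)
X, y = split_features_target(toy)
```

Selecting the target by name keeps the split correct even if the column order in the CSV changes, at the cost of requiring the column name to be known.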
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| __author__ = "satvshr" | ||
|
|
||
| import pandas as pd | ||
|
|
||
| from pyaptamer.datasets._loaders._csv_loader import load_csv_dataset | ||
|
|
||
| DATASET_NAME = "train_li2014" | ||
| TARGET_COL = "label" | ||
|
|
||
|
|
||
| def test_load_csv_returns_df(): | ||
| """ | ||
| When return_X_y=False the loader should return the full DataFrame containing the | ||
| target column. | ||
| """ | ||
| df = load_csv_dataset(DATASET_NAME) | ||
|
|
||
| assert isinstance(df, pd.DataFrame), "Returned object should be a pandas DataFrame" | ||
| assert TARGET_COL in df.columns, ( | ||
| f"DataFrame must contain the target column '{TARGET_COL}'" | ||
| ) | ||
| assert df.shape[0] > 0, "DataFrame should not be empty" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| __author__ = "satvshr" | ||
|
|
||
| import pandas as pd | ||
| import pytest | ||
|
|
||
| from pyaptamer.datasets._loaders._li2014 import ( | ||
| load_test_li2014, | ||
| load_train_li2014, | ||
| ) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize( | ||
| "loader", | ||
| [load_train_li2014, load_test_li2014], | ||
| ) | ||
| def test_loader_li2014(loader): | ||
| """ | ||
| The loader should return a tuple (X, y) where: | ||
| - X is a DataFrame | ||
| - y is a Series | ||
| - they have matching lengths | ||
| """ | ||
| X, y = loader() | ||
|
|
||
| assert isinstance(X, pd.DataFrame), "X should be a pandas DataFrame" | ||
| assert isinstance(y, pd.Series), "y should be a pandas Series" | ||
|
|
||
| assert len(X) == len(y), "X and y must have the same number of rows" | ||
| assert X.shape[0] > 0, "X should not be empty" | ||
| assert y.shape[0] > 0, "y should not be empty" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,6 +4,7 @@ | |
| from itertools import product | ||
|
|
||
| import numpy as np | ||
| import pandas as pd | ||
|
|
||
| from pyaptamer.pseaac import AptaNetPSeAAC | ||
|
|
||
|
|
@@ -59,20 +60,18 @@ def generate_kmer_vecs(aptamer_sequence, k=4): | |
| def pairs_to_features(X, k=4): | ||
|
Contributor
I don't think this needs changing; this should be a transformer anyway.
Collaborator
Author
It was changed to allow |
||
| """ | ||
| Convert a list of (aptamer_sequence, protein_sequence) pairs into feature vectors. | ||
| Also supports a pandas DataFrame with 'aptamer' and 'protein' columns. | ||
|
|
||
| This function generates feature vectors for each (aptamer, protein) pair using: | ||
|
|
||
|
|
||
| - k-mer representation of the aptamer sequence | ||
| - Pseudo amino acid composition (PSeAAC) representation of the protein sequence | ||
|
|
||
|
|
||
| Parameters | ||
| ---------- | ||
| X : list of tuple of str | ||
| A list where each element is a tuple `(aptamer_sequence, protein_sequence)`. | ||
| `aptamer_sequence` should be a string of nucleotides, and `protein_sequence` | ||
| should be a string of amino acids. | ||
| X : list of tuple of str or pandas.DataFrame | ||
| A list where each element is a tuple `(aptamer_sequence, protein_sequence)`, | ||
| or a DataFrame containing 'aptamer' and 'protein' columns. | ||
|
|
||
| k : int, optional | ||
| The k-mer size used to generate the k-mer vector from the aptamer sequence. | ||
|
|
@@ -85,9 +84,14 @@ def pairs_to_features(X, k=4): | |
| for a given (aptamer, protein) pair. | ||
| """ | ||
| pseaac = AptaNetPSeAAC() | ||
|
|
||
| feats = [] | ||
| for aptamer_seq, protein_seq in X: | ||
|
|
||
| if isinstance(X, pd.DataFrame): | ||
| pairs = zip(X["aptamer"], X["protein"], strict=False) | ||
| else: | ||
| pairs = X | ||
|
|
||
| for aptamer_seq, protein_seq in pairs: | ||
| kmer = generate_kmer_vecs(aptamer_seq, k=k) | ||
| pseaac_vec = np.asarray(pseaac.transform(protein_seq)) | ||
| feats.append(np.concatenate([kmer, pseaac_vec])) | ||
|
|
||
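
The review thread on `pairs_to_features` suggests it should be a transformer. A rough sketch of what that could look like follows, using the `fit`/`transform` convention without depending on any particular base class. The class name `PairsToFeatures` and the two toy featurizers are hypothetical. They stand in for `generate_kmer_vecs` and `AptaNetPSeAAC`, whose implementations are not shown in this diff.

```python
import numpy as np
import pandas as pd


class PairsToFeatures:
    """Sketch of a transformer-style wrapper around the pair featurization.

    Hypothetical class, not part of pyaptamer. ``aptamer_featurizer`` and
    ``protein_featurizer`` are any callables mapping a sequence string to a
    1D numeric vector.
    """

    def __init__(self, aptamer_featurizer, protein_featurizer):
        self.aptamer_featurizer = aptamer_featurizer
        self.protein_featurizer = protein_featurizer

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        # Accept either a DataFrame with 'aptamer'/'protein' columns or an
        # iterable of (aptamer, protein) tuples.
        if isinstance(X, pd.DataFrame):
            pairs = zip(X["aptamer"], X["protein"])
        else:
            pairs = X
        feats = [
            np.concatenate(
                [
                    np.asarray(self.aptamer_featurizer(a), dtype=float),
                    np.asarray(self.protein_featurizer(p), dtype=float),
                ]
            )
            for a, p in pairs
        ]
        return np.vstack(feats)


def base_counts(seq, alphabet="ACGU"):
    """Toy featurizer: count of each letter of ``alphabet`` in ``seq``."""
    return [seq.count(ch) for ch in alphabet]


def length_only(seq):
    """Toy featurizer: sequence length as a one-element vector."""
    return [len(seq)]


transformer = PairsToFeatures(base_counts, length_only)
as_list = transformer.fit(None).transform([("ACGU", "MKV"), ("GGCA", "LLSTT")])
as_frame = transformer.transform(
    pd.DataFrame({"aptamer": ["ACGU", "GGCA"], "protein": ["MKV", "LLSTT"]})
)
```

Both input forms produce the same feature matrix, so the list-or-DataFrame handling lives in one place instead of inside a free function.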
I would make a single `load_li2014(split=None)`, where the default loads the concatenation of both, and you can also select `"train"` and `"test"`.
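
One way the suggested single loader could look is sketched below. This is not the implementation in the PR. The `data_dir` parameter is a hypothetical addition so the example can run outside the package, the file names follow the two loaders in this diff, and returning a full DataFrame rather than an `(X, y)` pair is an assumption. The CSV contents in the demonstration are made up.

```python
import os
import tempfile

import pandas as pd


def load_li2014(split=None, data_dir="."):
    """Load the Li 2014 dataset, optionally restricted to one split.

    Sketch only. ``data_dir`` is a hypothetical parameter; the file names
    follow the train/test loaders in this PR.
    """
    files = {"train": "train_li2014.csv", "test": "test_li2014.csv"}
    if split is None:
        # Default: concatenate train and test, in that order.
        frames = [pd.read_csv(os.path.join(data_dir, f)) for f in files.values()]
        return pd.concat(frames, ignore_index=True)
    if split not in files:
        raise ValueError(f"split must be None, 'train' or 'test', got {split!r}")
    return pd.read_csv(os.path.join(data_dir, files[split]))


# Demonstration on throwaway CSVs with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"feature": [0.1, 0.2], "label": [1, 0]}).to_csv(
        os.path.join(tmp, "train_li2014.csv"), index=False
    )
    pd.DataFrame({"feature": [0.3], "label": [1]}).to_csv(
        os.path.join(tmp, "test_li2014.csv"), index=False
    )
    full = load_li2014(data_dir=tmp)
    train = load_li2014("train", data_dir=tmp)
    test = load_li2014("test", data_dir=tmp)
```

Validating `split` before touching the filesystem means a typo such as `"valid"` fails with a clear `ValueError` rather than a missing-file error.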