[ENH] The DeepAptamer algorithm #98
base: main
Conversation
@NennoMP Could you check out the model architecture? It should match their implementation here, which I think it does. Kindly ignore the comments left by GPT; I had it add them to trace the flow of shapes through the network, and I will remove them later.
Sure. I will do it during the weekend.
I left a few comments.
I haven't finished looking at the authors' code. To be honest, it has to be one of the worst, most monolithic code bases I have seen 😁.
EDIT: I just noticed that some shapes/layer dimensions in their code do not match the paper. In Figure 3 the output of the BiLSTMs is shown as (m, 128), which means these layers have hidden_size = 64, multiplied by a factor of 2 due to bidirectionality. In their code they use hidden_size = 100 in create_combine(...). They do use hidden_size = 64 in create_bilstm(...), though.
At this point I am not sure which of these methods they actually use to build the model. If you discover more, let me know.
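As a quick sanity check of the dimension question above, a minimal PyTorch sketch (illustrative sizes, not the authors' code) shows that bidirectionality doubles the per-position feature dimension, so hidden_size=64 yields 128 output features while hidden_size=100 would yield 200:

```python
import torch
import torch.nn as nn

# Illustrative input: (batch, seq_len, input_size); sizes are assumptions.
x = torch.randn(8, 35, 16)

# hidden_size=64, bidirectional -> 64 * 2 = 128 features per position
bilstm_64 = nn.LSTM(input_size=16, hidden_size=64,
                    bidirectional=True, batch_first=True)
out_64, _ = bilstm_64(x)
print(out_64.shape)  # torch.Size([8, 35, 128])

# hidden_size=100, bidirectional -> 100 * 2 = 200 features per position
bilstm_100 = nn.LSTM(input_size=16, hidden_size=100,
                     bidirectional=True, batch_first=True)
out_100, _ = bilstm_100(x)
print(out_100.shape)  # torch.Size([8, 35, 200])
```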
I had observed that they mentioned using 2D convolutions, but in the code they used 1D. Franz said to just stick with what is done in the code, so I am no longer looking at the paper.
Can you pinpoint the page/section in the paper where they mention using 2D convolutions? I cannot find it. Anyway, 1D convolutions are probably correct. If they are processing aptamer/protein sequences as I would expect, their shape should look like
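Assuming the sequences are one-hot encoded over {A, C, G, T} (an assumption, since the exact preprocessing is under discussion), the input to a 1D convolution would be (batch, 4 channels, sequence length), with the kernel sliding along the length dimension only:

```python
import torch
import torch.nn as nn

# Assumed input layout: (batch, channels=4 nucleotides, sequence length)
batch, length = 8, 35
x = torch.randn(batch, 4, length)

# out_channels and kernel_size here are illustrative, not the paper's values
conv = nn.Conv1d(in_channels=4, out_channels=16, kernel_size=3, padding=1)
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 35]) -- length preserved by padding=1
```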
I thought you'd be used to it at this point :D
@NennoMP Is there no framework built on top of PyTorch which does the training for us, so we don't need to write training code ourselves, like Lightning maybe?
I see. Either a typo, or they could have just (wrongly) abused the term. I say this because Fig. 3, showcasing the architecture, graphically represents a 1D convolution (where you see the dotted lines in the top). If this is the only typo, I think the paper can still be helpful to cross-check the code implementation.
That's a good idea... I think Lightning should definitely be able to train
Cannot use lightning due to lower numpy version requirements because of |
@fkiraly Any idea how to set up an environment to run tests for a downgraded version of
@fkiraly Tests pass locally in my own environment; parking this PR till the
NennoMP left a comment:
I left a few comments; I don't know if @fkiraly agrees, but I think they would be improvements.
pyaptamer/deepatamer/_pipeline.py (outdated):

        and its predicted binding probability.
        """

    def __init__(self, model, use_126_shape=True, device="cpu"):
I don't like the use_126_shape variable name 😁, can we find an easier-to-interpret one? What is the difference between the 126 and the 138 shape? Would a flag like full_dna_shape: bool describe it well?
I completely agree, I was not a fan of the variable name either and was honestly waiting for Franz to suggest a name 😅 I'll change it to what you're suggesting as it is way better!
Yes, that was a question I had too. How do we arrive at 126 or 138, respectively?
I mentioned this today in the standup and I also mentioned it over here.
TLDR: Both the Python (DeepDNAshape) and R (original) implementations of DNAshape output 138 vectors, but in the R case 12 of them are NA vectors (hence 138 - 12 = 126). DeepAptamer uses the 126 vectors left after removing the NA values from the R implementation.
Sorry for missing the question the first time @NennoMP.
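The 138-vs-126 arithmetic above can be sketched with NumPy (hypothetical data; the positions of the 12 NA columns here are arbitrary, not the real DNAshape ones):

```python
import numpy as np

# Stand-in for the R DNAshape output: 10 sequences, 138 feature columns,
# of which 12 are NA (placed at the front here purely for illustration).
features = np.random.rand(10, 138)
features[:, :12] = np.nan

# Drop every column that contains an NA, leaving the 126 DeepAptamer uses.
na_cols = np.isnan(features).any(axis=0)
clean = features[:, ~na_cols]
print(clean.shape)  # (10, 126)
```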
Changed var name
Can this be resolved or is there something else to work on @NennoMP ?
> Can this be resolved or is there something else to work on @NennoMP ?

I left it open because Franz was also interested in this and I don't know what he thinks about the new version. For me it's ok, but I would leave it open for his reference.
@satvshr I think the architecture/model doesn't need further changes. The only thing I would give feedback on is the docstrings. To avoid commenting on work-in-progress stuff: are they finished and ready for review, or are you still working on them?
They're ready, have a go.
NennoMP left a comment:
Comments about docstrings.
Main questions:
- Why do we need to one-hot encode already numerical labels (e.g., `1` to `[1, 0]`) in `preprocess_y`? Also, I don't see the method being used anywhere.
- Currently, `DeepAptamer` expects DNA sequences of length `<= 35` due to how padding is applied in `pad_sequence`. Can't we make this generic via a parameter, and pass a default value of `35` from the pipeline if that's the value used by the authors? I feel like otherwise this is very restrictive for the aptamer inputs.
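A hypothetical sketch of the parametrized padding suggested above; the real `pad_sequence` in the PR may have a different signature, and the `"N"` pad character is an assumption:

```python
def pad_sequence(seq: str, max_len: int = 35, pad_char: str = "N") -> str:
    """Right-pad a DNA sequence to max_len, rejecting longer inputs."""
    if len(seq) > max_len:
        raise ValueError(
            f"sequence length {len(seq)} exceeds max_len={max_len}"
        )
    return seq + pad_char * (max_len - len(seq))

print(len(pad_sequence("ACGT")))      # 35 (the authors' default)
print(len(pad_sequence("ACGT", 50)))  # 50 (longer aptamers now supported)
```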
This is something I forgot to ask you and Franz! They do it in the paper but I have no idea why. The method is not being used anywhere right now, as it will be part of the notebook.
I was thinking the same thing while talking about PSeAAC today; what was the conclusion on that topic, by the way?
|
Everything with a 👍 above has been changed, keeping the
I am pretty sure they do not need to do it. This is a multi-class classification problem (binary, to be specific), and they probably confused it with a multi-output classification problem, where for each sample you need to predict multiple labels. Can we try to remove the one-hot encoding on the labels (i.e., keep them as integer class labels)?
The consensus is to make it a parameter. For PSeAAC, I will create a small PR to generalize the expected length of protein sequences; I will check your PR #60 to see what you already did to generalize the algorithm.
Hmmm, makes sense. I'm typing from my phone so I can't quote-reply, but in #60 I think I was focused more on making it more generalized in terms of feature groups. If it is interfering with your integration, feel free to make the
Already did that in the latest commit.
One minor comment about a docstring which still mentioned length of Default values to This should make
Right.
seems more logical too.
NennoMP left a comment:
Various (minor) comments on docstrings.
One is about passing device as a string (e.g., "cpu") and then using it in torch.tensor(...). I don't know whether this works, as torch.tensor(...) expects device to be a torch.device instance. We should check whether it works or an error is raised. Even if it works, I think it would be better to make the parameter a torch.device instance (consistent with what PyTorch has in its documentation).
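For reference, the question above can be checked in a couple of lines: torch.tensor(...) does accept a device string and normalizes it to a torch.device internally, so both forms behave the same (the API-consistency argument for passing torch.device still stands):

```python
import torch

# Passing the device as a string...
t_str = torch.tensor([1.0, 2.0], device="cpu")
# ...and as an explicit torch.device instance
t_dev = torch.tensor([1.0, 2.0], device=torch.device("cpu"))

print(t_str.device)                          # device(type='cpu')
print(t_str.device == t_dev.device)          # True
```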
Will address the above comments in the thread, but the absolute feature values like 4 and 126 can't be changed as they are fixed outputs: 4 because of the one-hot encoding of the nucleotides into ACGT vectors, 126 because that is the fixed shape output.
Ok, |
Yes, that is true, will edit the comment based on that.
@NennoMP can you resolve old comments if they have been fixed now? |
fkiraly left a comment:
Looks ok, high-level - can you kindly ensure the new tests pass?
Can you define "new tests" (ambiguous, ha!)? The test I wrote for deepaptamer passes locally.
Look at the tests, they are failing due to import errors.
We decided to park this until
I see, so this is the point where we are going to park this PR, right?
If it is in a "mergeable" state, yes.
NOTE: This PR is parked until the `DeepDNAshape` call is replaced (refer to #110).

Closes #82