
Conversation

@satvshr (Collaborator) commented Aug 14, 2025

NOTE: This PR is parked until DeepDNAshape call is replaced (refer to #110).
closes #82

@satvshr satvshr self-assigned this Aug 14, 2025
@satvshr satvshr added the enhancement New feature or request label Aug 14, 2025
@satvshr (Collaborator, Author) commented Aug 16, 2025

@NennoMP Could you check out the model architecture? It should match their implementation here, which I think it does. Kindly ignore the comments left by GPT; I made it put them there in order to understand the flow of shapes throughout the network. I will remove them later.

@NennoMP (Collaborator) commented Aug 16, 2025

Sure. I will do it during the weekend.

@NennoMP NennoMP self-requested a review August 16, 2025 18:56
@NennoMP (Collaborator) left a comment

I left a few comments.

I haven't finished looking at the authors' code. To be honest, it has to be one of the worst, most monolithic code bases I have seen 😁.

EDIT: I just noticed that some shapes/layer dimensions in their code do not match the paper. In Figure 3 the output of the BiLSTMs is shown as (m, 128), which means these layers have hidden_size = 64, which is then multiplied by a factor of 2 due to bidirectionality. In their code they use hidden_size = 100 in create_combine(...). They do use hidden_size = 64 in create_bilstm(...), though.

At this point I am not sure which methods they actually use to build the model. If you discover more, let me know.
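The dimension arithmetic in the comment above is easy to sanity-check: a bidirectional LSTM concatenates the forward and backward hidden states, so its per-step output width is twice hidden_size. A minimal sketch in plain Python (the helper name is illustrative, not from the PR):

```python
def bilstm_output_dim(hidden_size, bidirectional=True):
    """Per-timestep output width of an (optionally bidirectional) LSTM."""
    return hidden_size * (2 if bidirectional else 1)

# hidden_size = 64 with bidirectionality yields a 128-wide output,
# while hidden_size = 100 (as in create_combine) would yield 200.
print(bilstm_output_dim(64))   # 128
print(bilstm_output_dim(100))  # 200
```

Whichever value the figure actually shows, it should equal twice the hidden_size used in create_bilstm(...).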

@satvshr (Collaborator, Author) commented Aug 18, 2025

> I left a few comments.
>
> I haven't finished looking at the authors' code. To be honest, it has to be one of the worst, most monolithic code bases I have seen 😁.
>
> EDIT: I just noticed that some shapes/layer dimensions in their code do not match the paper. In Figure 3 the output of the BiLSTMs is shown as (m, 128), which means these layers have hidden_size = 64, which is then multiplied by a factor of 2 due to bidirectionality. In their code they use hidden_size = 100 in create_combine(...). They do use hidden_size = 64 in create_bilstm(...), though.
>
> At this point I am not sure which methods they actually use to build the model. If you discover more, let me know.

I had observed that they mentioned using 2D convolutions, but in the code they used 1D. Franz said to just stick with what is being done in the code, so I am no longer looking at the paper.

@NennoMP (Collaborator) commented Aug 18, 2025

> I had observed that they mentioned using 2D convolutions, but in the code they used 1D. Franz said to just stick with what is being done in the code, so I am no longer looking at the paper.

Can you pinpoint the page/section in the paper where they mention using 2D convolutions? I cannot find it.

Anyway, 1D convolutions are probably correct. If they are processing aptamer/protein sequences, as I would expect, their shape should look like (batch, length, n_features). The length is treated as the channel dimension and the 1D kernel slides through the features, I guess? Thus I don't see how they could even use 2D kernels on one-dimensional feature vectors.
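Whichever axis ends up as the sliding axis, the 1D case is easy to sanity-check with the standard output-length formula (the same one PyTorch documents for Conv1d). A small plain-Python sketch, not the authors' code:

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution along the sliding axis."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# A 35-long aptamer sequence, kernel_size=3:
print(conv1d_out_len(35, 3))             # 33 (no padding)
print(conv1d_out_len(35, 3, padding=1))  # 35 (length-preserving)
```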

@satvshr (Collaborator, Author) commented Aug 18, 2025

> Can you pinpoint the page/section in the paper where they mention using 2D convolutions? I cannot find it.

Have a look here.

> I haven't finished looking at the authors' code. To be honest, it has to be one of the worst, most monolithic code bases I have seen 😁.

I thought you'd be used to it at this point :D

@satvshr (Collaborator, Author) commented Aug 18, 2025

@NennoMP is there no framework built on top of PyTorch that does the training for us, so we don't need to write training code ourselves? Lightning, maybe?

@NennoMP (Collaborator) commented Aug 18, 2025

> Can you pinpoint the page/section in the paper where they mention using 2D convolutions? I cannot find it.

> Have a look here.

I see. Either it's a typo, or they (wrongly) abused the term. I say this because Fig. 3, which showcases the architecture, graphically represents a 1D convolution (see the dotted lines in the top Conv + ReLU + Pool block).

If this is the only typo I think the paper can still be helpful to cross-check the code implementation.

@NennoMP (Collaborator) commented Aug 18, 2025

> @NennoMP is there no framework built on top of PyTorch that does the training for us, so we don't need to write training code ourselves? Lightning, maybe?

That's a good idea... I think Lightning should definitely be able to train nn.Module instances? We should look into it. I have never used it, but for the library's scope it would be a lot better if we can delegate all training to another package.

@satvshr (Collaborator, Author) commented Aug 19, 2025

Cannot use Lightning due to the lower NumPy version required by Deep DNAshape.

@satvshr (Collaborator, Author) commented Aug 19, 2025

@fkiraly Any idea how to set up an environment to run tests with downgraded versions of numpy and tensorflow (for Deep DNAshape)?

@satvshr satvshr requested a review from NennoMP August 19, 2025 20:30
@satvshr (Collaborator, Author) commented Aug 21, 2025

@fkiraly Tests are passing locally in my own environment; parking this PR till the DeepDNAshape problem gets sorted out.

@NennoMP (Collaborator) left a comment

I left a few comments; I don't know whether @fkiraly agrees, but I think they would be improvements.

and its predicted binding probability.
"""

def __init__(self, model, use_126_shape=True, device="cpu"):
NennoMP (Collaborator) commented:

I don't like the use_126_shape variable name 😁, can we find an easier-to-interpret one? What is the difference between the 126 and the 138 shape? Would a flag like full_dna_shape: bool describe it well?

satvshr (Collaborator, Author) replied:

I completely agree, I was not a fan of the variable name either and was honestly waiting for Franz to suggest a name 😅 I'll change it to what you're suggesting, as it is way better!

@fkiraly (Contributor) commented Aug 21, 2025:

Yes, that was a question I had too. How do we arrive at 126 or 138, respectively?

@satvshr (Collaborator, Author) replied Aug 21, 2025:

I mentioned this today in the standup and I also mentioned it over here.

TL;DR: Both the Python (Deep DNAshape) and the R (original) implementations of DNAshape output 138 feature vectors, but the R output contains 12 NA vectors (hence 138 - 12 = 126). DeepAptamer uses the 126 vectors left after removing the NA values from the R implementation.

sorry for missing the question the first time @NennoMP
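To make the 126/138 distinction above concrete, here is a minimal sketch of how a flag could select the feature dimension (plain Python; full_dna_shape and the constant names are illustrative, not the PR's actual code):

```python
N_SHAPE_FULL = 138  # feature vectors output by DNAshape
N_SHAPE_NA = 12     # vectors that are NA in the R implementation

def shape_feature_dim(full_dna_shape):
    """138 when keeping everything, 126 after dropping the 12 NA vectors."""
    return N_SHAPE_FULL if full_dna_shape else N_SHAPE_FULL - N_SHAPE_NA

print(shape_feature_dim(True))   # 138
print(shape_feature_dim(False))  # 126
```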

satvshr (Collaborator, Author) replied:

Changed the variable name.

satvshr (Collaborator, Author) replied:

Can this be resolved or is there something else to work on @NennoMP ?

NennoMP (Collaborator) replied:

> Can this be resolved or is there something else to work on @NennoMP ?

I left it open because Franz was also interested in this and I don't know what he thinks about the new version. For me it's ok, but I would leave it open for his reference.

@satvshr satvshr marked this pull request as ready for review August 21, 2025 12:45
@satvshr satvshr requested a review from NennoMP August 25, 2025 19:26
@NennoMP (Collaborator) commented Aug 25, 2025

@satvshr I think the architecture/model doesn't need further changes. The only thing I would provide feedback on is the docstrings. To avoid commenting on work-in-progress stuff, are they finished and ready for review or are you still working on them?

@satvshr (Collaborator, Author) commented Aug 25, 2025

> @satvshr I think the architecture/model doesn't need further changes. The only thing I would provide feedback on is the docstrings. To avoid commenting on work-in-progress stuff, are they finished and ready for review or are you still working on them?

They're ready, have a go.

@NennoMP (Collaborator) left a comment

Comments about docstrings.

Main questions:

  • Why do we need to one-hot encode already numerical labels (e.g., 1 to [1, 0]) in preprocess_y? Also, I don't see the method being used anywhere.
  • Currently, DeepAptamer expects DNA sequences of length <= 35 due to how padding is applied in pad_sequence. Can't we make this generic via a parameter, and pass a default value of 35 from the pipeline if that's the value used by the authors? I feel like otherwise this is very restrictive for the aptamer inputs.

@satvshr (Collaborator, Author) commented Aug 26, 2025

> Why do we need to one-hot encode already numerical labels (e.g., 1 to [1, 0]) in preprocess_y? Also, I don't see the method being used anywhere.

This is something I forgot to ask you and Franz! They do in the paper but I have no idea why. The method is not being used anywhere right now as it will be a part of the notebook.

>   • Currently, DeepAptamer expects DNA sequences of length <= 35 due to how padding is applied in pad_sequence. Can't we make this generic via a parameter, and pass a default value of 35 from the pipeline if that's the value used by the authors? I feel like otherwise this is very restrictive for the aptamer inputs.

Was thinking the same thing while talking about PSeAAC today; what was the conclusion on that topic, by the way?

@satvshr (Collaborator, Author) commented Aug 26, 2025

Everything with a 👍 above has been changed, keeping the preprocess_y function and the [0, 1] / [1, 0] implementation (for the moment) as it mimics the paper implementation.

@satvshr satvshr requested a review from NennoMP August 26, 2025 11:21
@NennoMP (Collaborator) commented Aug 26, 2025

> Why do we need to one-hot encode already numerical labels (e.g., 1 to [1, 0]) in preprocess_y? Also, I don't see the method being used anywhere.

> This is something I forgot to ask you and Franz! They do in the paper but I have no idea why. The method is not being used anywhere right now as it will be a part of the notebook.

I am pretty sure they do not need to do it. This is a multi-class classification problem (binary, to be specific), and they probably confused it with a multi-output classification problem, where for each sample you need to predict n classes.

Can we try to remove the one-hot encoding on the labels (i.e., keep them as 0 and 1) and see if it works out of the box? If something fails, the cause is probably in their training code, because they may expect one-hot-encoded labels.
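For context, the encoding under discussion expands a scalar binary label into a two-element indicator vector; the mapping 1 → [1, 0] follows the example quoted above. A minimal plain-Python sketch (the function name is illustrative):

```python
def one_hot_label(y):
    """Map a binary label to a 2-vector: 1 -> [1, 0], 0 -> [0, 1]."""
    return [1, 0] if y == 1 else [0, 1]

print([one_hot_label(y) for y in (1, 0, 1)])  # [[1, 0], [0, 1], [1, 0]]
```

As noted above, this step is likely redundant: losses such as PyTorch's CrossEntropyLoss take integer class indices directly, so keeping the labels as 0/1 should work unless the training code explicitly expects one-hot targets.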

>   • Currently, DeepAptamer expects DNA sequences of length <= 35 due to how padding is applied in pad_sequence. Can't we make this generic via a parameter, and pass a default value of 35 from the pipeline if that's the value used by the authors? I feel like otherwise this is very restrictive for the aptamer inputs.

> Was thinking the same thing while talking about PSeAAC today; what was the conclusion on that topic, by the way?

The consensus is to make PSeAAC more generic and remove the constraint of protein sequences of length >= 30. Similarly, we should remove these constraints from DeepAptamer as well.

For PSeAAC, I will create a small PR to generalize the expected length of protein sequences; I will check your PR #60 to see what you already did to generalize the algorithm.

@satvshr (Collaborator, Author) commented Aug 26, 2025

> Wouldn't bet on it, but I am pretty sure they do not need to do it. This is a multi-class classification problem (binary, to be specific), and they probably confused it with a multi-output classification problem, where for each sample you need to predict n classes.

Hmmm, makes sense. I'm typing from my phone so I can't quote-reply, but I think #60 was focused more on making it more generalized in terms of feature groups. If it is interfering with your integration, feel free to make the lambda_value more flexible. If not, I'll make it a part of #60 whenever I get to it, though it's not a priority at the moment.

@satvshr (Collaborator, Author) commented Aug 26, 2025

> we should remove these constraints from DeepAptamer as well.

Already did that in the latest commit.

@NennoMP (Collaborator) commented Aug 26, 2025

One minor comment about a docstring which still mentioned a length of 35, and another one about the calls to pre-processing inside the pipeline.

Default values of 35 in the utility methods are ok. However, since the purpose of such preprocessing/padding is to bring all sequences to the same length (right?), I was wondering whether it would be better to compute the maximum length in the dataset inside the pipeline (where I left the comment) and use that as our seq_len in the utility methods.

This should make DeepAptamer work with any dataset. Opinions?
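The suggestion above (compute the maximum length once in the pipeline and pass it to the padding utility) can be sketched in a few lines of plain Python; pad_dna and the "N" pad symbol are illustrative, not the PR's actual helpers:

```python
def pad_dna(seqs, seq_len=None, pad="N"):
    """Right-pad every sequence to seq_len; default to the dataset maximum."""
    if seq_len is None:
        seq_len = max(len(s) for s in seqs)  # computed once, dataset-wide
    return [s.ljust(seq_len, pad) for s in seqs]

print(pad_dna(["ACGT", "AC"]))  # ['ACGT', 'ACNN']
```

With seq_len=None by default, the pipeline no longer hard-codes 35 and any dataset pads to its own maximum length.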

@satvshr (Collaborator, Author) commented Aug 26, 2025

> This should make DeepAptamer work with any dataset. Opinions?

Right.

> This should make DeepAptamer work with any dataset. Opinions?

Seems more logical too.

@satvshr satvshr requested a review from NennoMP August 26, 2025 12:55
@NennoMP (Collaborator) left a comment

Various (minor) comments on docstrings.

One is about passing device as a string (e.g., "cpu") and then using it in torch.tensor(...). I don't know whether this works, as torch.tensor(...) expects device to be a torch.device instance. We should check whether it works or whether an error is raised. Even if it works, I think it would be better to make the parameter a torch.device instance (consistent with what PyTorch has in its documentation).

@satvshr (Collaborator, Author) commented Aug 31, 2025

Will address the above comments in the thread, but the absolute feature values like 4 and 126 can't be changed, as they are fixed outputs: 4 because of the one-hot encoding of the nucleotides into ACGT vectors, and 126 because that is the fixed shape output.

@NennoMP (Collaborator) commented Aug 31, 2025

> Will address the above comments in the thread, but the absolute feature values like 4 and 126 can't be changed, as they are fixed outputs: 4 because of the one-hot encoding of the nucleotides into ACGT vectors, and 126 because that is the fixed shape output.

Ok, 4 being fixed makes sense, but is that true also for 126? It can be either 126 or 138, no? Depending on whether the pipeline uses full_dna_shape = False or full_dna_shape = True. At least that's what I understood looking at the pipeline docstring.

@satvshr (Collaborator, Author) commented Aug 31, 2025

> Ok, 4 being fixed makes sense, but is that true also for 126? It can be either 126 or 138, no? Depending on whether the pipeline uses full_dna_shape = False or full_dna_shape = True. At least that's what I understood looking at the pipeline docstring.

Yes that is true, will edit the comment based on that.

@satvshr satvshr requested a review from NennoMP August 31, 2025 19:45
@satvshr (Collaborator, Author) commented Sep 1, 2025

@NennoMP can you resolve old comments if they have been fixed now?

@fkiraly (Contributor) left a comment

Looks ok at a high level. Can you kindly ensure the new tests pass?

@satvshr (Collaborator, Author) commented Sep 12, 2025

> can you kindly ensure the new tests pass?

Can you define "new tests" (ambiguous, ha!)? The test I wrote for deepaptamer passes locally.

@fkiraly (Contributor) commented Sep 15, 2025

Look at the tests; they are failing due to import errors.

@satvshr (Collaborator, Author) commented Sep 15, 2025

> Look at the tests; they are failing due to import errors.

We decided to park this until deepDNAshape gets replaced, so the error is going to be there for now. If you want to merge this, I could soft-import it and skip the tests for now, though.

@fkiraly (Contributor) commented Sep 15, 2025

I see, so this is the point where we are going to park this PR, right?

@satvshr (Collaborator, Author) commented Sep 15, 2025

> I see, so this is the point where we are going to park this PR, right?

If it is in a "mergable" state, yes.

Labels: enhancement (New feature or request)
Linked issue: [ENH] clean implementation of DeepAptamer
4 participants