
Conversation

@satvshr (Collaborator) commented Aug 14, 2025

NOTE: This PR is parked until DeepDNAshape call is replaced (refer to #110).
closes #82

@satvshr satvshr self-assigned this Aug 14, 2025
@satvshr satvshr added the enhancement New feature or request label Aug 14, 2025
@satvshr (Collaborator, Author) commented Aug 16, 2025

@NennoMP Could you check out the model architecture? It should match their implementation here, which I think it does. Kindly ignore the comments left by GPT; I made it put them there in order to understand the flow of shapes throughout the network. I will remove them later.

@NennoMP (Collaborator) commented Aug 16, 2025

Sure. I will do it during the weekend.

@NennoMP NennoMP self-requested a review August 16, 2025 18:56
@NennoMP (Collaborator) left a comment

I left a few comments.

I haven't finished looking at the authors' code. To be honest, it has to be one of the worst, most monolithic code bases I have seen 😁.

EDIT: I just noticed that some shapes/layer dimensions in their code do not match the paper. In Figure 3 the output of the BiLSTMs is shown as (m, 128), which means these layers have hidden_size = 64, which is then multiplied by a factor of 2 due to bidirectionality. In their code they use hidden_size = 100 in create_combine(...). They do use hidden_size = 64 in create_bilstm(...), though.

At this point I am not sure which methods they actually use to build the model. If you discover more, let me know.
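The dimension arithmetic in the comment above is easy to sanity-check: a bidirectional LSTM concatenates the forward and backward hidden states, so its per-step output width is twice hidden_size. A minimal sketch in plain Python (the helper name is illustrative, not from the PR):

```python
def bilstm_output_dim(hidden_size, bidirectional=True):
    """Per-timestep output width of an (optionally bidirectional) LSTM."""
    return hidden_size * (2 if bidirectional else 1)

# hidden_size = 64 with bidirectionality yields a 128-wide output,
# while hidden_size = 100 (as in create_combine) would yield 200.
print(bilstm_output_dim(64))   # 128
print(bilstm_output_dim(100))  # 200
```

Whichever value the figure actually shows, it should equal twice the hidden_size used in create_bilstm(...).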

@satvshr (Collaborator, Author) commented Aug 18, 2025

> I left a few comments.
>
> I haven't finished looking at the authors' code. To be honest, it has to be one of the worst, most monolithic code bases I have seen 😁.
>
> EDIT: I just noticed that some shapes/layer dimensions in their code do not match the paper. In Figure 3 the output of the BiLSTMs is shown as (m, 128), which means these layers have hidden_size = 64, which is then multiplied by a factor of 2 due to bidirectionality. In their code they use hidden_size = 100 in create_combine(...). They do use hidden_size = 64 in create_bilstm(...), though.
>
> At this point I am not sure which methods they actually use to build the model. If you discover more, let me know.

I had observed that they mentioned using 2D convolutions, but in the code they used 1D. Franz said to just stick with what is being done in the code, so I am no longer looking at the paper.

@NennoMP (Collaborator) commented Aug 18, 2025

> I had observed that they mentioned using 2D convolutions, but in the code they used 1D. Franz said to just stick with what is being done in the code, so I am no longer looking at the paper.

Can you pinpoint the page/section in the paper where they mention using 2D convolutions? I cannot find it.

Anyway, 1D convolutions are probably correct. If they are processing aptamer/protein sequences, as I would expect, their shape should look like (batch, length, n_features). The length is treated as the channel dimension and the 1D kernel slides through the features, I guess? Thus I don't see how they could even use 2D kernels on one-dimensional feature vectors.
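Whichever axis ends up as the sliding axis, the 1D case is easy to sanity-check with the standard output-length formula (the same one PyTorch documents for Conv1d). A small plain-Python sketch, not the authors' code:

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution along the sliding axis."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# A 35-long aptamer sequence, kernel_size=3:
print(conv1d_out_len(35, 3))             # 33 (no padding)
print(conv1d_out_len(35, 3, padding=1))  # 35 (length-preserving)
```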

@satvshr (Collaborator, Author) commented Aug 18, 2025

> Can you pinpoint the page/section in the paper where they mention using 2D convolutions? I cannot find it.

Have a look here.

> I haven't finished looking at the authors' code. To be honest, it has to be one of the worst, most monolithic code bases I have seen 😁.

I thought you'd be used to it at this point :D

@satvshr (Collaborator, Author) commented Aug 18, 2025

@NennoMP is there no framework built on top of PyTorch that does the training for us, so we don't need to write training code ourselves? Lightning, maybe?

@NennoMP (Collaborator) commented Aug 18, 2025

> Can you pinpoint the page/section in the paper where they mention using 2D convolutions? I cannot find it.

> Have a look here.

I see. Either it's a typo, or they (wrongly) abused the term. I say this because Fig. 3, which showcases the architecture, graphically represents a 1D convolution (see the dotted lines in the top Conv + ReLU + Pool block).

If this is the only typo I think the paper can still be helpful to cross-check the code implementation.

@NennoMP (Collaborator) commented Aug 18, 2025

> @NennoMP is there no framework built on top of PyTorch that does the training for us, so we don't need to write training code ourselves? Lightning, maybe?

That's a good idea... I think Lightning should definitely be able to train nn.Module instances? We should look into it. I have never used it, but for the library's scope it would be a lot better if we can delegate all training to another package.

@satvshr (Collaborator, Author) commented Aug 19, 2025

Cannot use Lightning due to the lower NumPy version required by Deep DNAshape.

@satvshr (Collaborator, Author) commented Aug 19, 2025

@fkiraly Any idea how to set up an environment to run tests with downgraded versions of numpy and tensorflow (for Deep DNAshape)?

@satvshr satvshr requested a review from NennoMP August 19, 2025 20:30
@satvshr (Collaborator, Author) commented Aug 21, 2025

@fkiraly Tests are passing locally in my own environment; parking this PR till the DeepDNAshape problem gets sorted out.

@NennoMP (Collaborator) left a comment

I left a few comments; I don't know whether @fkiraly agrees, but I think they would be improvements.

and its predicted binding probability.
"""

def __init__(self, model, use_126_shape=True, device="cpu"):
NennoMP (Collaborator) commented:

I don't like the use_126_shape variable name 😁, can we find an easier-to-interpret one? What is the difference between the 126 and the 138 shape? Would a flag like full_dna_shape: bool describe it well?

satvshr (Collaborator, Author) replied:

I completely agree, I was not a fan of the variable name either and was honestly waiting for Franz to suggest a name 😅 I'll change it to what you're suggesting, as it is way better!

@fkiraly (Contributor) commented Aug 21, 2025:

Yes, that was a question I had too. How do we arrive at 126 or 138, respectively?

@satvshr (Collaborator, Author) replied Aug 21, 2025:

I mentioned this today in the standup and I also mentioned it over here.

TL;DR: Both the Python (Deep DNAshape) and the R (original) implementations of DNAshape output 138 feature vectors, but the R output contains 12 NA vectors (hence 138 - 12 = 126). DeepAptamer uses the 126 vectors left after removing the NA values from the R implementation.

sorry for missing the question the first time @NennoMP
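To make the 126/138 distinction above concrete, here is a minimal sketch of how a flag could select the feature dimension (plain Python; full_dna_shape and the constant names are illustrative, not the PR's actual code):

```python
N_SHAPE_FULL = 138  # feature vectors output by DNAshape
N_SHAPE_NA = 12     # vectors that are NA in the R implementation

def shape_feature_dim(full_dna_shape):
    """138 when keeping everything, 126 after dropping the 12 NA vectors."""
    return N_SHAPE_FULL if full_dna_shape else N_SHAPE_FULL - N_SHAPE_NA

print(shape_feature_dim(True))   # 138
print(shape_feature_dim(False))  # 126
```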

satvshr (Collaborator, Author) replied:

Changed the variable name.

satvshr (Collaborator, Author) replied:

Can this be resolved or is there something else to work on @NennoMP ?

NennoMP (Collaborator) replied:

> Can this be resolved or is there something else to work on @NennoMP ?

I left it open because Franz was also interested in this and I don't know what he thinks about the new version. For me it's ok, but I would leave it open for his reference.

@satvshr satvshr marked this pull request as ready for review August 21, 2025 12:45
@satvshr satvshr requested a review from NennoMP August 25, 2025 19:26
@NennoMP (Collaborator) commented Aug 25, 2025

@satvshr I think the architecture/model doesn't need further changes. The only thing I would provide feedback on is the docstrings. To avoid commenting on work-in-progress stuff, are they finished and ready for review or are you still working on them?

@satvshr (Collaborator, Author) commented Aug 25, 2025

> @satvshr I think the architecture/model doesn't need further changes. The only thing I would provide feedback on is the docstrings. To avoid commenting on work-in-progress stuff, are they finished and ready for review or are you still working on them?

They're ready, have a go.

@NennoMP (Collaborator) left a comment

Comments about docstrings.

Main questions:

  • Why do we need to one-hot encode already numerical labels (e.g., 1 to [1, 0]) in preprocess_y? Also, I don't see the method being used anywhere.
  • Currently, DeepAptamer expects DNA sequences of length <= 35 due to how padding is applied in pad_sequence. Can't we make this generic via a parameter, and pass a default value of 35 from the pipeline if that's the value used by the authors? I feel like otherwise this is very restrictive for the aptamer inputs.

@satvshr (Collaborator, Author) commented Aug 26, 2025

> Why do we need to one-hot encode already numerical labels (e.g., 1 to [1, 0]) in preprocess_y? Also, I don't see the method being used anywhere.

This is something I forgot to ask you and Franz! They do in the paper but I have no idea why. The method is not being used anywhere right now as it will be a part of the notebook.

>   • Currently, DeepAptamer expects DNA sequences of length <= 35 due to how padding is applied in pad_sequence. Can't we make this generic via a parameter, and pass a default value of 35 from the pipeline if that's the value used by the authors? I feel like otherwise this is very restrictive for the aptamer inputs.

Was thinking the same thing while talking about PSeAAC today; what was the conclusion on that topic, by the way?

@satvshr (Collaborator, Author) commented Aug 26, 2025

Everything with a 👍 above has been changed, keeping the preprocess_y function and the [0, 1] / [1, 0] implementation (for the moment) as it mimics the paper implementation.

@satvshr satvshr requested a review from NennoMP August 26, 2025 11:21
@NennoMP (Collaborator) commented Aug 26, 2025

> Why do we need to one-hot encode already numerical labels (e.g., 1 to [1, 0]) in preprocess_y? Also, I don't see the method being used anywhere.

> This is something I forgot to ask you and Franz! They do in the paper but I have no idea why. The method is not being used anywhere right now as it will be a part of the notebook.

I am pretty sure they do not need to do it. This is a multi-class classification problem (binary, to be specific), and they probably confused it with a multi-output classification problem, where for each sample you need to predict n classes.

Can we try to remove the one-hot encoding on the labels (i.e., keep them as 0 and 1) and see if it works out of the box? If something fails, the cause is probably in their training code, because they may expect one-hot-encoded labels.
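For context, the encoding under discussion expands a scalar binary label into a two-element indicator vector; the mapping 1 → [1, 0] follows the example quoted above. A minimal plain-Python sketch (the function name is illustrative):

```python
def one_hot_label(y):
    """Map a binary label to a 2-vector: 1 -> [1, 0], 0 -> [0, 1]."""
    return [1, 0] if y == 1 else [0, 1]

print([one_hot_label(y) for y in (1, 0, 1)])  # [[1, 0], [0, 1], [1, 0]]
```

As noted above, this step is likely redundant: losses such as PyTorch's CrossEntropyLoss take integer class indices directly, so keeping the labels as 0/1 should work unless the training code explicitly expects one-hot targets.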

>   • Currently, DeepAptamer expects DNA sequences of length <= 35 due to how padding is applied in pad_sequence. Can't we make this generic via a parameter, and pass a default value of 35 from the pipeline if that's the value used by the authors? I feel like otherwise this is very restrictive for the aptamer inputs.

> Was thinking the same thing while talking about PSeAAC today; what was the conclusion on that topic, by the way?

The consensus is to make PSeAAC more generic and remove the constraint of protein sequences of length >= 30. Similarly, we should remove these constraints from DeepAptamer as well.

For PSeAAC, I will create a small PR to generalize the expected length of protein sequences; I will check your PR #60 to see what you already did to generalize the algorithm.

@satvshr (Collaborator, Author) commented Aug 26, 2025

> Wouldn't bet on it, but I am pretty sure they do not need to do it. This is a multi-class classification problem (binary, to be specific), and they probably confused it with a multi-output classification problem, where for each sample you need to predict n classes.

Hmmm, makes sense. I'm typing from my phone so I can't quote-reply, but I think #60 was focused more on making it more generalized in terms of feature groups. If it is interfering with your integration, feel free to make the lambda_value more flexible. If not, I'll make it a part of #60 whenever I get to it, though it's not a priority at the moment.

@satvshr (Collaborator, Author) commented Aug 26, 2025

> we should remove these constraints from DeepAptamer as well.

Already did that in the latest commit.

@NennoMP (Collaborator) commented Aug 26, 2025

One minor comment about a docstring which still mentioned a length of 35, and another one about the calls to pre-processing inside the pipeline.

Default values of 35 in the utility methods are ok. However, since the purpose of such preprocessing/padding is to bring all sequences to the same length (right?), I was wondering whether it would be better to compute the maximum length in the dataset inside the pipeline (where I left the comment) and use that as our seq_len in the utility methods.

This should make DeepAptamer work with any dataset. Opinions?
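The suggestion above (compute the maximum length once in the pipeline and pass it to the padding utility) can be sketched in a few lines of plain Python; pad_dna and the "N" pad symbol are illustrative, not the PR's actual helpers:

```python
def pad_dna(seqs, seq_len=None, pad="N"):
    """Right-pad every sequence to seq_len; default to the dataset maximum."""
    if seq_len is None:
        seq_len = max(len(s) for s in seqs)  # computed once, dataset-wide
    return [s.ljust(seq_len, pad) for s in seqs]

print(pad_dna(["ACGT", "AC"]))  # ['ACGT', 'ACNN']
```

With seq_len=None by default, the pipeline no longer hard-codes 35 and any dataset pads to its own maximum length.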

@satvshr (Collaborator, Author) commented Aug 26, 2025

> This should make DeepAptamer work with any dataset. Opinions?

Right.

> This should make DeepAptamer work with any dataset. Opinions?

Seems more logical too.

@satvshr satvshr requested a review from NennoMP August 26, 2025 12:55
@NennoMP (Collaborator) left a comment

Various (minor) comments on docstrings.

One is about passing device as a string (e.g., "cpu") and then using it in torch.tensor(...). I don't know whether this works, as torch.tensor(...) expects device to be a torch.device instance. We should check whether it works or whether an error is raised. Even if it works, I think it would be better to make the parameter a torch.device instance (consistent with what PyTorch has in its documentation).

@satvshr (Collaborator, Author) commented Aug 31, 2025

Will address the above comments in the thread, but the absolute feature values like 4 and 126 can't be changed, as they are fixed outputs: 4 because of the one-hot encoding of the nucleotides into ACGT vectors, and 126 because that is the fixed shape output.

@NennoMP (Collaborator) commented Aug 31, 2025

> Will address the above comments in the thread, but the absolute feature values like 4 and 126 can't be changed, as they are fixed outputs: 4 because of the one-hot encoding of the nucleotides into ACGT vectors, and 126 because that is the fixed shape output.

Ok, 4 being fixed makes sense, but is that true also for 126? It can be either 126 or 138, no? Depending on whether the pipeline uses full_dna_shape = False or full_dna_shape = True. At least that's what I understood looking at the pipeline docstring.

@satvshr (Collaborator, Author) commented Aug 31, 2025

> Ok, 4 being fixed makes sense, but is that true also for 126? It can be either 126 or 138, no? Depending on whether the pipeline uses full_dna_shape = False or full_dna_shape = True. At least that's what I understood looking at the pipeline docstring.

Yes that is true, will edit the comment based on that.

@satvshr satvshr requested a review from NennoMP August 31, 2025 19:45
@satvshr (Collaborator, Author) commented Sep 1, 2025

@NennoMP can you resolve old comments if they have been fixed now?

@fkiraly (Contributor) left a comment

Looks ok at a high level. Can you kindly ensure the new tests pass?

@satvshr (Collaborator, Author) commented Sep 12, 2025

> can you kindly ensure the new tests pass?

Can you define "new tests" (ambiguous, ha!)? The test I wrote for deepaptamer passes locally.

@fkiraly (Contributor) commented Sep 15, 2025

Look at the tests; they are failing due to import errors.

@satvshr (Collaborator, Author) commented Sep 15, 2025

> Look at the tests; they are failing due to import errors.

We decided to park this until deepDNAshape gets replaced, so the error is going to be there for now. If you want to merge this, I could soft-import it and skip the tests for now, though.

@fkiraly (Contributor) commented Sep 15, 2025

I see, so this is the point where we are going to park this PR, right?

@satvshr (Collaborator, Author) commented Sep 15, 2025

> I see, so this is the point where we are going to park this PR, right?

If it is in a "mergable" state, yes.

Labels: enhancement (New feature or request)
Linked issue: [ENH] clean implementation of DeepAptamer
4 participants