Skip to content

Fix melting functions and add shift_melt_cutsite_pairs#565

Open
manulera wants to merge 14 commits intomasterfrom
issue_487
Open

Fix melting functions and add shift_melt_cutsite_pairs#565
manulera wants to merge 14 commits intomasterfrom
issue_487

Conversation

@manulera
Copy link
Collaborator

Addresses #487 and fixes several errors in melting functions

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes critical bugs in DNA melting functions that were causing incorrect results when processing DNA sequences with single-stranded regions. The primary issue, as detailed in #487, was that the apply_cut function was being called with incorrect cutsite positions when dealing with melted DNA regions. The PR introduces a new shift_melt_cutsite_pairs method to properly adjust cutsite positions to account for single-stranded DNA regions, and adds comprehensive test coverage to ensure the fixes work correctly for both linear and circular sequences with various shift positions.

Changes:

  • Fixed spacer offset calculations in get_ss_meltsites and get_ds_meltsites (changed from addition to subtraction)
  • Added shift_melt_cutsite_pairs method to adjust cutsite positions for single-stranded DNA regions
  • Enhanced apply_cut with allow_overlap parameter to handle overlapping cuts in circular sequences

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
src/pydna/dseq.py Added shift_melt_cutsite_pairs method, fixed get_ss_meltsites spacer bug, optimized Dseq.shifted(), added CircularBytes.shifted() and replace_with() methods, added allow_overlap parameter to apply_cut
src/pydna/alphabet.py Updated regex_ds_melt_factory to accept circular parameter for correct pattern matching in linear vs circular sequences
src/pydna/utils.py Added deduplicate utility function for removing duplicates while preserving order
src/pydna/cre_lox.py Refactored to use new deduplicate utility function
tests/test_module_dseq.py Added comprehensive tests for melting functions with circular sequences and various shift positions
tests/test_module_alphabet.py Updated tests for regex_ds_melt_factory with new signature
.flake8 Added C901 (complexity) to global ignore list

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

break
ovhg -= 1
if is_circular:
left_cut = ((watson % n, ovhg % n), enz)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overhang value (ovhg) represents a signed distance between watson and crick cut positions. Applying modulo operation ovhg % n on line 2884 and 2906 can produce incorrect results for negative overhang values. For example, if ovhg = -4 and n = 11, then ovhg % n = 7, which changes the meaning of the overhang. The watson position should be modulo'd for circular sequences, but the overhang should remain as-is since it represents a relative offset, not an absolute position.

Copilot uses AI. Check for mistakes.
Comment on lines +2855 to +2856
"""Takes a list of cutsite pairs that will be applied to a sequence with parts with ssDNA, and shifts them
so that they only comprise the dsDNA part.
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shift_melt_cutsite_pairs method lacks proper docstring documentation. While there's a brief one-line description, it's missing important details such as: what the method does conceptually (shifts cutsite pairs to account for single-stranded DNA regions), parameter descriptions, return value description, and examples. Given that this is a complex function dealing with DNA sequence manipulation, comprehensive documentation would help future maintainers understand its purpose and usage.

Suggested change
"""Takes a list of cutsite pairs that will be applied to a sequence with parts with ssDNA, and shifts them
so that they only comprise the dsDNA part.
"""Shift cutsite pairs so that overhangs do not fall in ssDNA regions.
This method adjusts a list of cutsite pairs so that the resulting cuts
only span double-stranded (ds) parts of the sequence. It is intended
for use with ``Dseq`` objects that encode single-stranded (ss) regions
using the DNA alphabet from :mod:`pydna.alphabet` (see
:data:`ss_letters_watson` and :data:`ss_letters_crick`).
Each element in ``cutsite_pairs`` is a ``(left_cut, right_cut)`` tuple
representing the cuts delimiting a fragment. A single cut is either
``None`` or a ``((watson_index, ovhg), enzyme)`` tuple:
* ``watson_index`` is the position (0-based) on the Watson strand
where the restriction enzyme cuts.
* ``ovhg`` is the overhang length following the convention used in
:mod:`Bio.Restriction` (positive for 3' overhangs on the Watson
strand, negative for 5' overhangs, and zero for blunt ends).
* ``enzyme`` is the corresponding restriction enzyme object.
For each cut, this method "slides" the cut site across contiguous
ssDNA positions, expanding the overhang length where appropriate, so
that the resulting overhang is anchored entirely within dsDNA. For
circular sequences, indices are wrapped modulo the sequence length and
duplicate cutsite pairs introduced by wrapping are removed.
Parameters
----------
cutsite_pairs
List of ``(left_cut, right_cut)`` tuples to adjust. Each element
should match the structure returned by :meth:`get_cutsite_pairs`.
Returns
-------
List[Tuple[CutSiteType, CutSiteType]]
A new list of cutsite pairs with positions and overhangs shifted so
that overhang bases lie only within dsDNA regions. For circular
sequences, equivalent duplicate pairs are removed.
Notes
-----
The underlying ds/ss annotation is inferred from the encoded sequence
bytes: characters in :data:`ss_letters_watson` and
:data:`ss_letters_crick` are treated as single-stranded on their
respective strands; all other characters are considered part of
double-stranded regions.
Examples
--------
>>> from pydna.dseq import Dseq
>>> from Bio.Restriction import EcoRI
>>> d = Dseq("GAATTCnnnnGAATTC") # toy example with a small ss region
>>> rb = RestrictionBatch([EcoRI])
>>> cutsites = d.cut_with(rb)[EcoRI] # list of individual cutsites
>>> pairs = d.get_cutsite_pairs(cutsites)
>>> shifted_pairs = d.shift_melt_cutsite_pairs(pairs)
>>> isinstance(shifted_pairs, list)
True

Copilot uses AI. Check for mistakes.

def replace_with(self, start, end, replacement) -> "CircularBytes":
"""
Replace the subsequence between start and end with the replacement.
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CircularBytes.replace_with method lacks proper docstring documentation. The brief one-line description should be expanded to include: detailed parameter descriptions (start, end, replacement), return value description, behavior for wrapping cases (when end < start), and examples. The exception raised on line 182-184 should also be documented in a Raises section.

Suggested change
Replace the subsequence between start and end with the replacement.
Replace the subsequence between ``start`` and ``end`` with ``replacement``.
The indices are interpreted on the circular sequence, and the slice from
``start`` (inclusive) to ``end`` (exclusive) is replaced. If ``end < start``,
the subsequence is treated as wrapping around index 0, effectively using
``end + len(self)`` as the end index when computing the length of the
region to replace.
Parameters
----------
start : int
Start index (inclusive) of the subsequence to be replaced, in the
circular sequence. May be negative or larger than ``len(self)``,
in which case the effective index is taken modulo ``len(self)``.
end : int
End index (exclusive) of the subsequence to be replaced, in the
circular sequence. If ``end < start``, the region is interpreted as
wrapping around the origin (i.e. covering ``start .. len(self)`` and
then ``0 .. end``).
replacement : bytes or bytearray or memoryview
Bytes-like object used to replace the selected subsequence. Its
length must exactly match the length of the subsequence being
replaced.
Returns
-------
CircularBytes
A new :class:`CircularBytes` instance where the specified subsequence
has been replaced. The original sequence is not modified.
Raises
------
ValueError
If the length of ``replacement`` does not equal the length of the
subsequence defined by ``start`` and ``end`` (after handling any
wrapping when ``end < start``).
Examples
--------
>>> s = CircularBytes(b"ABCDE")
>>> s.replace_with(1, 3, b"xy")
CircularBytes(b'AxyDE')
>>> s.replace_with(3, 1, b"xyz") # wrapping replacement of b'DEA'
CircularBytes(b'zBCxy')

Copilot uses AI. Check for mistakes.
"Replacement length must match the length of the subsequence"
)

shifted = CircularBytes(replacement + shifted[len(replacement) : len(self)])
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the replace_with method, when reconstructing the CircularBytes after replacement, the slice uses shifted[len(replacement) : len(self)]. However, for consistency and correctness, this should use the calculated length end - start instead of len(replacement). While these should be equal due to the validation check on line 181-184, using end - start would be more explicit and maintainable. Consider changing to shifted[end - start : len(self)].

Suggested change
shifted = CircularBytes(replacement + shifted[len(replacement) : len(self)])
shifted = CircularBytes(replacement + shifted[end - start : len(self)])

Copilot uses AI. Check for mistakes.
watson -= 1
ovhg -= 1
if is_circular:
right_cut = ((watson % n, ovhg % n), enz)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overhang value (ovhg) represents a signed distance between watson and crick cut positions. Applying modulo operation ovhg % n can produce incorrect results for negative overhang values. For example, if ovhg = -4 and n = 11, then ovhg % n = 7, which changes the meaning of the overhang. The watson position should be modulo'd for circular sequences, but the overhang should remain as-is since it represents a relative offset, not an absolute position.

Copilot uses AI. Check for mistakes.
Comment on lines +71 to +77
def shifted(self, shift: int) -> "CircularBytes":
"""
Shift the sequence by the given number of bases.
"""
if shift % len(self) == 0:
return copy.deepcopy(self)
return CircularBytes(self[shift:] + self[:shift])
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CircularBytes.shifted method lacks proper docstring documentation. While there's a brief one-line description, it should include: parameter description (what 'shift' represents - positive/negative values, what happens with values larger than length), return value description, and usage examples. This is especially important since circular sequence shifting can be non-intuitive.

Copilot uses AI. Check for mistakes.
Comment on lines +884 to +898
def deduplicate(iterable):
"""Remove duplicates from an iterable while preserving order.

>>> deduplicate([3, 1, 2, 1, 3, 4])
[3, 1, 2, 4]
>>> deduplicate([(1, 2), (3, 4), (1, 2)])
[(1, 2), (3, 4)]
"""
seen = set()
result = []
for item in iterable:
if item not in seen:
seen.add(item)
result.append(item)
return result
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deduplicate function documentation shows it preserves order, but for large iterables with many duplicates, using a set for checking membership while building a list can have performance implications. Consider noting in the docstring that this function is O(n) time complexity with O(k) space complexity where k is the number of unique items, or documenting that it's intended for small to medium-sized iterables if performance isn't a concern.

Copilot uses AI. Check for mistakes.
Comment on lines +937 to +946
start_if_not_circular = "|^" if not circular else ""
end_if_not_circular = "|$" if not circular else ""

regex = (
f"(?P<watson>((?<=[{ss_letters_watson}])|^)"
f"(?P<watson>((?<=[{ss_letters_watson}]){start_if_not_circular})"
f"([{ds_letters}]{{1,{length}}})"
f"((?=[^{ss_letters_watson}{ds_letters}])|$))|"
f"(?P<crick>((?<=[{ss_letters_crick}])|^)"
f"((?=[^{ss_letters_watson}{ds_letters}]){end_if_not_circular}))|"
f"(?P<crick>((?<=[{ss_letters_crick}]){start_if_not_circular})"
f"([{ds_letters}]{{1,{length}}})"
f"((?=[^{ss_letters_crick}{ds_letters}])|$))"
f"((?=[^{ss_letters_crick}{ds_letters}]){end_if_not_circular}))"
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Parameters section is incomplete. While it documents the length parameter, it's missing documentation for the new circular parameter. Add a description for the circular parameter explaining that it indicates whether the sequence is circular (True) or linear (False), and how this affects the regex pattern (linear sequences use start/end anchors ^ and $, while circular sequences do not).

Copilot uses AI. Check for mistakes.
@codecov
Copy link

codecov bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 94.95798% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/pydna/dseq.py 93.25% 3 Missing and 3 partials ⚠️

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #565      +/-   ##
==========================================
+ Coverage   93.75%   93.86%   +0.11%     
==========================================
  Files          38       38              
  Lines        5424     5507      +83     
  Branches      763      786      +23     
==========================================
+ Hits         5085     5169      +84     
+ Misses        265      263       -2     
- Partials       74       75       +1     
Files with missing lines Coverage Δ
src/pydna/alphabet.py 99.01% <100.00%> (+<0.01%) ⬆️
src/pydna/cre_lox.py 100.00% <100.00%> (ø)
src/pydna/utils.py 89.13% <100.00%> (+0.60%) ⬆️
src/pydna/dseq.py 92.30% <93.25%> (+1.04%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@BjornFJohansson
Copy link
Collaborator

Nice work, but I did spot the case below:

from pydna.dseq import Dseq
seq = Dseq("AGEEGaGJJJg", circular=True)
seq.melt(3)

#       gAGAAGaG
# CtCGGGcTC

This should yield two separate ss fragments imho.

@manulera
Copy link
Collaborator Author

manulera commented Mar 5, 2026

Good point, it could be dealt with with recursion, even within the function. That will complicate things slightly for the history (when this would be called from Dseqrecord), but I think it can be solved by an optional argument recursive=False

from pydna.dseq import Dseq

seq = Dseq("AGEEGaGJJJg", circular=True)

seq2 = seq.melt(3)[0]
seq3 = seq2.melt(3)[0]

print(repr(seq))
print(repr(seq2))
print(repr(seq3))

@BjornFJohansson
Copy link
Collaborator

Good point, it could be dealt with with recursion, even within the function. That will complicate things slightly for the history (when this would be called from Dseqrecord), but I think it can be solved by an optional argument recursive=False

from pydna.dseq import Dseq

seq = Dseq("AGEEGaGJJJg", circular=True)

seq2 = seq.melt(3)[0]
seq3 = seq2.melt(3)[0]

print(repr(seq))
print(repr(seq2))
print(repr(seq3))

I think this is a good idea, the function could melt until the function returns ().

@manulera
Copy link
Collaborator Author

manulera commented Mar 9, 2026

I think this is ready to merge now @BjornFJohansson

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants