Fix melting functions and add shift_melt_cutsite_pairs by manulera · Pull Request #565 · pydna-group/pydna

manulera · 2026-02-27T10:18:42Z

Addresses #487 and fixes several errors in melting functions

Copilot

Pull request overview

This PR fixes critical bugs in DNA melting functions that were causing incorrect results when processing DNA sequences with single-stranded regions. The primary issue, as detailed in #487, was that the apply_cut function was being called with incorrect cutsite positions when dealing with melted DNA regions. The PR introduces a new shift_melt_cutsite_pairs method to properly adjust cutsite positions to account for single-stranded DNA regions, and adds comprehensive test coverage to ensure the fixes work correctly for both linear and circular sequences with various shift positions.

Changes:

* Fixed spacer offset calculations in get_ss_meltsites and get_ds_meltsites (changed from addition to subtraction)
* Added shift_melt_cutsite_pairs method to adjust cutsite positions for single-stranded DNA regions
* Enhanced apply_cut with allow_overlap parameter to handle overlapping cuts in circular sequences

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| src/pydna/dseq.py | Added `shift_melt_cutsite_pairs` method, fixed `get_ss_meltsites` spacer bug, optimized `Dseq.shifted()`, added `CircularBytes.shifted()` and `replace_with()` methods, added `allow_overlap` parameter to `apply_cut` |
| src/pydna/alphabet.py | Updated `regex_ds_melt_factory` to accept `circular` parameter for correct pattern matching in linear vs circular sequences |
| src/pydna/utils.py | Added `deduplicate` utility function for removing duplicates while preserving order |
| src/pydna/cre_lox.py | Refactored to use new `deduplicate` utility function |
| tests/test_module_dseq.py | Added comprehensive tests for melting functions with circular sequences and various shift positions |
| tests/test_module_alphabet.py | Updated tests for `regex_ds_melt_factory` with new signature |
| .flake8 | Added C901 (complexity) to global ignore list |

Copilot · 2026-02-27T10:30:19Z

src/pydna/dseq.py

+                            break
+                        ovhg -= 1
+                if is_circular:
+                    left_cut = ((watson % n, ovhg % n), enz)


The overhang value (ovhg) represents a signed distance between the watson and crick cut positions. Applying the modulo operation ovhg % n on lines 2884 and 2906 can produce incorrect results for negative overhang values. For example, if ovhg = -4 and n = 11, then ovhg % n = 7, which changes the meaning of the overhang. The watson position should be taken modulo n for circular sequences, but the overhang should remain as-is since it represents a relative offset, not an absolute position.
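The arithmetic in this comment can be checked directly. The sketch below is only an illustration: `wrap_cut` is a hypothetical helper, not a pydna function, and it assumes a cut is represented as a `(watson, ovhg)` pair on a circular sequence of length `n`.

```python
def wrap_cut(watson: int, ovhg: int, n: int) -> tuple[int, int]:
    """Wrap only the absolute position; keep the signed overhang untouched.

    Hypothetical helper for illustration, not part of pydna.
    """
    return (watson % n, ovhg)


n = 11
watson, ovhg = 13, -4

# Python's % returns a non-negative result for a positive modulus,
# so a negative overhang silently becomes a positive one.
naive = (watson % n, ovhg % n)
careful = wrap_cut(watson, ovhg, n)

print(naive)    # (2, 7)
print(careful)  # (2, -4)
```

The position wraps the same way in both cases; only the treatment of the signed offset differs.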

.flake8

Copilot · 2026-02-27T10:30:19Z

src/pydna/dseq.py

+        """Takes a list of cutsite pairs that will be applied to a sequence with parts with ssDNA, and shifts them
+        so that they only comprise the dsDNA part.


The shift_melt_cutsite_pairs method lacks proper docstring documentation. While there's a brief one-line description, it's missing important details such as: what the method does conceptually (shifts cutsite pairs to account for single-stranded DNA regions), parameter descriptions, return value description, and examples. Given that this is a complex function dealing with DNA sequence manipulation, comprehensive documentation would help future maintainers understand its purpose and usage.

Suggested change

-        """Takes a list of cutsite pairs that will be applied to a sequence with parts with ssDNA, and shifts them
-        so that they only comprise the dsDNA part.
+        """Shift cutsite pairs so that overhangs do not fall in ssDNA regions.
+
+        This method adjusts a list of cutsite pairs so that the resulting cuts
+        only span double-stranded (ds) parts of the sequence. It is intended
+        for use with ``Dseq`` objects that encode single-stranded (ss) regions
+        using the DNA alphabet from :mod:`pydna.alphabet` (see
+        :data:`ss_letters_watson` and :data:`ss_letters_crick`).
+
+        Each element in ``cutsite_pairs`` is a ``(left_cut, right_cut)`` tuple
+        representing the cuts delimiting a fragment. A single cut is either
+        ``None`` or a ``((watson_index, ovhg), enzyme)`` tuple:
+
+        * ``watson_index`` is the position (0-based) on the Watson strand
+          where the restriction enzyme cuts.
+        * ``ovhg`` is the overhang length following the convention used in
+          :mod:`Bio.Restriction` (positive for 3' overhangs on the Watson
+          strand, negative for 5' overhangs, and zero for blunt ends).
+        * ``enzyme`` is the corresponding restriction enzyme object.
+
+        For each cut, this method "slides" the cut site across contiguous
+        ssDNA positions, expanding the overhang length where appropriate, so
+        that the resulting overhang is anchored entirely within dsDNA. For
+        circular sequences, indices are wrapped modulo the sequence length and
+        duplicate cutsite pairs introduced by wrapping are removed.
+
+        Parameters
+        ----------
+        cutsite_pairs
+            List of ``(left_cut, right_cut)`` tuples to adjust. Each element
+            should match the structure returned by :meth:`get_cutsite_pairs`.
+
+        Returns
+        -------
+        List[Tuple[CutSiteType, CutSiteType]]
+            A new list of cutsite pairs with positions and overhangs shifted so
+            that overhang bases lie only within dsDNA regions. For circular
+            sequences, equivalent duplicate pairs are removed.
+
+        Notes
+        -----
+        The underlying ds/ss annotation is inferred from the encoded sequence
+        bytes: characters in :data:`ss_letters_watson` and
+        :data:`ss_letters_crick` are treated as single-stranded on their
+        respective strands; all other characters are considered part of
+        double-stranded regions.
+
+        Examples
+        --------
+        >>> from pydna.dseq import Dseq
+        >>> from Bio.Restriction import EcoRI, RestrictionBatch
+        >>> d = Dseq("GAATTCnnnnGAATTC")  # toy example with a small ss region
+        >>> rb = RestrictionBatch([EcoRI])
+        >>> cutsites = d.cut_with(rb)[EcoRI]  # list of individual cutsites
+        >>> pairs = d.get_cutsite_pairs(cutsites)
+        >>> shifted_pairs = d.shift_melt_cutsite_pairs(pairs)
+        >>> isinstance(shifted_pairs, list)
+        True

Copilot · 2026-02-27T10:30:20Z

src/pydna/dseq.py


+    def replace_with(self, start, end, replacement) -> "CircularBytes":
+        """
+        Replace the subsequence between start and end with the replacement.


The CircularBytes.replace_with method lacks proper docstring documentation. The brief one-line description should be expanded to include: detailed parameter descriptions (start, end, replacement), return value description, behavior for wrapping cases (when end < start), and examples. The exception raised on lines 182-184 should also be documented in a Raises section.

Suggested change

-        Replace the subsequence between start and end with the replacement.
+        Replace the subsequence between ``start`` and ``end`` with ``replacement``.
+
+        The indices are interpreted on the circular sequence, and the slice from
+        ``start`` (inclusive) to ``end`` (exclusive) is replaced. If ``end < start``,
+        the subsequence is treated as wrapping around index 0, effectively using
+        ``end + len(self)`` as the end index when computing the length of the
+        region to replace.
+
+        Parameters
+        ----------
+        start : int
+            Start index (inclusive) of the subsequence to be replaced, in the
+            circular sequence. May be negative or larger than ``len(self)``,
+            in which case the effective index is taken modulo ``len(self)``.
+        end : int
+            End index (exclusive) of the subsequence to be replaced, in the
+            circular sequence. If ``end < start``, the region is interpreted as
+            wrapping around the origin (i.e. covering ``start .. len(self)`` and
+            then ``0 .. end``).
+        replacement : bytes or bytearray or memoryview
+            Bytes-like object used to replace the selected subsequence. Its
+            length must exactly match the length of the subsequence being
+            replaced.
+
+        Returns
+        -------
+        CircularBytes
+            A new :class:`CircularBytes` instance where the specified subsequence
+            has been replaced. The original sequence is not modified.
+
+        Raises
+        ------
+        ValueError
+            If the length of ``replacement`` does not equal the length of the
+            subsequence defined by ``start`` and ``end`` (after handling any
+            wrapping when ``end < start``).
+
+        Examples
+        --------
+        >>> s = CircularBytes(b"ABCDE")
+        >>> s.replace_with(1, 3, b"xy")
+        CircularBytes(b'AxyDE')
+        >>> s.replace_with(3, 1, b"xyz")  # wrapping replacement of b'DEA'
+        CircularBytes(b'zBCxy')
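The wrapping semantics described in this suggested docstring can be sketched on plain `bytes`. This is not the `CircularBytes` implementation from the PR: `circular_replace` is a hypothetical stand-alone function, and it assumes the region length is computed as `(end - start) % n`, so that `end < start` wraps around index 0.

```python
def circular_replace(data: bytes, start: int, end: int, replacement: bytes) -> bytes:
    """Replace data[start:end] on a circular sequence (illustrative sketch only)."""
    n = len(data)
    if n == 0:
        raise ValueError("Cannot replace within an empty sequence")
    start %= n
    end %= n
    # Modular length: stays non-negative when the region wraps around index 0.
    # Assumption: start == end (after wrapping) denotes an empty region here.
    length = (end - start) % n
    if len(replacement) != length:
        raise ValueError("Replacement length must match the length of the subsequence")
    # Rotate so the region begins at index 0, patch it, then rotate back.
    rotated = data[start:] + data[:start]
    patched = replacement + rotated[length:]
    back = (n - start) % n
    return patched[back:] + patched[:back]


print(circular_replace(b"ABCDE", 1, 3, b"xy"))   # b'AxyDE'
print(circular_replace(b"ABCDE", 3, 1, b"xyz"))  # b'zBCxy'
```

Both calls reproduce the outputs given in the docstring examples. Note that a plain `end - start` would be negative for the second call, which is why the sketch uses the modular form.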

Copilot · 2026-02-27T10:30:20Z

src/pydna/dseq.py

+                "Replacement length must match the length of the subsequence"
+            )
+
+        shifted = CircularBytes(replacement + shifted[len(replacement) : len(self)])


In the replace_with method, when reconstructing the CircularBytes after replacement, the slice uses shifted[len(replacement) : len(self)]. However, for consistency and correctness, this should use the calculated length end - start instead of len(replacement). While these should be equal due to the validation check on lines 181-184, using end - start would be more explicit and maintainable. Consider changing to shifted[end - start : len(self)].

Suggested change

-        shifted = CircularBytes(replacement + shifted[len(replacement) : len(self)])
+        shifted = CircularBytes(replacement + shifted[end - start : len(self)])

Copilot · 2026-02-27T10:30:20Z

src/pydna/dseq.py

+                        watson -= 1
+                        ovhg -= 1
+                if is_circular:
+                    right_cut = ((watson % n, ovhg % n), enz)


The overhang value (ovhg) represents a signed distance between the watson and crick cut positions. Applying the modulo operation ovhg % n can produce incorrect results for negative overhang values. For example, if ovhg = -4 and n = 11, then ovhg % n = 7, which changes the meaning of the overhang. The watson position should be taken modulo n for circular sequences, but the overhang should remain as-is since it represents a relative offset, not an absolute position.

Copilot · 2026-02-27T10:30:21Z

src/pydna/dseq.py

+    def shifted(self, shift: int) -> "CircularBytes":
+        """
+        Shift the sequence by the given number of bases.
+        """
+        if shift % len(self) == 0:
+            return copy.deepcopy(self)
+        return CircularBytes(self[shift:] + self[:shift])


The CircularBytes.shifted method lacks proper docstring documentation. While there's a brief one-line description, it should include: parameter description (what 'shift' represents - positive/negative values, what happens with values larger than length), return value description, and usage examples. This is especially important since circular sequence shifting can be non-intuitive.
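The kind of examples such a docstring could include might look like the sketch below. It works on plain `bytes` rather than `CircularBytes`, `rotate_left` is a hypothetical name, and normalising the shift with `% n` is an assumption about the intended behaviour for negative or oversized values.

```python
def rotate_left(data: bytes, shift: int) -> bytes:
    """Return data rotated so that index `shift` becomes the new index 0.

    Illustrative sketch on plain bytes, not the CircularBytes.shifted method.
    """
    n = len(data)
    if n == 0:
        return data
    k = shift % n  # maps negative and oversized shifts into range(n)
    return data[k:] + data[:k]


print(rotate_left(b"ABCDE", 2))   # b'CDEAB'
print(rotate_left(b"ABCDE", -1))  # b'EABCD'
print(rotate_left(b"ABCDE", 7))   # b'CDEAB' (same as a shift of 2)
print(rotate_left(b"ABCDE", 5))   # b'ABCDE' (full turn)
```

Without the modulo step, slicing plain `bytes` with a shift larger than the length would return the input unchanged, which is one of the non-intuitive cases worth documenting.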

Copilot · 2026-02-27T10:30:21Z

src/pydna/utils.py

+def deduplicate(iterable):
+    """Remove duplicates from an iterable while preserving order.
+
+    >>> deduplicate([3, 1, 2, 1, 3, 4])
+    [3, 1, 2, 4]
+    >>> deduplicate([(1, 2), (3, 4), (1, 2)])
+    [(1, 2), (3, 4)]
+    """
+    seen = set()
+    result = []
+    for item in iterable:
+        if item not in seen:
+            seen.add(item)
+            result.append(item)
+    return result


The deduplicate function documentation shows it preserves order, but for large iterables with many duplicates, using a set for checking membership while building a list can have performance implications. Consider noting in the docstring that this function is O(n) time complexity with O(k) space complexity where k is the number of unique items, or documenting that it's intended for small to medium-sized iterables if performance isn't a concern.
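For comparison, the same order-preserving behaviour can be obtained from the standard library. The sketch below is an alternative illustration, not the implementation in the PR; `deduplicate_with_dict` is a hypothetical name. It relies on `dict` preserving insertion order (guaranteed since Python 3.7) and, like the set-based version, requires hashable items.

```python
def deduplicate_with_dict(iterable):
    """Remove duplicates while preserving first-seen order.

    One pass over the input; memory grows with the number of unique items.
    """
    return list(dict.fromkeys(iterable))


print(deduplicate_with_dict([3, 1, 2, 1, 3, 4]))          # [3, 1, 2, 4]
print(deduplicate_with_dict([(1, 2), (3, 4), (1, 2)]))    # [(1, 2), (3, 4)]
```

Either variant fails with `TypeError` on unhashable items such as lists, which may be worth stating in the docstring alongside the complexity note.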

tests/test_module_dseq.py

Copilot · 2026-02-27T10:30:21Z

src/pydna/alphabet.py

+    start_if_not_circular = "|^" if not circular else ""
+    end_if_not_circular = "|$" if not circular else ""
+
    regex = (
-        f"(?P<watson>((?<=[{ss_letters_watson}])|^)"
+        f"(?P<watson>((?<=[{ss_letters_watson}]){start_if_not_circular})"
        f"([{ds_letters}]{{1,{length}}})"
-        f"((?=[^{ss_letters_watson}{ds_letters}])|$))|"
-        f"(?P<crick>((?<=[{ss_letters_crick}])|^)"
+        f"((?=[^{ss_letters_watson}{ds_letters}]){end_if_not_circular}))|"
+        f"(?P<crick>((?<=[{ss_letters_crick}]){start_if_not_circular})"
        f"([{ds_letters}]{{1,{length}}})"
-        f"((?=[^{ss_letters_crick}{ds_letters}])|$))"
+        f"((?=[^{ss_letters_crick}{ds_letters}]){end_if_not_circular}))"


The Parameters section is incomplete. While it documents the length parameter, it's missing documentation for the new circular parameter. Add a description for the circular parameter explaining that it indicates whether the sequence is circular (True) or linear (False), and how this affects the regex pattern (linear sequences use start/end anchors ^ and $, while circular sequences do not).
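The effect of dropping the `^` and `$` alternatives can be seen with a simplified toy pattern. This sketch does not use the letter sets from `pydna.alphabet`: the flank markers `w` and `c`, the `GATC` run, and the function names are all assumptions made for illustration, and only the way the anchors are switched on and off mirrors the diff above.

```python
import re


def toy_melt_regex(length: int, circular: bool):
    """Build a toy pattern: a run of 1..length letters from GATC that is
    preceded by 'w' and followed by 'c'. For linear input, the start and
    end of the string are also accepted as flanks."""
    start_if_not_circular = "|^" if not circular else ""
    end_if_not_circular = "|$" if not circular else ""
    return re.compile(
        f"((?<=w){start_if_not_circular})"
        f"([GATC]{{1,{length}}})"
        f"((?=c){end_if_not_circular})"
    )


def toy_matches(seq: str, length: int, circular: bool) -> list[str]:
    # Group 2 holds the matched run; groups 1 and 3 are zero-width flanks.
    return [m.group(2) for m in toy_melt_regex(length, circular).finditer(seq)]


seq = "GAcwTTcwAA"
print(toy_matches(seq, 3, circular=False))  # ['GA', 'TT', 'AA']
print(toy_matches(seq, 3, circular=True))   # ['TT']
```

With the anchors enabled, the runs touching either end of the string count as matches; without them, only the run flanked on both sides inside the string is reported.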

codecov · 2026-02-27T10:38:01Z

Codecov Report

❌ Patch coverage is 94.95798% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/pydna/dseq.py | 93.25% | 3 Missing and 3 partials ⚠️ |

@@            Coverage Diff             @@
##           master     #565      +/-   ##
==========================================
+ Coverage   93.75%   93.86%   +0.11%     
==========================================
  Files          38       38              
  Lines        5424     5507      +83     
  Branches      763      786      +23     
==========================================
+ Hits         5085     5169      +84     
+ Misses        265      263       -2     
- Partials       74       75       +1

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| src/pydna/alphabet.py | `99.01% <100.00%> (+<0.01%)` | ⬆️ |
| src/pydna/cre_lox.py | `100.00% <100.00%> (ø)` | |
| src/pydna/utils.py | `89.13% <100.00%> (+0.60%)` | ⬆️ |
| src/pydna/dseq.py | `92.30% <93.25%> (+1.04%)` | ⬆️ |

BjornFJohansson · 2026-03-04T19:02:36Z

Nice work, but I did spot the case below:

from pydna.dseq import Dseq
seq = Dseq("AGEEGaGJJJg", circular=True)
seq.melt(3)

#       gAGAAGaG
# CtCGGGcTC

This should yield two separate ss fragments imho.

manulera · 2026-03-05T09:29:12Z

Good point, it could be dealt with with recursion, even within the function. That will complicate things slightly for the history (when this would be called from Dseqrecord), but I think it can be solved by an optional argument recursive=False

from pydna.dseq import Dseq

seq = Dseq("AGEEGaGJJJg", circular=True)

seq2 = seq.melt(3)[0]
seq3 = seq2.melt(3)[0]

print(repr(seq))
print(repr(seq2))
print(repr(seq3))

BjornFJohansson · 2026-03-05T14:15:51Z

> Good point, it could be dealt with with recursion, even within the function. That will complicate things slightly for the history (when this would be called from Dseqrecord), but I think it can be solved by an optional argument recursive=False
>
> from pydna.dseq import Dseq
>
> seq = Dseq("AGEEGaGJJJg", circular=True)
>
> seq2 = seq.melt(3)[0]
> seq3 = seq2.melt(3)[0]
>
> print(repr(seq))
> print(repr(seq2))
> print(repr(seq3))

I think this is a good idea, the function could melt until the function returns ().
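The "melt until the function returns ()" idea can be sketched independently of pydna. Everything below is an assumption for illustration: `melt_fully` is a hypothetical helper, `melt_once` stands in for a single call such as `seq.melt(3)`, and the toy function splits a string instead of melting DNA. The only behaviour taken from the discussion is that a melting step returns an empty tuple when nothing more can be melted.

```python
def melt_fully(seq, melt_once):
    """Apply melt_once repeatedly until every product is stable.

    melt_once(seq) is assumed to return a tuple of products,
    or () when seq cannot be melted any further.
    """
    products = melt_once(seq)
    if not products:
        return (seq,)
    stable = []
    for product in products:
        stable.extend(melt_fully(product, melt_once))
    return tuple(stable)


def toy_melt_once(s: str):
    # Toy stand-in: split at the first "|" if there is one, else report ().
    return tuple(s.split("|", 1)) if "|" in s else ()


print(melt_fully("a|b|c", toy_melt_once))  # ('a', 'b', 'c')
print(melt_fully("abc", toy_melt_once))    # ('abc',)
```

An optional `recursive=False` argument, as proposed above, could gate this loop so that callers who need the intermediate products for the history still get a single step.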

manulera · 2026-03-09T00:22:43Z

I think this is ready to merge now @BjornFJohansson

manulera requested review from BjornFJohansson and Copilot February 27, 2026 10:18

Copilot started reviewing on behalf of manulera February 27, 2026 10:20

Copilot AI reviewed Feb 27, 2026

manulera added 10 commits March 5, 2026 16:08

fix test to support biopython 1.86

06a2fc5

working towards fixing #487

8d5e8c9

WIP

0e493fa

modify shifted to handle ssDNA

47c3604

fix get_ds_meltsites to work for circular seqs

987046c

closer to working in circular molecules

01e73f1

working for some circular sequences, not all

844bff8

getting there, but melt_ss_dna needs fixing for melt to work

9f27949

closes #487

a137558

fix docstring

d9c38df

manulera force-pushed the issue_487 branch from 246f47d to d9c38df Compare March 5, 2026 16:26

fix flake8 suggestions

b7e4025

manulera mentioned this pull request Mar 8, 2026

Difference between get_ss_meltsites and get_ds_meltsites #568

Closed

manulera added 3 commits March 8, 2026 22:19

simplify algorithms and improve documentation

91c3285

handle case where circular sequence is turned into two ssDNAs

3d38054

fix cuts_overlap

277d5e4