Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes critical bugs in DNA melting functions that were causing incorrect results when processing DNA sequences with single-stranded regions. The primary issue, as detailed in #487, was that the apply_cut function was being called with incorrect cutsite positions when dealing with melted DNA regions. The PR introduces a new shift_melt_cutsite_pairs method to properly adjust cutsite positions to account for single-stranded DNA regions, and adds comprehensive test coverage to ensure the fixes work correctly for both linear and circular sequences with various shift positions.
Changes:
- Fixed spacer offset calculations in
get_ss_meltsitesandget_ds_meltsites(changed from addition to subtraction) - Added
shift_melt_cutsite_pairsmethod to adjust cutsite positions for single-stranded DNA regions - Enhanced
apply_cutwithallow_overlapparameter to handle overlapping cuts in circular sequences
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| src/pydna/dseq.py | Added shift_melt_cutsite_pairs method, fixed get_ss_meltsites spacer bug, optimized Dseq.shifted(), added CircularBytes.shifted() and replace_with() methods, added allow_overlap parameter to apply_cut |
| src/pydna/alphabet.py | Updated regex_ds_melt_factory to accept circular parameter for correct pattern matching in linear vs circular sequences |
| src/pydna/utils.py | Added deduplicate utility function for removing duplicates while preserving order |
| src/pydna/cre_lox.py | Refactored to use new deduplicate utility function |
| tests/test_module_dseq.py | Added comprehensive tests for melting functions with circular sequences and various shift positions |
| tests/test_module_alphabet.py | Updated tests for regex_ds_melt_factory with new signature |
| .flake8 | Added C901 (complexity) to global ignore list |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/pydna/dseq.py
Outdated
| break | ||
| ovhg -= 1 | ||
| if is_circular: | ||
| left_cut = ((watson % n, ovhg % n), enz) |
There was a problem hiding this comment.
The overhang value (ovhg) represents a signed distance between watson and crick cut positions. Applying modulo operation ovhg % n on line 2884 and 2906 can produce incorrect results for negative overhang values. For example, if ovhg = -4 and n = 11, then ovhg % n = 7, which changes the meaning of the overhang. The watson position should be modulo'd for circular sequences, but the overhang should remain as-is since it represents a relative offset, not an absolute position.
| """Takes a list of cutsite pairs that will be applied to a sequence with parts with ssDNA, and shifts them | ||
| so that they only comprise the dsDNA part. |
There was a problem hiding this comment.
The shift_melt_cutsite_pairs method lacks proper docstring documentation. While there's a brief one-line description, it's missing important details such as: what the method does conceptually (shifts cutsite pairs to account for single-stranded DNA regions), parameter descriptions, return value description, and examples. Given that this is a complex function dealing with DNA sequence manipulation, comprehensive documentation would help future maintainers understand its purpose and usage.
| """Takes a list of cutsite pairs that will be applied to a sequence with parts with ssDNA, and shifts them | |
| so that they only comprise the dsDNA part. | |
| """Shift cutsite pairs so that overhangs do not fall in ssDNA regions. | |
| This method adjusts a list of cutsite pairs so that the resulting cuts | |
| only span double-stranded (ds) parts of the sequence. It is intended | |
| for use with ``Dseq`` objects that encode single-stranded (ss) regions | |
| using the DNA alphabet from :mod:`pydna.alphabet` (see | |
| :data:`ss_letters_watson` and :data:`ss_letters_crick`). | |
| Each element in ``cutsite_pairs`` is a ``(left_cut, right_cut)`` tuple | |
| representing the cuts delimiting a fragment. A single cut is either | |
| ``None`` or a ``((watson_index, ovhg), enzyme)`` tuple: | |
| * ``watson_index`` is the position (0-based) on the Watson strand | |
| where the restriction enzyme cuts. | |
| * ``ovhg`` is the overhang length following the convention used in | |
| :mod:`Bio.Restriction` (positive for 3' overhangs on the Watson | |
| strand, negative for 5' overhangs, and zero for blunt ends). | |
| * ``enzyme`` is the corresponding restriction enzyme object. | |
| For each cut, this method "slides" the cut site across contiguous | |
| ssDNA positions, expanding the overhang length where appropriate, so | |
| that the resulting overhang is anchored entirely within dsDNA. For | |
| circular sequences, indices are wrapped modulo the sequence length and | |
| duplicate cutsite pairs introduced by wrapping are removed. | |
| Parameters | |
| ---------- | |
| cutsite_pairs | |
| List of ``(left_cut, right_cut)`` tuples to adjust. Each element | |
| should match the structure returned by :meth:`get_cutsite_pairs`. | |
| Returns | |
| ------- | |
| List[Tuple[CutSiteType, CutSiteType]] | |
| A new list of cutsite pairs with positions and overhangs shifted so | |
| that overhang bases lie only within dsDNA regions. For circular | |
| sequences, equivalent duplicate pairs are removed. | |
| Notes | |
| ----- | |
| The underlying ds/ss annotation is inferred from the encoded sequence | |
| bytes: characters in :data:`ss_letters_watson` and | |
| :data:`ss_letters_crick` are treated as single-stranded on their | |
| respective strands; all other characters are considered part of | |
| double-stranded regions. | |
| Examples | |
| -------- | |
| >>> from pydna.dseq import Dseq | |
| >>> from Bio.Restriction import EcoRI | |
| >>> d = Dseq("GAATTCnnnnGAATTC") # toy example with a small ss region | |
| >>> rb = RestrictionBatch([EcoRI]) | |
| >>> cutsites = d.cut_with(rb)[EcoRI] # list of individual cutsites | |
| >>> pairs = d.get_cutsite_pairs(cutsites) | |
| >>> shifted_pairs = d.shift_melt_cutsite_pairs(pairs) | |
| >>> isinstance(shifted_pairs, list) | |
| True |
|
|
||
| def replace_with(self, start, end, replacement) -> "CircularBytes": | ||
| """ | ||
| Replace the subsequence between start and end with the replacement. |
There was a problem hiding this comment.
The CircularBytes.replace_with method lacks proper docstring documentation. The brief one-line description should be expanded to include: detailed parameter descriptions (start, end, replacement), return value description, behavior for wrapping cases (when end < start), and examples. The exception raised on line 182-184 should also be documented in a Raises section.
| Replace the subsequence between start and end with the replacement. | |
| Replace the subsequence between ``start`` and ``end`` with ``replacement``. | |
| The indices are interpreted on the circular sequence, and the slice from | |
| ``start`` (inclusive) to ``end`` (exclusive) is replaced. If ``end < start``, | |
| the subsequence is treated as wrapping around index 0, effectively using | |
| ``end + len(self)`` as the end index when computing the length of the | |
| region to replace. | |
| Parameters | |
| ---------- | |
| start : int | |
| Start index (inclusive) of the subsequence to be replaced, in the | |
| circular sequence. May be negative or larger than ``len(self)``, | |
| in which case the effective index is taken modulo ``len(self)``. | |
| end : int | |
| End index (exclusive) of the subsequence to be replaced, in the | |
| circular sequence. If ``end < start``, the region is interpreted as | |
| wrapping around the origin (i.e. covering ``start .. len(self)`` and | |
| then ``0 .. end``). | |
| replacement : bytes or bytearray or memoryview | |
| Bytes-like object used to replace the selected subsequence. Its | |
| length must exactly match the length of the subsequence being | |
| replaced. | |
| Returns | |
| ------- | |
| CircularBytes | |
| A new :class:`CircularBytes` instance where the specified subsequence | |
| has been replaced. The original sequence is not modified. | |
| Raises | |
| ------ | |
| ValueError | |
| If the length of ``replacement`` does not equal the length of the | |
| subsequence defined by ``start`` and ``end`` (after handling any | |
| wrapping when ``end < start``). | |
| Examples | |
| -------- | |
| >>> s = CircularBytes(b"ABCDE") | |
| >>> s.replace_with(1, 3, b"xy") | |
| CircularBytes(b'AxyDE') | |
| >>> s.replace_with(3, 1, b"xyz") # wrapping replacement of b'DEA' | |
| CircularBytes(b'zBCxy') |
| "Replacement length must match the length of the subsequence" | ||
| ) | ||
|
|
||
| shifted = CircularBytes(replacement + shifted[len(replacement) : len(self)]) |
There was a problem hiding this comment.
In the replace_with method, when reconstructing the CircularBytes after replacement, the slice uses shifted[len(replacement) : len(self)]. However, for consistency and correctness, this should use the calculated length end - start instead of len(replacement). While these should be equal due to the validation check on line 181-184, using end - start would be more explicit and maintainable. Consider changing to shifted[end - start : len(self)].
| shifted = CircularBytes(replacement + shifted[len(replacement) : len(self)]) | |
| shifted = CircularBytes(replacement + shifted[end - start : len(self)]) |
src/pydna/dseq.py
Outdated
| watson -= 1 | ||
| ovhg -= 1 | ||
| if is_circular: | ||
| right_cut = ((watson % n, ovhg % n), enz) |
There was a problem hiding this comment.
The overhang value (ovhg) represents a signed distance between watson and crick cut positions. Applying modulo operation ovhg % n can produce incorrect results for negative overhang values. For example, if ovhg = -4 and n = 11, then ovhg % n = 7, which changes the meaning of the overhang. The watson position should be modulo'd for circular sequences, but the overhang should remain as-is since it represents a relative offset, not an absolute position.
| def shifted(self, shift: int) -> "CircularBytes": | ||
| """ | ||
| Shift the sequence by the given number of bases. | ||
| """ | ||
| if shift % len(self) == 0: | ||
| return copy.deepcopy(self) | ||
| return CircularBytes(self[shift:] + self[:shift]) |
There was a problem hiding this comment.
The CircularBytes.shifted method lacks proper docstring documentation. While there's a brief one-line description, it should include: parameter description (what 'shift' represents - positive/negative values, what happens with values larger than length), return value description, and usage examples. This is especially important since circular sequence shifting can be non-intuitive.
| def deduplicate(iterable): | ||
| """Remove duplicates from an iterable while preserving order. | ||
|
|
||
| >>> deduplicate([3, 1, 2, 1, 3, 4]) | ||
| [3, 1, 2, 4] | ||
| >>> deduplicate([(1, 2), (3, 4), (1, 2)]) | ||
| [(1, 2), (3, 4)] | ||
| """ | ||
| seen = set() | ||
| result = [] | ||
| for item in iterable: | ||
| if item not in seen: | ||
| seen.add(item) | ||
| result.append(item) | ||
| return result |
There was a problem hiding this comment.
The deduplicate function documentation shows it preserves order, but for large iterables with many duplicates, using a set for checking membership while building a list can have performance implications. Consider noting in the docstring that this function is O(n) time complexity with O(k) space complexity where k is the number of unique items, or documenting that it's intended for small to medium-sized iterables if performance isn't a concern.
src/pydna/alphabet.py
Outdated
| start_if_not_circular = "|^" if not circular else "" | ||
| end_if_not_circular = "|$" if not circular else "" | ||
|
|
||
| regex = ( | ||
| f"(?P<watson>((?<=[{ss_letters_watson}])|^)" | ||
| f"(?P<watson>((?<=[{ss_letters_watson}]){start_if_not_circular})" | ||
| f"([{ds_letters}]{{1,{length}}})" | ||
| f"((?=[^{ss_letters_watson}{ds_letters}])|$))|" | ||
| f"(?P<crick>((?<=[{ss_letters_crick}])|^)" | ||
| f"((?=[^{ss_letters_watson}{ds_letters}]){end_if_not_circular}))|" | ||
| f"(?P<crick>((?<=[{ss_letters_crick}]){start_if_not_circular})" | ||
| f"([{ds_letters}]{{1,{length}}})" | ||
| f"((?=[^{ss_letters_crick}{ds_letters}])|$))" | ||
| f"((?=[^{ss_letters_crick}{ds_letters}]){end_if_not_circular}))" |
There was a problem hiding this comment.
The Parameters section is incomplete. While it documents the length parameter, it's missing documentation for the new circular parameter. Add a description for the circular parameter explaining that it indicates whether the sequence is circular (True) or linear (False), and how this affects the regex pattern (linear sequences use start/end anchors ^ and $, while circular sequences do not).
Codecov Report❌ Patch coverage is
@@ Coverage Diff @@
## master #565 +/- ##
==========================================
+ Coverage 93.75% 93.86% +0.11%
==========================================
Files 38 38
Lines 5424 5507 +83
Branches 763 786 +23
==========================================
+ Hits 5085 5169 +84
+ Misses 265 263 -2
- Partials 74 75 +1
🚀 New features to boost your workflow:
|
|
Nice work, but I did spot the case below: This should yield two separate ss fragments imho. |
|
Good point, it could be dealt with with recursion, even within the function. That will complicate things slightly for the history (when this would be called from Dseqrecord), but I think it can be solved by an optional argument |
I think this is a good idea, the function could melt until the function returns (). |
|
I think this is ready to merge now @BjornFJohansson |
Addresses #487 and fixes several errors in melting functions