Improve delimiter chunking performance and governor checks for Python 3.13 CI #335
Conversation
Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/4434677c-67a3-47ee-82d6-dd1290f8b94c Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Reviewer's Guide

Threads an optional resource governor into delimiter matching and replaces per-keyword nesting recomputation with a single-pass precomputation to make keyword delimiter matching linear-time and more timeout-safe.

Sequence diagram for delimiter matching with resource governor and nesting precomputation

```mermaid
sequenceDiagram
    actor Governor
    participant Caller
    participant DelimiterChunker
    participant KeywordMatcher as _match_keyword_delimiters
    participant NestingPrecomp as _precompute_nesting_levels
    Caller->>DelimiterChunker: _get_matches_with_fallback(content, governor)
    DelimiterChunker->>Governor: check_timeout()
    DelimiterChunker->>DelimiterChunker: _find_delimiter_matches(content, governor)
    rect rgb(235, 245, 255)
        DelimiterChunker->>DelimiterChunker: _match_explicit_delimiters(content)
    end
    alt governor provided
        DelimiterChunker->>Governor: check_timeout()
    end
    rect rgb(235, 245, 255)
        DelimiterChunker->>KeywordMatcher: _match_keyword_delimiters(content, keyword_delimiters)
        KeywordMatcher->>NestingPrecomp: _precompute_nesting_levels(content, keyword_positions)
        NestingPrecomp-->>KeywordMatcher: nesting_at
        loop for each keyword match
            KeywordMatcher->>KeywordMatcher: lookup nesting_at[keyword_pos]
            KeywordMatcher->>KeywordMatcher: build DelimiterMatch using struct_end and nesting_level
        end
    end
    DelimiterChunker-->>Caller: matches
    opt no matches
        DelimiterChunker->>DelimiterChunker: _fallback_paragraph_chunking(content)
        DelimiterChunker-->>Caller: fallback matches
    end
```
Updated class diagram for delimiter keyword matching and nesting helpers

```mermaid
classDiagram
    class DelimiterChunker {
        +_get_matches_with_fallback(content: str, governor: Any) list~DelimiterMatch~
        +_enforce_chunk_limit(chunks: list~CodeChunk~, file_path: Path | None) void
        +_find_delimiter_matches(content: str, governor: Any | None) list~DelimiterMatch~
        +_match_keyword_delimiters(content: str, keyword_delimiters: list~Delimiter~) list~DelimiterMatch~
        +_precompute_nesting_levels(content: str, positions: list~int~) dict~int, int~
        +_calculate_nesting_level(content: str, pos: int) int
        +_toggle_string(c: str, in_string: bool, string_char: str | None) tuple~bool, str | None~
        +_skip_nesting_comment(content: str, i: int, c: str, content_len: int) int | None
        +_adjust_brace_depth(c: str, depth: int) int
    }
    class ResourceGovernor {
        +check_timeout() void
    }
    class DelimiterMatch
    class CodeChunk
    class Delimiter
    class Path
    DelimiterChunker ..> ResourceGovernor : uses
    DelimiterChunker ..> DelimiterMatch : creates
    DelimiterChunker ..> CodeChunk : manages
    DelimiterChunker ..> Delimiter : matches
    DelimiterChunker ..> Path : reads_path
```
🤖 Hi @Copilot, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

🤖 I'm sorry @Copilot, but I was unable to process your request. Please see the logs for more details.
Hey - I've found 1 issue, and left some high level feedback:

- The new string-tracking logic in `_precompute_nesting_levels` only checks `content[i - 1] != "\\"` to detect escaped quotes, which will mis-handle cases like `\\"` (even number of backslashes); consider using a parity check on consecutive backslashes so you don't incorrectly toggle `in_string` inside escaped quotes.
- You call `combined_pattern.finditer(content)` twice in `_match_keyword_delimiters` (once to build `keyword_positions` and once in the main loop); consider collecting the matches in a list so you can both precompute positions and iterate once, avoiding a second regex pass over the content.
- The new `governor` parameter on `_find_delimiter_matches` is typed as `Any | None`; if the governor interface is stable, tightening this to a protocol or concrete type with `check_timeout()` would make the contract clearer and prevent accidental misuse.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new string-tracking logic in `_precompute_nesting_levels` only checks `content[i - 1] != "\\"` to detect escaped quotes, which will mis-handle cases like `\\"` (even number of backslashes); consider using a parity check on consecutive backslashes so you don't incorrectly toggle `in_string` inside escaped quotes.
- You call `combined_pattern.finditer(content)` twice in `_match_keyword_delimiters` (once to build `keyword_positions` and once in the main loop); consider collecting the matches in a list so you can both precompute positions and iterate once, avoiding a second regex pass over the content.
- The new `governor` parameter on `_find_delimiter_matches` is typed as `Any | None`; if the governor interface is stable, tightening this to a protocol or concrete type with `check_timeout()` would make the contract clearer and prevent accidental misuse.
## Individual Comments
### Comment 1
<location path="src/codeweaver/engine/chunker/delimiter.py" line_range="485-489" />
<code_context>
for delimiter in keyword_delimiters:
delimiter_map.setdefault(delimiter.start, []).append(delimiter)
+ # Precompute brace-nesting levels at all keyword positions in a single
+ # O(n) forward pass. The previous approach called _calculate_nesting_level
+ # per keyword match, each scanning from position 0, resulting in O(n * m)
+ # total work that caused timeouts on large files (especially Python 3.13).
+ keyword_positions = [m.start() for m in combined_pattern.finditer(content)]
+ nesting_at = self._precompute_nesting_levels(content, keyword_positions)
+
</code_context>
<issue_to_address>
**suggestion (performance):** Avoid running the same regex twice by caching matches before computing positions and iterating.
`combined_pattern.finditer(content)` is currently called once to build `keyword_positions` and again in the main loop, doubling regex work on large files. Consider calling `finditer` once, storing the matches in a list, deriving `keyword_positions` from that list, and then reusing the same list when constructing `DelimiterMatch` instances to avoid the extra pass.
Suggested implementation:
```python
# Precompute brace-nesting levels at all keyword positions in a single
# O(n) forward pass. The previous approach called _calculate_nesting_level
# per keyword match, each scanning from position 0, resulting in O(n * m)
# total work that caused timeouts on large files (especially Python 3.13).
# Materialize matches once so we can reuse them both for position
# computation and when constructing DelimiterMatch instances.
keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]
nesting_at = self._precompute_nesting_levels(content, keyword_positions)
```
```python
for match in keyword_matches:
```
The second SEARCH block assumes the main keyword-matching loop literally uses `for match in combined_pattern.finditer(content):`. If the surrounding code differs (e.g. assigns the iterator to a variable first or uses a comprehension), adjust that SEARCH pattern to replace whatever second invocation of `combined_pattern.finditer(content)` is used to iterate over keyword matches with `keyword_matches` instead. Ensure that:
1. The declaration of `keyword_matches = list(combined_pattern.finditer(content))` is in the same scope and precedes any use of `keyword_matches`.
2. No other code path relies on `combined_pattern.finditer(content)` being a fresh iterator (in typical usage it does not, so replacing with the precomputed list is safe).
</issue_to_address>
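The suggested fix is easy to sanity-check in isolation: materializing the iterator once yields the same positions the main loop would see from a second pass. This is a generic `re` sketch, not the project's actual pattern; `combined_pattern` and `content` below are stand-ins.

```python
import re

# Hypothetical keyword pattern standing in for the chunker's combined_pattern
combined_pattern = re.compile(r"\b(?:function|class|def)\b")
content = "function foo() { } class Bar { } def baz(): pass"

# Single regex pass: materialize once, reuse for both needs
keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]

# Identical to what a second finditer() pass would produce, without rescanning
assert keyword_positions == [m.start() for m in combined_pattern.finditer(content)]
```

Since `re.Match` objects are plain values, holding them in a list is safe; nothing in the loop depends on the iterator being lazy.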
Code Review for PR #335
I've reviewed the performance optimization changes for delimiter-based chunking. This PR addresses Python 3.13 CI timeout issues by optimizing the keyword delimiter matching algorithm from O(n*m) to O(n+m) complexity. Overall, the changes are well-structured and achieve their goal, but there are several important improvements needed.
High Priority Issues
1. Escaped Quote Handling Bug ⚠️
Location: src/codeweaver/engine/chunker/delimiter.py:575
The escape detection logic in `_precompute_nesting_levels` is incorrect:

```python
if c in ('"', "'", "`") and (i == 0 or content[i - 1] != "\\"):
```

**Problem:** This fails for cases like `\\"` (even number of backslashes). The string `"test\\"` would incorrectly toggle the string state because `content[i-1]` is `\`, but this is actually an escaped backslash, not an escape for the quote.
**Fix:** Count consecutive backslashes to determine if the quote is truly escaped:

```python
def _is_quote_escaped(content: str, pos: int) -> bool:
    """Check if quote at position is escaped by counting backslashes."""
    if pos == 0:
        return False
    # Count consecutive backslashes before quote
    backslash_count = 0
    check_pos = pos - 1
    while check_pos >= 0 and content[check_pos] == "\\":
        backslash_count += 1
        check_pos -= 1
    # Odd number of backslashes means the quote is escaped
    return backslash_count % 2 == 1
```

Then use:

```python
if c in ('"', "'", "`") and not self._is_quote_escaped(content, i):
```

This same bug exists in other methods (`_calculate_nesting_level:647`, `_update_string_state:754`, `_process_string_in_matching:927`). All should be fixed consistently.
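As a quick sanity check, the parity rule can be exercised standalone. The `is_quote_escaped` below mirrors the proposed helper as a free function for illustration; it is not the project's code.

```python
def is_quote_escaped(content: str, pos: int) -> bool:
    """Quote at pos is escaped iff preceded by an odd number of backslashes."""
    backslash_count = 0
    check_pos = pos - 1
    while check_pos >= 0 and content[check_pos] == "\\":
        backslash_count += 1
        check_pos -= 1
    return backslash_count % 2 == 1

# 'a\"'  -> one backslash before the quote: the quote IS escaped
assert is_quote_escaped('a\\"', 2) is True
# 'a\\"' -> two backslashes (an escaped backslash): the quote is NOT escaped
assert is_quote_escaped('a\\\\"', 3) is False
# A quote at position 0 can never be escaped
assert is_quote_escaped('"', 0) is False
```

The naive `content[i - 1] != "\\"` check returns the wrong answer for the middle case, which is exactly the `"test\\"` scenario described above.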
2. Duplicate Regex Execution 🔄
Location: src/codeweaver/engine/chunker/delimiter.py:489-492
The code calls `combined_pattern.finditer(content)` twice:

- Line 489: To extract positions for `_precompute_nesting_levels`
- Line 492: To iterate and create matches

**Impact:** On large files, regex matching can be expensive, effectively doubling the regex work.

**Fix:** Materialize matches once:

```python
# Materialize matches once to avoid duplicate regex work
keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]
nesting_at = self._precompute_nesting_levels(content, keyword_positions)

for match in keyword_matches:  # Use materialized matches
    matched_text = match.group(0)
    # ... rest of loop
```

3. Type Safety: Governor Parameter 📝
Location: src/codeweaver/engine/chunker/delimiter.py:344
The governor parameter is typed as Any | None, which defeats type checking benefits.
**Current:**

```python
def _find_delimiter_matches(
    self, content: str, *, governor: Any | None = None
) -> list[DelimiterMatch]:
```

**Recommended:** Create a Protocol for the governor interface:

```python
from typing import Protocol

class TimeoutCheckable(Protocol):
    """Protocol for objects that support timeout checking."""

    def check_timeout(self) -> None: ...

def _find_delimiter_matches(
    self, content: str, *, governor: TimeoutCheckable | None = None
) -> list[DelimiterMatch]:
```

Looking at governance.py:33-106, `ResourceGovernor` has a `check_timeout()` method, so this protocol would be satisfied. This provides better type safety and makes the contract explicit.
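To illustrate the structural typing involved, a `runtime_checkable` variant of that sketch can be checked at runtime against any object exposing `check_timeout()`. The `DummyGovernor` below is a stand-in for illustration, not the project's `ResourceGovernor`.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class TimeoutCheckable(Protocol):
    """Structural interface: anything with check_timeout() conforms."""

    def check_timeout(self) -> None: ...

class DummyGovernor:
    """Conforms to the protocol without inheriting from it."""

    def check_timeout(self) -> None:
        pass

assert isinstance(DummyGovernor(), TimeoutCheckable)
assert not isinstance(object(), TimeoutCheckable)
```

Note that `runtime_checkable` only verifies method presence, not signatures; static checkers like mypy enforce the full signature.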
Positive Aspects ✅
- **Performance Impact:** The O(n*m) → O(n+m) optimization is excellent and directly addresses the timeout issue
- **Clean Helper Methods:** `_toggle_string`, `_skip_nesting_comment`, and `_adjust_brace_depth` are well-factored
- **Documentation:** The docstring for `_precompute_nesting_levels` clearly explains the optimization rationale
- **Timeout Checks:** Properly adding `governor.check_timeout()` between phases prevents unbounded work
Additional Recommendations
- **Test Coverage:** Ensure tests cover:
  - Escaped quotes: `"test\\"`, `'test\\\\'`, mixed escaping
  - Large files that previously timed out
  - Edge cases in nesting precomputation
- **Consider Multi-character String Delimiters:** The current implementation only handles single-character quotes. Languages like Python support `"""` (triple quotes). This could be a follow-up enhancement.
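As a rough illustration of that follow-up, a scanner can prefer the longest delimiter at each quote position. The `scan_string_regions` helper below is hypothetical and simplifies escape handling; it is a sketch of the idea, not part of this PR.

```python
def scan_string_regions(content: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of string literals, treating a triple
    quote as one delimiter. Hypothetical sketch with simplified escapes."""
    regions = []
    i = 0
    n = len(content)
    while i < n:
        c = content[i]
        if c in ('"', "'"):
            # Prefer the longest delimiter: triple quote over single
            delim = content[i : i + 3] if content[i : i + 3] == c * 3 else c
            start = i
            i += len(delim)
            while i < n:
                if content[i] == "\\":
                    i += 2  # skip the escaped character
                    continue
                if content.startswith(delim, i):
                    i += len(delim)
                    break
                i += 1
            regions.append((start, min(i, n)))
        else:
            i += 1
    return regions

# The single quote inside the triple-quoted span does not end the string
print(scan_string_regions('x = """a " b""" + "c"'))
```

The key design point is that the opening delimiter is captured once and the inner loop only searches for that exact delimiter, so a lone `"` inside a `"""..."""` region is inert.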
CI Status
The review job is currently failing. Tests for Python 3.12, 3.13, and 3.14 are pending. I recommend:
- Addressing the issues above
- Waiting for CI to complete to ensure the fixes resolve the timeout issues
Summary
The core optimization strategy is sound and should resolve the Python 3.13 performance issues. However, the escaped quote handling bug and duplicate regex execution should be fixed before merging to ensure correctness and maximize the performance gains.
Recommendation: Request changes to address the three high-priority issues before approval.
🤖 Generated with Claude Code
Pull request overview
This PR targets delimiter-based chunking performance and timeout behavior, motivated by Python 3.13 CI timeouts/infinite-loop symptoms, by introducing resource-governed execution and reducing keyword delimiter matching complexity.
Changes:
- Thread an optional `governor` into delimiter match discovery and add a mid-phase `check_timeout()` between explicit and keyword matching.
- Reduce keyword delimiter nesting computation from repeated rescans to a single forward pass with O(1) nesting lookup per keyword match.
- Add helper utilities to support the new single-pass nesting precomputation.
From the diff (src/codeweaver/engine/chunker/delimiter.py) — new single-pass nesting helpers:

```python
def _precompute_nesting_levels(
    self, content: str, positions: list[int]
) -> dict[int, int]:
    """Precompute brace-nesting levels at given positions in a single forward pass.

    Replaces per-position calls to ``_calculate_nesting_level`` which each
    scanned from position 0, yielding O(n * m) total work. This method
    achieves the same result in O(n + m) by walking the content once and
    recording the running brace depth at each requested position.

    Args:
        content: Source code
        positions: Character offsets whose nesting level is needed

    Returns:
        Mapping from position to nesting level (0 = top-level)
    """
    if not positions:
        return {}

    result: dict[int, int] = {}
    sorted_positions = sorted(positions)
    pos_idx = 0
    brace_depth = 0
    in_string = False
    string_char: str | None = None
    content_len = len(content)
    i = 0

    while i < content_len:
        # Record nesting level for every target position we have reached
        while pos_idx < len(sorted_positions) and sorted_positions[pos_idx] <= i:
            result[sorted_positions[pos_idx]] = brace_depth
            pos_idx += 1

        if pos_idx >= len(sorted_positions):
            break  # All positions recorded

        c = content[i]

        # Track string boundaries
        if c in ('"', "'", "`") and (i == 0 or content[i - 1] != "\\"):
            in_string, string_char = self._toggle_string(
                c, in_string=in_string, string_char=string_char
            )
        elif not in_string:
            skip_to = self._skip_nesting_comment(content, i, c, content_len)
            if skip_to is not None:
                i = skip_to
                continue
            brace_depth = self._adjust_brace_depth(c, brace_depth)

        i += 1

    # Any remaining positions beyond the end of content
    for p in sorted_positions[pos_idx:]:
        result[p] = brace_depth

    return result

@staticmethod
def _toggle_string(
    c: str, *, in_string: bool, string_char: str | None
) -> tuple[bool, str | None]:
    """Toggle string state for quote character."""
    if not in_string:
        return True, c
    if c == string_char:
        return False, None
    return in_string, string_char

@staticmethod
def _skip_nesting_comment(content: str, i: int, c: str, content_len: int) -> int | None:
    """Return new index if a comment starts at *i*, else None."""
    two_chars = content[i : i + 2]
    if two_chars == "//" or c == "#":
        nl = content.find("\n", i)
        return nl if nl >= 0 else content_len
    if two_chars == "/*":
        end = content.find("*/", i + 2)
        return end + 2 if end >= 0 else content_len
    return None

@staticmethod
def _adjust_brace_depth(c: str, depth: int) -> int:
    """Adjust brace depth for open/close brace characters."""
    if c == "{":
        return depth + 1
    if c == "}":
        return max(0, depth - 1)
    return depth
```

In `_match_keyword_delimiters`, the per-keyword rescan becomes an O(1) lookup:

```diff
@@ -501,8 +515,8 @@ def _match_keyword_delimiters(
         )

         if struct_end is not None:
-            # Calculate nesting level by counting parent structures
-            nesting_level = self._calculate_nesting_level(content, keyword_pos)
+            # Look up precomputed nesting level (O(1) per keyword)
+            nesting_level = nesting_at.get(keyword_pos, 0)

             # Create a complete match from keyword to closing structure
             # This represents the entire construct (e.g., function...})
```

The call site precomputes nesting before the main match loop:

```diff
+        # Precompute brace-nesting levels at all keyword positions in a single
+        # O(n) forward pass. The previous approach called _calculate_nesting_level
+        # per keyword match, each scanning from position 0, resulting in O(n * m)
+        # total work that caused timeouts on large files (especially Python 3.13).
+        keyword_positions = [m.start() for m in combined_pattern.finditer(content)]
+        nesting_at = self._precompute_nesting_levels(content, keyword_positions)
+
         for match in combined_pattern.finditer(content):
```

And `_find_delimiter_matches` gains a mid-phase timeout check:

```diff
@@ -364,6 +367,10 @@ def _find_delimiter_matches(self, content: str) -> list[DelimiterMatch]:
         # Phase 1: Handle explicit start/end pairs (existing logic)
         matches.extend(self._match_explicit_delimiters(content, explicit_delimiters))

+        # Check timeout between phases to avoid unbounded work
+        if governor is not None:
+            governor.check_timeout()
+
         # Phase 2: Handle keyword delimiters with empty ends
         matches.extend(self._match_keyword_delimiters(content, keyword_delimiters))
```
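The governor hook exercised in the diff above can be sketched in isolation with a minimal deadline-based implementation. `SimpleGovernor` and this `ChunkingTimeoutError` are illustrative stand-ins, not the project's classes.

```python
import time

class ChunkingTimeoutError(RuntimeError):
    """Illustrative timeout error; the name is borrowed from the review discussion."""

class SimpleGovernor:
    """Minimal deadline-based governor: check_timeout() raises once the budget is spent."""

    def __init__(self, budget_s: float) -> None:
        self._deadline = time.monotonic() + budget_s

    def check_timeout(self) -> None:
        if time.monotonic() > self._deadline:
            raise ChunkingTimeoutError("chunking time budget exceeded")

# Phase-boundary usage, mirroring the diff: a cheap check between expensive phases
governor = SimpleGovernor(budget_s=60.0)
governor.check_timeout()  # within budget: returns silently
```

Because `check_timeout()` is just a comparison against a precomputed deadline, calling it between phases adds negligible overhead while bounding the worst-case wall time of the whole matching pipeline.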
Code Review Summary

I've reviewed PR #335 which aims to investigate and fix an infinite loop issue affecting Python 3.13 CI. The changes focus on performance optimizations in the delimiter-based chunking system.

Positive Aspects

- **Performance Optimization:** The PR successfully addresses O(n * m) complexity by precomputing brace-nesting levels in a single O(n + m) forward pass via `_precompute_nesting_levels()`.
- **Timeout Management:** Adding `governor.check_timeout()` between phases (lines 370-372) prevents unbounded work on large files.
- **Code Organization:** New helper methods (`_toggle_string`, `_skip_nesting_comment`, `_adjust_brace_depth`) improve code readability and maintainability.
- **Well-Documented:** The implementation includes clear comments explaining the algorithm and complexity improvements.

Issue to Address

**Double Regex Scan** (as noted in comment #3069768586):

Lines 489-492 execute `combined_pattern.finditer(content)` twice:

- Once to build `keyword_positions`
- Again in the main `for` loop

For very large files, this adds an extra full regex scan. The fix is straightforward:

```python
# Materialize matches once to avoid double regex scan
keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]
nesting_at = self._precompute_nesting_levels(content, keyword_positions)

for match in keyword_matches:
```

This change would maintain the performance benefits while eliminating the redundant regex execution.

Other Observations

- **CI Status:** The `review / review` check is currently failing. The tests for Python 3.12, 3.13, and 3.14 are pending.
- **Code Correctness:** The algorithm logic appears sound - the helper methods correctly handle:
  - String state tracking with escape handling
  - Comment skipping (single-line `//` and `#`, multi-line `/* */`)
  - Brace depth management
- **Edge Cases:** The implementation handles edge cases like positions beyond content end (line 589).

Recommendation

Please address the double regex scan issue mentioned above. Once that's fixed and the CI checks pass (particularly the Python 3.13 tests), this PR should be ready to merge.

The performance improvements are well-designed and should significantly reduce timeout issues on large files.
Great catch on the double regex scan at lines 489-492. This is indeed a performance issue that should be fixed. The solution is to materialize the matches once:

```python
keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]
nesting_at = self._precompute_nesting_levels(content, keyword_positions)

for match in keyword_matches:
```

This maintains the O(n + m) optimization while avoiding the redundant regex scan on large inputs.
Code Review - PR #335: Performance Optimization for Delimiter-Based Chunking

I've reviewed the changes in this PR which address the infinite loop issue and improve performance for Python 3.13. Here's my comprehensive analysis:

Summary of Changes

The PR implements a significant performance optimization by precomputing brace-nesting levels in a single O(n) pass instead of repeatedly scanning from position 0 for each keyword match. This transforms the algorithm from O(n * m) to O(n + m) complexity.

Key changes:

Positive Aspects

✅ **Excellent performance improvement** - The O(n * m) → O(n + m) optimization will dramatically reduce processing time for large files with many keywords

✅ **Proper timeout handling** - Passing the governor through to `_find_delimiter_matches`

✅ **Well-documented** - The docstrings clearly explain the optimization rationale and complexity improvements

✅ **Maintains correctness** - The precomputation achieves the same result as the previous per-position scanning approach

Critical Issue: Code Duplication
Performance Optimization Review
Based on my analysis of the code changes in src/codeweaver/engine/chunker/delimiter.py, here's my comprehensive review:
Summary
This PR introduces significant performance improvements to delimiter-based chunking by optimizing nesting level calculations and adding timeout checks. The changes address a critical O(n²) performance issue that caused timeouts on large files, particularly in Python 3.13.
Key Changes
1. Governor-Based Timeout Checks

- Added `governor` parameter to `_find_delimiter_matches()`
- Inserted timeout checks between delimiter matching phases
- **Assessment:** ✅ Good defensive programming to prevent unbounded work

2. Precomputed Nesting Levels (Lines 485-490, 534-592)

- Introduced `_precompute_nesting_levels()` method that computes brace nesting in O(n) instead of O(n*m)
- Replaced per-keyword `_calculate_nesting_level()` calls with O(1) dictionary lookups
- **Assessment:** ✅ Excellent optimization with clear performance benefit

3. Helper Methods

- `_toggle_string()`: Manages string state transitions
- `_skip_nesting_comment()`: Handles comment skipping during parsing
- `_adjust_brace_depth()`: Updates brace depth counters
- **Assessment:** ✅ Good refactoring for code clarity and reuse
Code Quality Analysis
Strengths:
- ✅ Well-documented: Excellent docstring explaining the O(n*m) → O(n+m) improvement
- ✅ Maintains correctness: The optimization doesn't change semantic behavior
- ✅ Clean separation: Helper methods are static where appropriate
- ✅ Consistent style: Follows existing code conventions
Concerns:
1. **Line 606-615: Comment Detection Logic**

   ```python
   if two_chars == "//" or c == "#":
   ```

   ⚠️ Mixing `two_chars` and single char `c` checks feels inconsistent
   Recommendation: Consider `two_chars in ("/", "##")` or similar

2. **Line 575-578: Escape Character Handling**

   ```python
   if c in ('"', "'", "`") and (i == 0 or content[i - 1] != "\\"):
   ```

   ⚠️ Doesn't handle escaped backslashes (`\\"` should still be a boundary)
   Recommendation: Consider handling consecutive backslashes

3. **Line 519: Dictionary Lookup**

   ```python
   nesting_level = nesting_at.get(keyword_pos, 0)
   ```

   - Positions should always be in dict if precomputation worked correctly
   - Recommendation: Consider assertion in debug mode to catch logic errors
Performance Impact
- Before: O(n * m) → quadratic scaling
- After: O(n + m) → linear scaling
- Impact: Massive improvement on large files with many keywords
CI Status
- The review workflow failed due to Gemini API quota exhaustion (HTTP 429), NOT code issues
- Python 3.12/3.13/3.14 tests are still pending
- Action needed: Wait for test results before merging
Recommendations
Before merging:
- ✅ Wait for Python test results to pass
- ⚠️ Add tests for escape sequence edge cases
- ⚠️ Consider adding performance regression tests
Future improvements:
- Extract string/comment parsing to shared utility (DRY)
- Add benchmarking to track performance
- Consider caching precomputed results
Verdict
Conditional Approval ✅ (pending tests)
The code demonstrates solid engineering with a well-thought-out performance optimization. The implementation is clean, well-documented, and maintains correctness.
Once CI tests pass and the escape sequence concern is addressed, this is ready to merge.
Great work on this optimization! The precomputation approach is elegant and significantly improves performance on large files. 🚀
PR Review Summary
This PR addresses Python 3.13 CI timeout issues through performance optimizations in delimiter-based chunking. The changes are well-designed and implement solid algorithmic improvements, but the review comment is valid - additional test coverage would strengthen confidence in the timeout governance behavior.
What This PR Does Well ✅
1. Algorithmic Optimization (Lines 475-531)
The core improvement is excellent. The PR eliminates O(n×m) complexity by precomputing brace-nesting levels in a single O(n+m) pass:
```python
# Before: _calculate_nesting_level called per keyword match (quadratic)
nesting_level = self._calculate_nesting_level(content, keyword_pos)

# After: Single forward pass precomputes all levels (linear)
nesting_at = self._precompute_nesting_levels(content, keyword_positions)
nesting_level = nesting_at.get(keyword_pos, 0)
```

This is a textbook performance optimization that directly addresses the root cause of timeouts on large files.
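To make the before/after concrete, here is a stripped-down standalone comparison (braces only, no string or comment handling; these free functions are illustrative, not the project's methods). It checks that one forward pass reproduces the per-position rescan results.

```python
def naive_nesting(content: str, pos: int) -> int:
    # O(pos) rescan from the start for every query (the "before" shape)
    depth = 0
    for c in content[:pos]:
        if c == "{":
            depth += 1
        elif c == "}":
            depth = max(0, depth - 1)
    return depth

def precompute_nesting(content: str, positions: list[int]) -> dict[int, int]:
    # Single forward pass; record the running depth on passing each offset
    result: dict[int, int] = {}
    targets = sorted(positions)
    idx = 0
    depth = 0
    for i, c in enumerate(content):
        while idx < len(targets) and targets[idx] <= i:
            result[targets[idx]] = depth
            idx += 1
        if c == "{":
            depth += 1
        elif c == "}":
            depth = max(0, depth - 1)
    for p in targets[idx:]:  # positions at or beyond end of content
        result[p] = depth
    return result

src = "fn a() { if x { y } } fn b() { }"
pts = [0, 9, 16, 22]
assert precompute_nesting(src, pts) == {p: naive_nesting(src, p) for p in pts}
```

With m query positions over n characters, the dict comprehension on the right costs O(n * m) while the left side costs O(n + m log m), which is the same asymptotic win the PR claims.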
2. Clean Helper Functions (Lines 533-621)
The implementation is well-factored:
- `_precompute_nesting_levels`: Main optimization logic
- `_toggle_string`: String state tracking
- `_skip_nesting_comment`: Comment handling
- `_adjust_brace_depth`: Brace depth management
Each helper has a single, clear responsibility and improves readability.
3. Timeout Governance Threading (Lines 342-345, 370-373)
Threading the governor through _find_delimiter_matches and adding mid-phase timeout checks is the right approach:
- Prevents unbounded work between explicit and keyword delimiter matching
- Maintains consistent timeout behavior throughout the chunking pipeline
- Uses a keyword-only argument (`*, governor`) for clarity
Areas for Improvement 📝
Test Coverage Gap (Addressing Comment #3069768602)
The review comment is valid and important. While the optimization logic looks correct, the new timeout governance behavior lacks targeted regression tests. Recommended additions:
Test 1: Mid-Phase Timeout Check
```python
def test_delimiter_matching_timeout_between_phases(chunker_settings):
    """Verify check_timeout is called between explicit and keyword delimiter phases."""
    from unittest.mock import Mock

    from codeweaver.engine import DelimiterChunker, ResourceGovernor

    # Mock governor that tracks check_timeout calls
    mock_governor = Mock(spec=ResourceGovernor)
    chunker = DelimiterChunker(settings=chunker_settings)
    content = """
    function foo() { }
    class Bar { }
    def baz(): pass
    """
    # Call _find_delimiter_matches with the mock governor
    chunker._find_delimiter_matches(content, governor=mock_governor)

    # Verify check_timeout was called at least once (between phases)
    assert mock_governor.check_timeout.call_count >= 1
```
Test 2: Timeout Aborts Before Keyword Matching
```python
def test_delimiter_matching_aborts_on_timeout_before_keyword_phase():
    """Verify timeout during explicit phase prevents keyword phase execution."""
    from unittest.mock import Mock

    import pytest

    from codeweaver.engine import ChunkingTimeoutError, DelimiterChunker

    # Governor that times out between phases
    mock_governor = Mock()
    mock_governor.check_timeout.side_effect = ChunkingTimeoutError("Timeout")

    chunker = DelimiterChunker(...)
    content = "class Foo { }"  # Has keyword delimiter

    with pytest.raises(ChunkingTimeoutError):
        chunker._find_delimiter_matches(content, governor=mock_governor)

    # Keyword matching should not have been attempted
    # (can verify via instrumentation or side effects)
```
Code Quality Notes
- Type Hint Consistency (Line 344): `governor: Any | None = None`. Consider using the proper `ResourceGovernor` type instead of `Any` for better type safety, or document why `Any` is necessary.
- Early Exit Optimization (Lines 390-391): the early exit when all positions are recorded is good (`if pos_idx >= len(sorted_positions): break  # All positions recorded`). Consider adding a comment explaining that this prevents unnecessary iteration.
- Documentation (Lines 534-544): the docstring for `_precompute_nesting_levels` excellently explains the performance improvement and complexity reduction. This is great technical documentation.
Performance & Correctness ✅
- String/Comment Handling: properly handles strings, comments, and escape sequences
- Brace Depth Safety: `max(0, depth - 1)` prevents negative depth
- Position Sorting: correctly sorts positions before iteration
- Edge Cases: handles empty position lists and positions beyond content length
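On the escape-sequence point specifically, the subtlety is that a quote preceded by a backslash must not close the string, while a doubled backslash must not escape the following quote. A minimal sketch of string-state tracking with an explicit escape flag (illustrative only; the PR organizes this as per-character helpers such as `_toggle_string`):

```python
def scan_string_state(content: str) -> list[bool]:
    """Return, for each index, whether that character is part of a string literal."""
    inside: list[bool] = []
    quote: str | None = None  # active quote character, or None
    escaped = False
    for ch in content:
        in_str = quote is not None
        if escaped:
            # Previous char was a backslash inside a string: consume this
            # char without interpreting it (handles \" and \\ correctly)
            escaped = False
        elif quote is not None:
            if ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None   # unescaped matching quote closes the string
                in_str = True  # the closing quote itself is part of the literal
        elif ch in ("'", '"'):
            quote = ch  # opening quote
            in_str = True
        inside.append(in_str)
    return inside
```

A nesting scanner can then ignore any brace at an index where this state is true, so braces inside strings never corrupt the depth count.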
Security Considerations ✅
No security concerns. The changes are purely computational optimizations for defensive tooling (code analysis). The timeout governance additions actually improve security by preventing resource exhaustion attacks via maliciously crafted input files.
Recommendation
Approve with minor request: The algorithmic improvements are excellent and should significantly reduce timeout issues in Python 3.13 CI. However, I recommend adding the suggested timeout governance tests before merging to:
- Prevent regression if the governor threading logic changes
- Document the expected timeout behavior
- Ensure CI timeouts remain detectable in future refactoring
The code is production-ready, but the additional test coverage would provide valuable insurance against future breakage.
Code Review Score: 8.5/10
- Excellent performance optimization ⭐⭐⭐⭐⭐
- Clean code structure ⭐⭐⭐⭐
- Good documentation ⭐⭐⭐⭐
- Missing targeted tests for new timeout behavior ⚠️
Suggested Action: Add timeout governance tests, then approve and merge.
Reply to comment #3069768602:
You're absolutely right about the test coverage gap. The timeout governance behavior is critical for preventing CI regressions, especially with Python 3.13.
I've detailed two specific test cases in my main review that would address this:
- Verify mid-phase timeout checks: mock the governor and assert `check_timeout()` is invoked between the explicit and keyword delimiter matching phases
- Verify timeout aborts keyword matching: simulate a timeout exception during the phase transition and confirm the keyword matching phase is never entered
These tests would:
- Prevent future regressions if someone refactors the governor threading
- Document the intended behavior for future maintainers
- Catch issues in CI before they cause production timeouts
The pattern follows the existing test style in `test_governance.py` (lines 39-49), which already uses mocking to verify timeout behavior.
@copilot The PR comments focus on similar issues but conflict with one another in suggested implementation. I'd like you to review them all, assess which are likely to result in the cleanest, most maintainable and optimal outcome, and implement it. Relevant discussion in:
Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/945ba2ae-3268-455f-bcba-0fbd07f80b26 Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Updates applied: consolidated nesting parsing around shared helpers, streamlined keyword hit tracking to avoid a second regex scan while keeping memory lean, aligned _calculate_nesting_level with the shared path, added a governor timeout phase-order test using a recording chunker, and ignored/removed coverage artifacts. Commits: bf5751d, 335334a, 2729b53. No UI changes.
Pull request created by AI Agent
Summary by Sourcery
Improve delimiter-based chunking performance and timeout behavior when scanning large source files.
Enhancements: