
Improve delimiter chunking performance and governor checks for Python 3.13 CI#335

Merged
bashandbone merged 5 commits into main from
copilot/investigate-infinite-loop-python-3-13
Apr 12, 2026

Conversation

Contributor

Copilot AI commented Apr 12, 2026

Pull request created by AI Agent

Summary by Sourcery

Improve delimiter-based chunking performance and timeout behavior when scanning large source files.

Enhancements:

  • Pass the resource governor into delimiter matching to allow timeout checks between matching phases.
  • Precompute brace nesting levels for keyword delimiter positions in a single pass to avoid quadratic-time scans on large files.
  • Reuse existing string/comment parsing helpers so nesting calculations and delimiter matching stay consistent.
  • Add a unit test to assert the mid-phase governor timeout check is invoked between explicit and keyword matching.

Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/4434677c-67a3-47ee-82d6-dd1290f8b94c

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 12, 2026 16:40
@sourcery-ai
Contributor

sourcery-ai Bot commented Apr 12, 2026

Reviewer's Guide

Threads an optional resource governor into delimiter matching and replaces per-keyword nesting recomputation with a single-pass precomputation to make keyword delimiter matching linear-time and more timeout-safe.

Sequence diagram for delimiter matching with resource governor and nesting precomputation

```mermaid
sequenceDiagram
    actor Governor
    participant Caller
    participant DelimiterChunker
    participant KeywordMatcher as _match_keyword_delimiters
    participant NestingPrecomp as _precompute_nesting_levels

    Caller->>DelimiterChunker: _get_matches_with_fallback(content, governor)
    DelimiterChunker->>Governor: check_timeout()
    DelimiterChunker->>DelimiterChunker: _find_delimiter_matches(content, governor)

    rect rgb(235, 245, 255)
        DelimiterChunker->>DelimiterChunker: _match_explicit_delimiters(content)
    end

    alt governor provided
        DelimiterChunker->>Governor: check_timeout()
    end

    rect rgb(235, 245, 255)
        DelimiterChunker->>KeywordMatcher: _match_keyword_delimiters(content, keyword_delimiters)
        KeywordMatcher->>NestingPrecomp: _precompute_nesting_levels(content, keyword_positions)
        NestingPrecomp-->>KeywordMatcher: nesting_at
        loop for each keyword match
            KeywordMatcher->>KeywordMatcher: lookup nesting_at[keyword_pos]
            KeywordMatcher->>KeywordMatcher: build DelimiterMatch using struct_end and nesting_level
        end
    end

    DelimiterChunker-->>Caller: matches

    opt no matches
        DelimiterChunker->>DelimiterChunker: _fallback_paragraph_chunking(content)
        DelimiterChunker-->>Caller: fallback matches
    end
```

Updated class diagram for delimiter keyword matching and nesting helpers

```mermaid
classDiagram
    class DelimiterChunker {
        +_get_matches_with_fallback(content: str, governor: Any) list~DelimiterMatch~
        +_enforce_chunk_limit(chunks: list~CodeChunk~, file_path: Path | None) void
        +_find_delimiter_matches(content: str, governor: Any | None) list~DelimiterMatch~
        +_match_keyword_delimiters(content: str, keyword_delimiters: list~Delimiter~) list~DelimiterMatch~
        +_precompute_nesting_levels(content: str, positions: list~int~) dict~int, int~
        +_calculate_nesting_level(content: str, pos: int) int
        +_toggle_string(c: str, in_string: bool, string_char: str | None) tuple~bool, str | None~
        +_skip_nesting_comment(content: str, i: int, c: str, content_len: int) int | None
        +_adjust_brace_depth(c: str, depth: int) int
    }

    class ResourceGovernor {
        +check_timeout() void
    }

    class DelimiterMatch
    class CodeChunk
    class Delimiter
    class Path

    DelimiterChunker ..> ResourceGovernor : uses
    DelimiterChunker ..> DelimiterMatch : creates
    DelimiterChunker ..> CodeChunk : manages
    DelimiterChunker ..> Delimiter : matches
    DelimiterChunker ..> Path : reads_path
```

File-Level Changes

Change Details Files
Propagate an optional resource governor into delimiter matching and add an inter-phase timeout check.
  • Update _get_matches_with_fallback to pass the governor into _find_delimiter_matches
  • Extend _find_delimiter_matches signature to accept a keyword-only governor parameter
  • Insert a governor.check_timeout() call between explicit delimiter and keyword delimiter matching phases
src/codeweaver/engine/chunker/delimiter.py
Optimize keyword delimiter matching by precomputing brace nesting levels in a single pass instead of per-keyword scans.
  • Precompute keyword positions from the combined keyword regex and derive a position-to-nesting map via a new _precompute_nesting_levels helper
  • Use precomputed nesting levels in _match_keyword_delimiters instead of calling _calculate_nesting_level for each match
  • Introduce helper utilities _toggle_string, _skip_nesting_comment, and _adjust_brace_depth to support efficient, comment- and string-aware nesting computation
src/codeweaver/engine/chunker/delimiter.py

Possibly linked issues

  • #(not specified): PR optimizes chunker delimiter keyword matching and adds timeout checks, directly targeting the 3.13 performance/timeout regressions.


Copilot AI requested a review from bashandbone April 12, 2026 16:44
@bashandbone bashandbone marked this pull request as ready for review April 12, 2026 16:45
Copilot AI review requested due to automatic review settings April 12, 2026 16:45
@github-actions
Contributor

🤖 Hi @Copilot, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions
Contributor

🤖 I'm sorry @Copilot, but I was unable to process your request. Please see the logs for more details.

Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The new string-tracking logic in _precompute_nesting_levels only checks content[i - 1] != "\\" to detect escaped quotes, which will mis-handle cases like \\" (even number of backslashes); consider using a parity check on consecutive backslashes so you don't incorrectly toggle in_string inside escaped quotes.
  • You call combined_pattern.finditer(content) twice in _match_keyword_delimiters (once to build keyword_positions and once in the main loop); consider collecting the matches in a list so you can both precompute positions and iterate once, avoiding a second regex pass over the content.
  • The new governor parameter on _find_delimiter_matches is typed as Any | None; if the governor interface is stable, tightening this to a protocol or concrete type with check_timeout() would make the contract clearer and prevent accidental misuse.

Individual Comments

Comment 1 (src/codeweaver/engine/chunker/delimiter.py, lines 485-489)

         for delimiter in keyword_delimiters:
             delimiter_map.setdefault(delimiter.start, []).append(delimiter)

+        # Precompute brace-nesting levels at all keyword positions in a single
+        # O(n) forward pass.  The previous approach called _calculate_nesting_level
+        # per keyword match, each scanning from position 0, resulting in O(n * m)
+        # total work that caused timeouts on large files (especially Python 3.13).
+        keyword_positions = [m.start() for m in combined_pattern.finditer(content)]
+        nesting_at = self._precompute_nesting_levels(content, keyword_positions)
+
Suggestion (performance): Avoid running the same regex twice by caching matches before computing positions and iterating.

`combined_pattern.finditer(content)` is currently called once to build `keyword_positions` and again in the main loop, doubling regex work on large files. Consider calling `finditer` once, storing the matches in a list, deriving `keyword_positions` from that list, and then reusing the same list when constructing `DelimiterMatch` instances to avoid the extra pass.

Suggested implementation:

```python
        # Precompute brace-nesting levels at all keyword positions in a single
        # O(n) forward pass.  The previous approach called _calculate_nesting_level
        # per keyword match, each scanning from position 0, resulting in O(n * m)
        # total work that caused timeouts on large files (especially Python 3.13).
        # Materialize matches once so we can reuse them both for position
        # computation and when constructing DelimiterMatch instances.
        keyword_matches = list(combined_pattern.finditer(content))
        keyword_positions = [m.start() for m in keyword_matches]
        nesting_at = self._precompute_nesting_levels(content, keyword_positions)

```

```python
        for match in keyword_matches:

```

The second SEARCH block assumes the main keyword-matching loop literally uses `for match in combined_pattern.finditer(content):`. If the surrounding code differs (e.g. assigns the iterator to a variable first or uses a comprehension), adjust that SEARCH pattern to replace whatever second invocation of `combined_pattern.finditer(content)` is used to iterate over keyword matches with `keyword_matches` instead. Ensure that:
1. The declaration of `keyword_matches = list(combined_pattern.finditer(content))` is in the same scope and precedes any use of `keyword_matches`.
2. No other code path relies on `combined_pattern.finditer(content)` being a fresh iterator (in typical usage it does not, so replacing with the precomputed list is safe).


Comment thread on src/codeweaver/engine/chunker/delimiter.py (outdated)
Contributor

@github-actions github-actions Bot left a comment


Code Review for PR #335

I've reviewed the performance optimization changes for delimiter-based chunking. This PR addresses Python 3.13 CI timeout issues by optimizing the keyword delimiter matching algorithm from O(n*m) to O(n+m) complexity. Overall, the changes are well-structured and achieve their goal, but there are several important improvements needed.

High Priority Issues

1. Escaped Quote Handling Bug ⚠️

Location: src/codeweaver/engine/chunker/delimiter.py:575

The escape detection logic in _precompute_nesting_levels is incorrect:

if c in ('"', "'", "`") and (i == 0 or content[i - 1] != "\\"):

Problem: This fails for cases like \\" (even number of backslashes). The string "test\\" would incorrectly toggle the string state because content[i-1] is \, but this is actually an escaped backslash, not an escape for the quote.

Fix: Count consecutive backslashes to determine if the quote is truly escaped:

def _is_quote_escaped(content: str, pos: int) -> bool:
    """Check if quote at position is escaped by counting backslashes."""
    if pos == 0:
        return False
    
    # Count consecutive backslashes before quote
    backslash_count = 0
    check_pos = pos - 1
    while check_pos >= 0 and content[check_pos] == "\\":
        backslash_count += 1
        check_pos -= 1
    
    # Odd number of backslashes means the quote is escaped
    return backslash_count % 2 == 1

# Then use:
if c in ('"', "'", "`") and not self._is_quote_escaped(content, i):
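
To see why the parity rule matters, two quick checks against the helper above (standalone form of the function; the inputs are illustrative):

```python
# 'say \" now'  -> one backslash before the quote at index 5 -> escaped
assert _is_quote_escaped('say \\" now', 5) is True
# 'path\\"x'    -> two backslashes before the quote at index 6 -> not escaped
assert _is_quote_escaped('path\\\\"x', 6) is False
```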

This same bug exists in other methods (_calculate_nesting_level:647, _update_string_state:754, _process_string_in_matching:927). All should be fixed consistently.

2. Duplicate Regex Execution 🔄

Location: src/codeweaver/engine/chunker/delimiter.py:489-492

The code calls combined_pattern.finditer(content) twice:

  1. Line 489: To extract positions for _precompute_nesting_levels
  2. Line 492: To iterate and create matches

Impact: On large files, regex matching can be expensive, effectively doubling the regex work.

Fix: Materialize matches once:

# Materialize matches once to avoid duplicate regex work
keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]
nesting_at = self._precompute_nesting_levels(content, keyword_positions)

for match in keyword_matches:  # Use materialized matches
    matched_text = match.group(0)
    # ... rest of loop

3. Type Safety: Governor Parameter 📝

Location: src/codeweaver/engine/chunker/delimiter.py:344

The governor parameter is typed as Any | None, which defeats type checking benefits.

Current:

def _find_delimiter_matches(
    self, content: str, *, governor: Any | None = None
) -> list[DelimiterMatch]:

Recommended: Create a Protocol for the governor interface:

from typing import Protocol

class TimeoutCheckable(Protocol):
    """Protocol for objects that support timeout checking."""
    def check_timeout(self) -> None: ...

def _find_delimiter_matches(
    self, content: str, *, governor: TimeoutCheckable | None = None
) -> list[DelimiterMatch]:

Looking at governance.py:33-106, ResourceGovernor has a check_timeout() method, so this protocol would be satisfied. This provides better type safety and makes the contract explicit.
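
As a quick illustration of why a Protocol fits here: matching is structural, so no inheritance is needed (FakeGovernor is a hypothetical stand-in, not part of the codebase):

```python
from typing import Protocol


class TimeoutCheckable(Protocol):
    """Protocol for objects that support timeout checking."""

    def check_timeout(self) -> None: ...


class FakeGovernor:
    """Hypothetical governor; satisfies the protocol without inheriting it."""

    def check_timeout(self) -> None:
        print("timeout checked")


def run_phase(governor: TimeoutCheckable | None = None) -> None:
    # Mirrors the PR's mid-phase pattern: check only when a governor is supplied.
    if governor is not None:
        governor.check_timeout()


run_phase(FakeGovernor())  # accepted by type checkers: structural, not nominal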

Positive Aspects ✅

  1. Performance Impact: The O(n*m) → O(n+m) optimization is excellent and directly addresses the timeout issue
  2. Clean Helper Methods: _toggle_string, _skip_nesting_comment, and _adjust_brace_depth are well-factored
  3. Documentation: The docstring for _precompute_nesting_levels clearly explains the optimization rationale
  4. Timeout Checks: Properly adding governor.check_timeout() between phases prevents unbounded work

Additional Recommendations

  1. Test Coverage: Ensure tests cover:

    • Escaped quotes: "test\\", 'test\\\\', mixed escaping
    • Large files that previously timed out
    • Edge cases in nesting precomputation
  2. Consider Multi-character String Delimiters: The current implementation only handles single-character quotes. Languages like Python support """ (triple quotes). This could be a follow-up enhancement; a possible shape is sketched below.
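
A minimal sketch of such a triple-quote check (hypothetical helper; a real integration would also coordinate with the scanner's string state):

```python
def skip_triple_quote(content: str, i: int) -> int | None:
    """Return the index just past a triple-quoted string starting at i, else None."""
    for q in ('"""', "'''"):
        if content.startswith(q, i):
            end = content.find(q, i + 3)
            return end + 3 if end >= 0 else len(content)
    return None


assert skip_triple_quote('x = """a {not code} b""" + y', 4) == 24
```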

CI Status

The review job is currently failing. Tests for Python 3.12, 3.13, and 3.14 are pending. I recommend:

  1. Addressing the issues above
  2. Waiting for CI to complete to ensure the fixes resolve the timeout issues

Summary

The core optimization strategy is sound and should resolve the Python 3.13 performance issues. However, the escaped quote handling bug and duplicate regex execution should be fixed before merging to ensure correctness and maximize the performance gains.

Recommendation: Request changes to address the three high-priority issues before approval.

🤖 Generated with Claude Code

Contributor

Copilot AI left a comment


Pull request overview

This PR targets delimiter-based chunking performance and timeout behavior, motivated by Python 3.13 CI timeouts/infinite-loop symptoms, by introducing resource-governed execution and reducing keyword delimiter matching complexity.

Changes:

  • Thread an optional governor into delimiter match discovery and add a mid-phase check_timeout() between explicit and keyword matching.
  • Reduce keyword delimiter nesting computation from repeated rescans to a single forward pass with O(1) nesting lookup per keyword match.
  • Add helper utilities to support the new single-pass nesting precomputation.


Comment on lines +534 to +616

    def _precompute_nesting_levels(
        self, content: str, positions: list[int]
    ) -> dict[int, int]:
        """Precompute brace-nesting levels at given positions in a single forward pass.

        Replaces per-position calls to ``_calculate_nesting_level`` which each
        scanned from position 0, yielding O(n * m) total work. This method
        achieves the same result in O(n + m) by walking the content once and
        recording the running brace depth at each requested position.

        Args:
            content: Source code
            positions: Character offsets whose nesting level is needed

        Returns:
            Mapping from position to nesting level (0 = top-level)
        """
        if not positions:
            return {}

        result: dict[int, int] = {}
        sorted_positions = sorted(positions)
        pos_idx = 0
        brace_depth = 0
        in_string = False
        string_char: str | None = None
        content_len = len(content)
        i = 0

        while i < content_len:
            # Record nesting level for every target position we have reached
            while pos_idx < len(sorted_positions) and sorted_positions[pos_idx] <= i:
                result[sorted_positions[pos_idx]] = brace_depth
                pos_idx += 1

            if pos_idx >= len(sorted_positions):
                break  # All positions recorded

            c = content[i]

            # Track string boundaries
            if c in ('"', "'", "`") and (i == 0 or content[i - 1] != "\\"):
                in_string, string_char = self._toggle_string(
                    c, in_string=in_string, string_char=string_char
                )
            elif not in_string:
                skip_to = self._skip_nesting_comment(content, i, c, content_len)
                if skip_to is not None:
                    i = skip_to
                    continue
                brace_depth = self._adjust_brace_depth(c, brace_depth)

            i += 1

        # Any remaining positions beyond the end of content
        for p in sorted_positions[pos_idx:]:
            result[p] = brace_depth

        return result

    @staticmethod
    def _toggle_string(
        c: str, *, in_string: bool, string_char: str | None
    ) -> tuple[bool, str | None]:
        """Toggle string state for quote character."""
        if not in_string:
            return True, c
        if c == string_char:
            return False, None
        return in_string, string_char

    @staticmethod
    def _skip_nesting_comment(content: str, i: int, c: str, content_len: int) -> int | None:
        """Return new index if a comment starts at *i*, else None."""
        two_chars = content[i : i + 2]
        if two_chars == "//" or c == "#":
            nl = content.find("\n", i)
            return nl if nl >= 0 else content_len
        if two_chars == "/*":
            end = content.find("*/", i + 2)
            return end + 2 if end >= 0 else content_len
        return None

    @staticmethod
    def _adjust_brace_depth(c: str, depth: int) -> int:
        """Adjust brace depth for open/close brace characters."""
        if c == "{":
            return depth + 1
        if c == "}":
            return max(0, depth - 1)
        return depth

Comment on lines 485 to 626

    @@ -501,8 +515,8 @@ def _match_keyword_delimiters(
                 )

                 if struct_end is not None:
    -                # Calculate nesting level by counting parent structures
    -                nesting_level = self._calculate_nesting_level(content, keyword_pos)
    +                # Look up precomputed nesting level (O(1) per keyword)
    +                nesting_level = nesting_at.get(keyword_pos, 0)

                     # Create a complete match from keyword to closing structure
                     # This represents the entire construct (e.g., function...})
    @@ -517,6 +531,98 @@ def _match_keyword_delimiters(

             return matches

Comment on lines 485 to 492

        # Precompute brace-nesting levels at all keyword positions in a single
        # O(n) forward pass.  The previous approach called _calculate_nesting_level
        # per keyword match, each scanning from position 0, resulting in O(n * m)
        # total work that caused timeouts on large files (especially Python 3.13).
        keyword_positions = [m.start() for m in combined_pattern.finditer(content)]
        nesting_at = self._precompute_nesting_levels(content, keyword_positions)

        for match in combined_pattern.finditer(content):

Comment on lines 343 to 375

    @@ -364,6 +367,10 @@ def _find_delimiter_matches(self, content: str) -> list[DelimiterMatch]:
             # Phase 1: Handle explicit start/end pairs (existing logic)
             matches.extend(self._match_explicit_delimiters(content, explicit_delimiters))

    +        # Check timeout between phases to avoid unbounded work
    +        if governor is not None:
    +            governor.check_timeout()
    +
             # Phase 2: Handle keyword delimiters with empty ends
             matches.extend(self._match_keyword_delimiters(content, keyword_delimiters))

Contributor

@github-actions github-actions Bot left a comment


Code Review Summary

I've reviewed PR #335 which aims to investigate and fix an infinite loop issue affecting Python 3.13 CI. The changes focus on performance optimizations in the delimiter-based chunking system.

Positive Aspects

  1. Performance Optimization: The PR successfully addresses O(n * m) complexity by precomputing brace-nesting levels in a single O(n + m) forward pass via _precompute_nesting_levels().

  2. Timeout Management: Adding governor.check_timeout() between phases (lines 370-372) prevents unbounded work on large files.

  3. Code Organization: New helper methods (_toggle_string, _skip_nesting_comment, _adjust_brace_depth) improve code readability and maintainability.

  4. Well-Documented: The implementation includes clear comments explaining the algorithm and complexity improvements.

Issue to Address

Double Regex Scan (as noted in comment #3069768586):
Lines 489-492 execute combined_pattern.finditer(content) twice:

  • Once to build keyword_positions
  • Again in the main for loop

For very large files, this adds an extra full regex scan. The fix is straightforward:

# Materialize matches once to avoid double regex scan
keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]
nesting_at = self._precompute_nesting_levels(content, keyword_positions)

for match in keyword_matches:

This change would maintain the performance benefits while eliminating the redundant regex execution.

Other Observations

  1. CI Status: The review / review check is currently failing. The tests for Python 3.12, 3.13, and 3.14 are pending.

  2. Code Correctness: The algorithm logic appears sound - the helper methods correctly handle:

    • String state tracking with escape handling
    • Comment skipping (single-line // and #, multi-line /* */)
    • Brace depth management
  3. Edge Cases: The implementation handles edge cases like positions beyond content end (line 589).

Recommendation

Please address the double regex scan issue mentioned above. Once that's fixed and the CI checks pass (particularly the Python 3.13 tests), this PR should be ready to merge.

The performance improvements are well-designed and should significantly reduce timeout issues on large files.

@github-actions
Contributor

Great catch on the double regex scan at lines 489-492. This is indeed a performance issue that should be fixed.

The solution is to materialize the matches once:

keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]
nesting_at = self._precompute_nesting_levels(content, keyword_positions)

for match in keyword_matches:

This maintains the O(n + m) optimization while avoiding the redundant regex scan on large inputs.

@github-actions
Contributor

Code Review - PR #335: Performance Optimization for Delimiter-Based Chunking

I've reviewed the changes in this PR which address the infinite loop issue and improve performance for Python 3.13. Here's my comprehensive analysis:

Summary of Changes

The PR implements a significant performance optimization by precomputing brace-nesting levels in a single O(n) pass instead of repeatedly scanning from position 0 for each keyword match. This transforms the algorithm from O(n * m) to O(n + m) complexity.

Key changes:

  1. Added _precompute_nesting_levels() method to calculate nesting levels for all keyword positions in one pass
  2. Added timeout checks between delimiter matching phases via the governor parameter
  3. Introduced three new helper methods: _toggle_string(), _skip_nesting_comment(), and _adjust_brace_depth()

Positive Aspects

Excellent performance improvement - The O(n * m) → O(n + m) optimization will dramatically reduce processing time for large files with many keywords

Proper timeout handling - Passing the governor through to _find_delimiter_matches enables timeout checks between phases, preventing unbounded work

Well-documented - The docstrings clearly explain the optimization rationale and complexity improvements

Maintains correctness - The precomputation achieves the same result as the previous per-position scanning approach

Critical Issue: Code Duplication ⚠️

The comment from ID 3069768548 raises a valid and important concern about code duplication. The new _precompute_nesting_levels() method introduces its own mini-parser with these helpers:

  • _toggle_string() (lines 595-603)
  • _skip_nesting_comment() (lines 606-615)
  • _adjust_brace_depth() (lines 618-624)

These duplicate existing helpers in the same file:

  • _is_string_boundary() + _update_string_state() (lines 727-756)
  • _skip_comment() + _skip_comment_in_matching() (lines 758-953)

Why this matters:

  1. Diverging behavior - The new _skip_nesting_comment() detects # comments differently (just checks c == "#") than existing code (checks two_chars in ("//", "#"))
  2. Maintenance burden - Future bug fixes must be applied to multiple implementations
  3. Future drift risk - The implementations may diverge further over time, leading to subtle bugs

Recommendations

High Priority:

  1. Refactor to reuse existing helpers - Consider consolidating string/comment skipping logic into shared implementations that both the precomputation and existing matching logic can use
  2. Ensure consistent # comment handling - The new code's simpler c == "#" check may behave differently than the existing two_chars in ("//", "#") pattern

Medium Priority:
3. Add unit tests - Specifically test the _precompute_nesting_levels() method with edge cases (nested braces in strings, comments, etc.)
4. Performance benchmarks - Document the actual performance improvement on large files to validate the optimization

Low Priority:
5. Consider extracting a StringCommentParser helper class that encapsulates the shared parsing logic; a rough sketch follows below
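
One possible shape for that consolidation (hypothetical names; a real implementation would have to match the existing helpers' exact semantics):

```python
class StringCommentParser:
    """Shared scanner state: string tracking plus comment skipping."""

    def __init__(self) -> None:
        self.in_string = False
        self.string_char: str | None = None

    @staticmethod
    def _is_escaped(content: str, pos: int) -> bool:
        # Parity check: an odd number of preceding backslashes escapes the char.
        n = 0
        while pos - n - 1 >= 0 and content[pos - n - 1] == "\\":
            n += 1
        return n % 2 == 1

    def handle_quote(self, content: str, i: int) -> bool:
        """Update string state; return True if the char was consumed as a quote."""
        c = content[i]
        if c not in ('"', "'", "`") or self._is_escaped(content, i):
            return False
        if not self.in_string:
            self.in_string, self.string_char = True, c
        elif c == self.string_char:
            self.in_string, self.string_char = False, None
        return True

    def skip_comment(self, content: str, i: int) -> int | None:
        """Return the index just past a comment starting at i, else None."""
        two = content[i : i + 2]
        if two == "//" or content[i] == "#":
            nl = content.find("\n", i)
            return nl if nl >= 0 else len(content)
        if two == "/*":
            end = content.find("*/", i + 2)
            return end + 2 if end >= 0 else len(content)
        return None
```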

Test Status

  • Python 3.12, 3.13, and 3.14 tests are currently in progress
  • The review job failure appears to be a Gemini CLI infrastructure issue, not related to code changes

Verdict

The performance optimization is sound and valuable, but the code duplication issue should be addressed before merging to prevent future maintenance problems. I recommend:

  1. Refactoring to consolidate the duplicate string/comment parsing logic
  2. Waiting for test results to confirm the fix resolves the Python 3.13 issues

Status: Requesting changes - please address the code duplication concern raised in the review comment.

Contributor

@github-actions github-actions Bot left a comment


Performance Optimization Review

Based on my analysis of the code changes in src/codeweaver/engine/chunker/delimiter.py, here's my comprehensive review:

Summary

This PR introduces significant performance improvements to delimiter-based chunking by optimizing nesting level calculations and adding timeout checks. The changes address a critical O(n²) performance issue that caused timeouts on large files, particularly in Python 3.13.

Key Changes

1. Governor-Based Timeout Checks

  • Added governor parameter to _find_delimiter_matches()
  • Inserted timeout checks between delimiter matching phases
  • Assessment: ✅ Good defensive programming to prevent unbounded work

2. Precomputed Nesting Levels (Lines 485-490, 534-592)

  • Introduced _precompute_nesting_levels() method that computes brace nesting in O(n) instead of O(n*m)
  • Replaced per-keyword _calculate_nesting_level() calls with O(1) dictionary lookups
  • Assessment: ✅ Excellent optimization with clear performance benefit

3. Helper Methods

  • _toggle_string(): Manages string state transitions
  • _skip_nesting_comment(): Handles comment skipping during parsing
  • _adjust_brace_depth(): Updates brace depth counters
  • Assessment: ✅ Good refactoring for code clarity and reuse

Code Quality Analysis

Strengths:

  1. Well-documented: Excellent docstring explaining the O(n*m) → O(n+m) improvement
  2. Maintains correctness: The optimization doesn't change semantic behavior
  3. Clean separation: Helper methods are static where appropriate
  4. Consistent style: Follows existing code conventions

Concerns:

  1. Line 606-615: Comment Detection Logic

    if two_chars == "//" or c == "#":
    • ⚠️ Mixing the two_chars check with the single-char c check feels inconsistent
    • Recommendation: Consider a uniform check such as two_chars in ("//", "#") or similar
  2. Line 575-578: Escape Character Handling

    if c in ('"', "'", "`") and (i == 0 or content[i - 1] != "\\"):
    • ⚠️ Doesn't handle escaped backslashes (in \\" the quote should still be a boundary)
    • Recommendation: Consider handling consecutive backslashes
  3. Line 519: Dictionary Lookup

    nesting_level = nesting_at.get(keyword_pos, 0)
    • Positions should always be present in the dict if precomputation worked correctly
    • Recommendation: Consider an assertion in debug mode to catch logic errors
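
A minimal form of that guard (standalone sketch; nesting_at and keyword_pos mirror the names used in the diff):

```python
def lookup_nesting(nesting_at: dict[int, int], keyword_pos: int) -> int:
    # Fail loudly in development instead of silently defaulting to 0.
    assert keyword_pos in nesting_at, f"no precomputed nesting for {keyword_pos}"
    return nesting_at[keyword_pos]


print(lookup_nesting({10: 2}, 10))  # -> 2
```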

Performance Impact

  • Before: O(n * m) → quadratic scaling
  • After: O(n + m) → linear scaling
  • Impact: Massive improvement on large files with many keywords
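
To make the scaling difference concrete, a small self-contained comparison (simplified to bare brace counting, without the string/comment handling of the real code):

```python
import time


def per_position_depth(content: str, positions: list[int]) -> dict[int, int]:
    # Old shape, simplified: rescan from 0 for every position -> O(n * m).
    out = {}
    for p in positions:
        depth = 0
        for ch in content[:p]:
            depth += ch == "{"
            depth -= ch == "}"
        out[p] = depth
    return out


def single_pass_depth(content: str, positions: list[int]) -> dict[int, int]:
    # New shape, simplified: one walk recording depth at each position -> O(n + m).
    out: dict[int, int] = {}
    depth, idx, pos = 0, 0, sorted(positions)
    for i, ch in enumerate(content):
        while idx < len(pos) and pos[idx] <= i:
            out[pos[idx]] = depth
            idx += 1
        depth += ch == "{"
        depth -= ch == "}"
    for p in pos[idx:]:
        out[p] = depth
    return out


content = "{ x } " * 20000
positions = list(range(0, len(content), 600))
t0 = time.perf_counter()
a = per_position_depth(content, positions)
t1 = time.perf_counter()
b = single_pass_depth(content, positions)
t2 = time.perf_counter()
assert a == b
print(f"rescan: {t1 - t0:.3f}s  single pass: {t2 - t1:.3f}s")
```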

CI Status

  • The review workflow failed due to Gemini API quota exhaustion (HTTP 429), NOT code issues
  • Python 3.12/3.13/3.14 tests are still pending
  • Action needed: Wait for test results before merging

Recommendations

Before merging:

  1. ✅ Wait for Python test results to pass
  2. ⚠️ Add tests for escape sequence edge cases
  3. ⚠️ Consider adding performance regression tests

Future improvements:

  • Extract string/comment parsing to shared utility (DRY)
  • Add benchmarking to track performance
  • Consider caching precomputed results

Verdict

Conditional Approval ✅ (pending tests)

The code demonstrates solid engineering with a well-thought-out performance optimization. The implementation is clean, well-documented, and maintains correctness.

Once CI tests pass and the escape sequence concern is addressed, this is ready to merge.

Great work on this optimization! The precomputation approach is elegant and significantly improves performance on large files. 🚀

Contributor

@github-actions github-actions Bot left a comment


PR Review Summary

This PR addresses Python 3.13 CI timeout issues through performance optimizations in delimiter-based chunking. The changes are well-designed and implement solid algorithmic improvements, but the review comment is valid - additional test coverage would strengthen confidence in the timeout governance behavior.

What This PR Does Well ✅

1. Algorithmic Optimization (Lines 475-531)

The core improvement is excellent. The PR eliminates O(n×m) complexity by precomputing brace-nesting levels in a single O(n+m) pass:

# Before: _calculate_nesting_level called per keyword match (quadratic)
nesting_level = self._calculate_nesting_level(content, keyword_pos)

# After: Single forward pass precomputes all levels (linear)
nesting_at = self._precompute_nesting_levels(content, keyword_positions)
nesting_level = nesting_at.get(keyword_pos, 0)

This is a textbook performance optimization that directly addresses the root cause of timeouts on large files.

2. Clean Helper Functions (Lines 533-621)

The implementation is well-factored:

  • _precompute_nesting_levels: Main optimization logic
  • _toggle_string: String state tracking
  • _skip_nesting_comment: Comment handling
  • _adjust_brace_depth: Brace depth management

Each helper has a single, clear responsibility and improves readability.

3. Timeout Governance Threading (Lines 342-345, 370-373)

Threading the governor through _find_delimiter_matches and adding mid-phase timeout checks is the right approach:

  • Prevents unbounded work between explicit and keyword delimiter matching
  • Maintains consistent timeout behavior throughout the chunking pipeline
  • Uses keyword-only argument (*, governor) for clarity

Areas for Improvement 📝

Test Coverage Gap (Addressing Comment #3069768602)

The review comment is valid and important. While the optimization logic looks correct, the new timeout governance behavior lacks targeted regression tests. Recommended additions:

Test 1: Mid-Phase Timeout Check

def test_delimiter_matching_timeout_between_phases(mock_embedding_capability, chunker_settings):
    """Verify check_timeout is called between explicit and keyword delimiter phases."""
    from unittest.mock import patch, Mock
    from codeweaver.engine import DelimiterChunker, ResourceGovernor
    
    # Mock governor that tracks check_timeout calls
    mock_governor = Mock(spec=ResourceGovernor)
    
    chunker = DelimiterChunker(governor=mock_embedding_capability, settings=chunker_settings)
    content = """
function foo() { }
class Bar { }
def baz(): pass
"""
    
    # Call _find_delimiter_matches with mock governor
    with patch.object(chunker._chunk_governor, 'resource_governor', mock_governor):
        chunker._find_delimiter_matches(content, governor=mock_governor)
    
    # Verify check_timeout was called at least once (between phases)
    assert mock_governor.check_timeout.call_count >= 1

Test 2: Timeout Aborts Before Keyword Matching

def test_delimiter_matching_aborts_on_timeout_before_keyword_phase():
    """Verify timeout during explicit phase prevents keyword phase execution."""
    from unittest.mock import patch, Mock
    from codeweaver.engine import DelimiterChunker, ChunkingTimeoutError
    
    # Governor that times out between phases
    mock_governor = Mock()
    mock_governor.check_timeout.side_effect = ChunkingTimeoutError("Timeout")
    
    chunker = DelimiterChunker(...)
    content = "class Foo { }"  # Has keyword delimiter
    
    with pytest.raises(ChunkingTimeoutError):
        chunker._find_delimiter_matches(content, governor=mock_governor)
    
    # Keyword matching should not have been attempted
    # (can verify via instrumentation or side effects)

Code Quality Notes

  1. Type Hint Consistency (Line 344)
    governor: Any | None = None - Consider using the proper ResourceGovernor type instead of Any for better type safety, or document why Any is necessary.

  2. Early Exit Optimization (Lines 390-391)
    The early exit when all positions are recorded is good:

    if pos_idx >= len(sorted_positions):
        break  # All positions recorded

    Consider adding a comment explaining this prevents unnecessary iteration.

  3. Documentation (Lines 534-544)
    The docstring for _precompute_nesting_levels excellently explains the performance improvement and complexity reduction. This is great technical documentation.

Performance & Correctness ✅

  1. String/Comment Handling: Properly handles strings, comments, and escape sequences
  2. Brace Depth Safety: max(0, depth - 1) prevents negative depth
  3. Position Sorting: Correctly sorts positions before iteration
  4. Edge Cases: Handles empty position lists, positions beyond content length

Security Considerations ✅

No security concerns. The changes are purely computational optimizations for defensive tooling (code analysis). The timeout governance additions actually improve security by preventing resource exhaustion attacks via maliciously crafted input files.

Recommendation

Approve with minor request: The algorithmic improvements are excellent and should significantly reduce timeout issues in Python 3.13 CI. However, I recommend adding the suggested timeout governance tests before merging to:

  1. Prevent regression if the governor threading logic changes
  2. Document the expected timeout behavior
  3. Ensure CI timeouts remain detectable in future refactoring

The code is production-ready, but the additional test coverage would provide valuable insurance against future breakage.


Code Review Score: 8.5/10

  • Excellent performance optimization ⭐⭐⭐⭐⭐
  • Clean code structure ⭐⭐⭐⭐
  • Good documentation ⭐⭐⭐⭐
  • Missing targeted tests for new timeout behavior ⚠️

Suggested Action: Add timeout governance tests, then approve and merge.

Contributor

@github-actions github-actions Bot left a comment


Reply to comment #3069768602:

You're absolutely right about the test coverage gap. The timeout governance behavior is critical for preventing CI regressions, especially with Python 3.13.

I've detailed two specific test cases in my main review that would address this:

  1. Verify mid-phase timeout checks: Mock the governor and assert check_timeout() is invoked between the explicit and keyword delimiter matching phases
  2. Verify timeout aborts keyword matching: Simulate a timeout exception during the phase transition and confirm the keyword matching phase is never entered

These tests would:

  • Prevent future regressions if someone refactors the governor threading
  • Document the intended behavior for future maintainers
  • Catch issues in CI before they cause production timeouts

The pattern follows the existing test style in test_governance.py (lines 39-49) which already uses mocking to verify timeout behavior.

@bashandbone
Contributor

@copilot The PR comments focus on similar issues but conflict with one another in suggested implementation. I'd like you to review them all, assess which are likely to result in the cleanest, most maintainable and optimal outcome, and implement it.

Relevant discussion in:

Copilot AI and others added 2 commits April 12, 2026 17:23
Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/945ba2ae-3268-455f-bcba-0fbd07f80b26

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/945ba2ae-3268-455f-bcba-0fbd07f80b26

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 12, 2026 17:30
Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/945ba2ae-3268-455f-bcba-0fbd07f80b26

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Copilot AI requested review from Copilot and removed request for Copilot April 12, 2026 17:35
Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/945ba2ae-3268-455f-bcba-0fbd07f80b26

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Copilot AI requested review from Copilot and removed request for Copilot April 12, 2026 17:39
Contributor Author

Copilot AI commented Apr 12, 2026

@copilot The PR comments focus on similar issues but conflict with one another in suggested implementation. I'd like you to review them all, assess which are likely to result in the cleanest, most mai...

Updates applied: consolidated nesting parsing around shared helpers, streamlined keyword hit tracking to avoid a second regex scan while keeping memory lean, aligned _calculate_nesting_level with the shared path, added a governor timeout phase-order test using a recording chunker, and ignored/removed coverage artifacts. Commits: bf5751d, 335334a, 2729b53. No UI changes.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • cas-server.xethub.hf.co
    • Triggering command: /home/REDACTED/work/codeweaver/codeweaver/.venv/bin/pytest tests/ -v (dns block)
  • openaipublic.blob.core.windows.net
    • Triggering command: /home/REDACTED/work/codeweaver/codeweaver/.venv/bin/pytest tests/ -m "unit and not expensive and not requires_models" -v (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI changed the title from "Investigating infinite loop and potential fixes for Python 3.13 CI" to "Improve delimiter chunking performance and governor checks for Python 3.13 CI" on Apr 12, 2026
@bashandbone bashandbone merged commit 90d81cf into main Apr 12, 2026
14 of 15 checks passed
@bashandbone bashandbone deleted the copilot/investigate-infinite-loop-python-3-13 branch April 12, 2026 18:52