Improve delimiter chunking performance and governor checks for Python 3.13 CI #335
Conversation
Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/4434677c-67a3-47ee-82d6-dd1290f8b94c Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Reviewer's Guide

Threads an optional resource governor into delimiter matching and replaces per-keyword nesting recomputation with a single-pass precomputation to make keyword delimiter matching linear-time and more timeout-safe.

Sequence diagram for delimiter matching with resource governor and nesting precomputation

```mermaid
sequenceDiagram
    actor Governor
    participant Caller
    participant DelimiterChunker
    participant KeywordMatcher as _match_keyword_delimiters
    participant NestingPrecomp as _precompute_nesting_levels
    Caller->>DelimiterChunker: _get_matches_with_fallback(content, governor)
    DelimiterChunker->>Governor: check_timeout()
    DelimiterChunker->>DelimiterChunker: _find_delimiter_matches(content, governor)
    rect rgb(235, 245, 255)
        DelimiterChunker->>DelimiterChunker: _match_explicit_delimiters(content)
    end
    alt governor provided
        DelimiterChunker->>Governor: check_timeout()
    end
    rect rgb(235, 245, 255)
        DelimiterChunker->>KeywordMatcher: _match_keyword_delimiters(content, keyword_delimiters)
        KeywordMatcher->>NestingPrecomp: _precompute_nesting_levels(content, keyword_positions)
        NestingPrecomp-->>KeywordMatcher: nesting_at
        loop for each keyword match
            KeywordMatcher->>KeywordMatcher: lookup nesting_at[keyword_pos]
            KeywordMatcher->>KeywordMatcher: build DelimiterMatch using struct_end and nesting_level
        end
    end
    DelimiterChunker-->>Caller: matches
    opt no matches
        DelimiterChunker->>DelimiterChunker: _fallback_paragraph_chunking(content)
        DelimiterChunker-->>Caller: fallback matches
    end
```
Updated class diagram for delimiter keyword matching and nesting helpers

```mermaid
classDiagram
    class DelimiterChunker {
        +_get_matches_with_fallback(content: str, governor: Any) list~DelimiterMatch~
        +_enforce_chunk_limit(chunks: list~CodeChunk~, file_path: Path | None) void
        +_find_delimiter_matches(content: str, governor: Any | None) list~DelimiterMatch~
        +_match_keyword_delimiters(content: str, keyword_delimiters: list~Delimiter~) list~DelimiterMatch~
        +_precompute_nesting_levels(content: str, positions: list~int~) dict~int, int~
        +_calculate_nesting_level(content: str, pos: int) int
        +_toggle_string(c: str, in_string: bool, string_char: str | None) tuple~bool, str | None~
        +_skip_nesting_comment(content: str, i: int, c: str, content_len: int) int | None
        +_adjust_brace_depth(c: str, depth: int) int
    }
    class ResourceGovernor {
        +check_timeout() void
    }
    class DelimiterMatch
    class CodeChunk
    class Delimiter
    class Path
    DelimiterChunker ..> ResourceGovernor : uses
    DelimiterChunker ..> DelimiterMatch : creates
    DelimiterChunker ..> CodeChunk : manages
    DelimiterChunker ..> Delimiter : matches
    DelimiterChunker ..> Path : reads_path
```
🤖 Hi @Copilot, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

🤖 I'm sorry @Copilot, but I was unable to process your request. Please see the logs for more details.
Hey - I've found 1 issue, and left some high level feedback:

- The new string-tracking logic in `_precompute_nesting_levels` only checks `content[i - 1] != "\\"` to detect escaped quotes, which will mis-handle cases like `\\"` (even number of backslashes); consider using a parity check on consecutive backslashes so you don't incorrectly toggle `in_string` inside escaped quotes.
- You call `combined_pattern.finditer(content)` twice in `_match_keyword_delimiters` (once to build `keyword_positions` and once in the main loop); consider collecting the matches in a list so you can both precompute positions and iterate once, avoiding a second regex pass over the content.
- The new `governor` parameter on `_find_delimiter_matches` is typed as `Any | None`; if the governor interface is stable, tightening this to a protocol or concrete type with `check_timeout()` would make the contract clearer and prevent accidental misuse.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new string-tracking logic in `_precompute_nesting_levels` only checks `content[i - 1] != "\\"` to detect escaped quotes, which will mis-handle cases like `\\"` (even number of backslashes); consider using a parity check on consecutive backslashes so you don't incorrectly toggle `in_string` inside escaped quotes.
- You call `combined_pattern.finditer(content)` twice in `_match_keyword_delimiters` (once to build `keyword_positions` and once in the main loop); consider collecting the matches in a list so you can both precompute positions and iterate once, avoiding a second regex pass over the content.
- The new `governor` parameter on `_find_delimiter_matches` is typed as `Any | None`; if the governor interface is stable, tightening this to a protocol or concrete type with `check_timeout()` would make the contract clearer and prevent accidental misuse.
## Individual Comments
### Comment 1
<location path="src/codeweaver/engine/chunker/delimiter.py" line_range="485-489" />
<code_context>
for delimiter in keyword_delimiters:
delimiter_map.setdefault(delimiter.start, []).append(delimiter)
+ # Precompute brace-nesting levels at all keyword positions in a single
+ # O(n) forward pass. The previous approach called _calculate_nesting_level
+ # per keyword match, each scanning from position 0, resulting in O(n * m)
+ # total work that caused timeouts on large files (especially Python 3.13).
+ keyword_positions = [m.start() for m in combined_pattern.finditer(content)]
+ nesting_at = self._precompute_nesting_levels(content, keyword_positions)
+
</code_context>
<issue_to_address>
**suggestion (performance):** Avoid running the same regex twice by caching matches before computing positions and iterating.
`combined_pattern.finditer(content)` is currently called once to build `keyword_positions` and again in the main loop, doubling regex work on large files. Consider calling `finditer` once, storing the matches in a list, deriving `keyword_positions` from that list, and then reusing the same list when constructing `DelimiterMatch` instances to avoid the extra pass.
Suggested implementation:
```python
# Precompute brace-nesting levels at all keyword positions in a single
# O(n) forward pass. The previous approach called _calculate_nesting_level
# per keyword match, each scanning from position 0, resulting in O(n * m)
# total work that caused timeouts on large files (especially Python 3.13).
# Materialize matches once so we can reuse them both for position
# computation and when constructing DelimiterMatch instances.
keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]
nesting_at = self._precompute_nesting_levels(content, keyword_positions)
```
```python
for match in keyword_matches:
```
The second SEARCH block assumes the main keyword-matching loop literally uses `for match in combined_pattern.finditer(content):`. If the surrounding code differs (e.g. assigns the iterator to a variable first or uses a comprehension), adjust that SEARCH pattern to replace whatever second invocation of `combined_pattern.finditer(content)` is used to iterate over keyword matches with `keyword_matches` instead. Ensure that:
1. The declaration of `keyword_matches = list(combined_pattern.finditer(content))` is in the same scope and precedes any use of `keyword_matches`.
2. No other code path relies on `combined_pattern.finditer(content)` being a fresh iterator (in typical usage it does not, so replacing with the precomputed list is safe).
</issue_to_address>
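The suggested fix is easy to sanity-check in isolation: materializing the iterator once yields the same positions the main loop would see from a second pass. This is a generic `re` sketch, not the project's actual pattern; `combined_pattern` and `content` below are stand-ins.

```python
import re

# Hypothetical keyword pattern standing in for the chunker's combined_pattern
combined_pattern = re.compile(r"\b(?:function|class|def)\b")
content = "function foo() { } class Bar { } def baz(): pass"

# Single regex pass: materialize once, reuse for both needs
keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]

# Identical to what a second finditer() pass would produce, without rescanning
assert keyword_positions == [m.start() for m in combined_pattern.finditer(content)]
```

Since `re.Match` objects are plain values, holding them in a list is safe; nothing in the loop depends on the iterator being lazy.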
Code Review for PR #335
I've reviewed the performance optimization changes for delimiter-based chunking. This PR addresses Python 3.13 CI timeout issues by optimizing the keyword delimiter matching algorithm from O(n*m) to O(n+m) complexity. Overall, the changes are well-structured and achieve their goal, but there are several important improvements needed.
High Priority Issues
1. Escaped Quote Handling Bug ⚠️
Location: src/codeweaver/engine/chunker/delimiter.py:575
The escape detection logic in `_precompute_nesting_levels` is incorrect:

```python
if c in ('"', "'", "`") and (i == 0 or content[i - 1] != "\\"):
```

**Problem:** This fails for cases like `\\"` (even number of backslashes). The string `"test\\"` would incorrectly toggle the string state because `content[i-1]` is `\`, but this is actually an escaped backslash, not an escape for the quote.
**Fix:** Count consecutive backslashes to determine if the quote is truly escaped:

```python
def _is_quote_escaped(content: str, pos: int) -> bool:
    """Check if quote at position is escaped by counting backslashes."""
    if pos == 0:
        return False
    # Count consecutive backslashes before quote
    backslash_count = 0
    check_pos = pos - 1
    while check_pos >= 0 and content[check_pos] == "\\":
        backslash_count += 1
        check_pos -= 1
    # Odd number of backslashes means the quote is escaped
    return backslash_count % 2 == 1
```

Then use:

```python
if c in ('"', "'", "`") and not self._is_quote_escaped(content, i):
```

This same bug exists in other methods (`_calculate_nesting_level:647`, `_update_string_state:754`, `_process_string_in_matching:927`). All should be fixed consistently.
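As a quick sanity check, the parity rule can be exercised standalone. The `is_quote_escaped` below mirrors the proposed helper as a free function for illustration; it is not the project's code.

```python
def is_quote_escaped(content: str, pos: int) -> bool:
    """Quote at pos is escaped iff preceded by an odd number of backslashes."""
    backslash_count = 0
    check_pos = pos - 1
    while check_pos >= 0 and content[check_pos] == "\\":
        backslash_count += 1
        check_pos -= 1
    return backslash_count % 2 == 1

# 'a\"'  -> one backslash before the quote: the quote IS escaped
assert is_quote_escaped('a\\"', 2) is True
# 'a\\"' -> two backslashes (an escaped backslash): the quote is NOT escaped
assert is_quote_escaped('a\\\\"', 3) is False
# A quote at position 0 can never be escaped
assert is_quote_escaped('"', 0) is False
```

The naive `content[i - 1] != "\\"` check returns the wrong answer for the middle case, which is exactly the `"test\\"` scenario described above.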
2. Duplicate Regex Execution 🔄
Location: src/codeweaver/engine/chunker/delimiter.py:489-492
The code calls `combined_pattern.finditer(content)` twice:

- Line 489: To extract positions for `_precompute_nesting_levels`
- Line 492: To iterate and create matches

**Impact:** On large files, regex matching can be expensive, effectively doubling the regex work.

**Fix:** Materialize matches once:

```python
# Materialize matches once to avoid duplicate regex work
keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]
nesting_at = self._precompute_nesting_levels(content, keyword_positions)

for match in keyword_matches:  # Use materialized matches
    matched_text = match.group(0)
    # ... rest of loop
```

3. Type Safety: Governor Parameter 📝
Location: src/codeweaver/engine/chunker/delimiter.py:344
The governor parameter is typed as Any | None, which defeats type checking benefits.
**Current:**

```python
def _find_delimiter_matches(
    self, content: str, *, governor: Any | None = None
) -> list[DelimiterMatch]:
```

**Recommended:** Create a Protocol for the governor interface:

```python
from typing import Protocol

class TimeoutCheckable(Protocol):
    """Protocol for objects that support timeout checking."""

    def check_timeout(self) -> None: ...

def _find_delimiter_matches(
    self, content: str, *, governor: TimeoutCheckable | None = None
) -> list[DelimiterMatch]:
```

Looking at governance.py:33-106, `ResourceGovernor` has a `check_timeout()` method, so this protocol would be satisfied. This provides better type safety and makes the contract explicit.
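To illustrate the structural typing involved, a `runtime_checkable` variant of that sketch can be checked at runtime against any object exposing `check_timeout()`. The `DummyGovernor` below is a stand-in for illustration, not the project's `ResourceGovernor`.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class TimeoutCheckable(Protocol):
    """Structural interface: anything with check_timeout() conforms."""

    def check_timeout(self) -> None: ...

class DummyGovernor:
    """Conforms to the protocol without inheriting from it."""

    def check_timeout(self) -> None:
        pass

assert isinstance(DummyGovernor(), TimeoutCheckable)
assert not isinstance(object(), TimeoutCheckable)
```

Note that `runtime_checkable` only verifies method presence, not signatures; static checkers like mypy enforce the full signature.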
Positive Aspects ✅
- **Performance Impact:** The O(n*m) → O(n+m) optimization is excellent and directly addresses the timeout issue
- **Clean Helper Methods:** `_toggle_string`, `_skip_nesting_comment`, and `_adjust_brace_depth` are well-factored
- **Documentation:** The docstring for `_precompute_nesting_levels` clearly explains the optimization rationale
- **Timeout Checks:** Properly adding `governor.check_timeout()` between phases prevents unbounded work
Additional Recommendations
- **Test Coverage:** Ensure tests cover:
  - Escaped quotes: `"test\\"`, `'test\\\\'`, mixed escaping
  - Large files that previously timed out
  - Edge cases in nesting precomputation
- **Consider Multi-character String Delimiters:** The current implementation only handles single-character quotes. Languages like Python support `"""` (triple quotes). This could be a follow-up enhancement.
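As a rough illustration of that follow-up, a scanner can prefer the longest delimiter at each quote position. The `scan_string_regions` helper below is hypothetical and simplifies escape handling; it is a sketch of the idea, not part of this PR.

```python
def scan_string_regions(content: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of string literals, treating a triple
    quote as one delimiter. Hypothetical sketch with simplified escapes."""
    regions = []
    i = 0
    n = len(content)
    while i < n:
        c = content[i]
        if c in ('"', "'"):
            # Prefer the longest delimiter: triple quote over single
            delim = content[i : i + 3] if content[i : i + 3] == c * 3 else c
            start = i
            i += len(delim)
            while i < n:
                if content[i] == "\\":
                    i += 2  # skip the escaped character
                    continue
                if content.startswith(delim, i):
                    i += len(delim)
                    break
                i += 1
            regions.append((start, min(i, n)))
        else:
            i += 1
    return regions

# The single quote inside the triple-quoted span does not end the string
print(scan_string_regions('x = """a " b""" + "c"'))
```

The key design point is that the opening delimiter is captured once and the inner loop only searches for that exact delimiter, so a lone `"` inside a `"""..."""` region is inert.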
CI Status
The review job is currently failing. Tests for Python 3.12, 3.13, and 3.14 are pending. I recommend:
- Addressing the issues above
- Waiting for CI to complete to ensure the fixes resolve the timeout issues
Summary
The core optimization strategy is sound and should resolve the Python 3.13 performance issues. However, the escaped quote handling bug and duplicate regex execution should be fixed before merging to ensure correctness and maximize the performance gains.
Recommendation: Request changes to address the three high-priority issues before approval.
🤖 Generated with Claude Code
Pull request overview
This PR targets delimiter-based chunking performance and timeout behavior, motivated by Python 3.13 CI timeouts/infinite-loop symptoms, by introducing resource-governed execution and reducing keyword delimiter matching complexity.
Changes:
- Thread an optional `governor` into delimiter match discovery and add a mid-phase `check_timeout()` between explicit and keyword matching.
- Reduce keyword delimiter nesting computation from repeated rescans to a single forward pass with O(1) nesting lookup per keyword match.
- Add helper utilities to support the new single-pass nesting precomputation.
From the diff (src/codeweaver/engine/chunker/delimiter.py) — new single-pass nesting helpers:

```python
def _precompute_nesting_levels(
    self, content: str, positions: list[int]
) -> dict[int, int]:
    """Precompute brace-nesting levels at given positions in a single forward pass.

    Replaces per-position calls to ``_calculate_nesting_level`` which each
    scanned from position 0, yielding O(n * m) total work. This method
    achieves the same result in O(n + m) by walking the content once and
    recording the running brace depth at each requested position.

    Args:
        content: Source code
        positions: Character offsets whose nesting level is needed

    Returns:
        Mapping from position to nesting level (0 = top-level)
    """
    if not positions:
        return {}

    result: dict[int, int] = {}
    sorted_positions = sorted(positions)
    pos_idx = 0
    brace_depth = 0
    in_string = False
    string_char: str | None = None
    content_len = len(content)
    i = 0

    while i < content_len:
        # Record nesting level for every target position we have reached
        while pos_idx < len(sorted_positions) and sorted_positions[pos_idx] <= i:
            result[sorted_positions[pos_idx]] = brace_depth
            pos_idx += 1

        if pos_idx >= len(sorted_positions):
            break  # All positions recorded

        c = content[i]

        # Track string boundaries
        if c in ('"', "'", "`") and (i == 0 or content[i - 1] != "\\"):
            in_string, string_char = self._toggle_string(
                c, in_string=in_string, string_char=string_char
            )
        elif not in_string:
            skip_to = self._skip_nesting_comment(content, i, c, content_len)
            if skip_to is not None:
                i = skip_to
                continue
            brace_depth = self._adjust_brace_depth(c, brace_depth)

        i += 1

    # Any remaining positions beyond the end of content
    for p in sorted_positions[pos_idx:]:
        result[p] = brace_depth

    return result

@staticmethod
def _toggle_string(
    c: str, *, in_string: bool, string_char: str | None
) -> tuple[bool, str | None]:
    """Toggle string state for quote character."""
    if not in_string:
        return True, c
    if c == string_char:
        return False, None
    return in_string, string_char

@staticmethod
def _skip_nesting_comment(content: str, i: int, c: str, content_len: int) -> int | None:
    """Return new index if a comment starts at *i*, else None."""
    two_chars = content[i : i + 2]
    if two_chars == "//" or c == "#":
        nl = content.find("\n", i)
        return nl if nl >= 0 else content_len
    if two_chars == "/*":
        end = content.find("*/", i + 2)
        return end + 2 if end >= 0 else content_len
    return None

@staticmethod
def _adjust_brace_depth(c: str, depth: int) -> int:
    """Adjust brace depth for open/close brace characters."""
    if c == "{":
        return depth + 1
    if c == "}":
        return max(0, depth - 1)
    return depth
```

In `_match_keyword_delimiters`, the per-keyword rescan becomes an O(1) lookup:

```diff
@@ -501,8 +515,8 @@ def _match_keyword_delimiters(
         )

         if struct_end is not None:
-            # Calculate nesting level by counting parent structures
-            nesting_level = self._calculate_nesting_level(content, keyword_pos)
+            # Look up precomputed nesting level (O(1) per keyword)
+            nesting_level = nesting_at.get(keyword_pos, 0)

             # Create a complete match from keyword to closing structure
             # This represents the entire construct (e.g., function...})
```

The call site precomputes nesting before the main match loop:

```diff
+        # Precompute brace-nesting levels at all keyword positions in a single
+        # O(n) forward pass. The previous approach called _calculate_nesting_level
+        # per keyword match, each scanning from position 0, resulting in O(n * m)
+        # total work that caused timeouts on large files (especially Python 3.13).
+        keyword_positions = [m.start() for m in combined_pattern.finditer(content)]
+        nesting_at = self._precompute_nesting_levels(content, keyword_positions)
+
         for match in combined_pattern.finditer(content):
```

And `_find_delimiter_matches` gains a mid-phase timeout check:

```diff
@@ -364,6 +367,10 @@ def _find_delimiter_matches(self, content: str) -> list[DelimiterMatch]:
         # Phase 1: Handle explicit start/end pairs (existing logic)
         matches.extend(self._match_explicit_delimiters(content, explicit_delimiters))

+        # Check timeout between phases to avoid unbounded work
+        if governor is not None:
+            governor.check_timeout()
+
         # Phase 2: Handle keyword delimiters with empty ends
         matches.extend(self._match_keyword_delimiters(content, keyword_delimiters))
```
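The governor hook exercised in the diff above can be sketched in isolation with a minimal deadline-based implementation. `SimpleGovernor` and this `ChunkingTimeoutError` are illustrative stand-ins, not the project's classes.

```python
import time

class ChunkingTimeoutError(RuntimeError):
    """Illustrative timeout error; the name is borrowed from the review discussion."""

class SimpleGovernor:
    """Minimal deadline-based governor: check_timeout() raises once the budget is spent."""

    def __init__(self, budget_s: float) -> None:
        self._deadline = time.monotonic() + budget_s

    def check_timeout(self) -> None:
        if time.monotonic() > self._deadline:
            raise ChunkingTimeoutError("chunking time budget exceeded")

# Phase-boundary usage, mirroring the diff: a cheap check between expensive phases
governor = SimpleGovernor(budget_s=60.0)
governor.check_timeout()  # within budget: returns silently
```

Because `check_timeout()` is just a comparison against a precomputed deadline, calling it between phases adds negligible overhead while bounding the worst-case wall time of the whole matching pipeline.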
Code Review Summary

I've reviewed PR #335 which aims to investigate and fix an infinite loop issue affecting Python 3.13 CI. The changes focus on performance optimizations in the delimiter-based chunking system.

Positive Aspects

- **Performance Optimization:** The PR successfully addresses O(n * m) complexity by precomputing brace-nesting levels in a single O(n + m) forward pass via `_precompute_nesting_levels()`.
- **Timeout Management:** Adding `governor.check_timeout()` between phases (lines 370-372) prevents unbounded work on large files.
- **Code Organization:** New helper methods (`_toggle_string`, `_skip_nesting_comment`, `_adjust_brace_depth`) improve code readability and maintainability.
- **Well-Documented:** The implementation includes clear comments explaining the algorithm and complexity improvements.

Issue to Address

**Double Regex Scan** (as noted in comment #3069768586):

Lines 489-492 execute `combined_pattern.finditer(content)` twice:

- Once to build `keyword_positions`
- Again in the main `for` loop

For very large files, this adds an extra full regex scan. The fix is straightforward:

```python
# Materialize matches once to avoid double regex scan
keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]
nesting_at = self._precompute_nesting_levels(content, keyword_positions)

for match in keyword_matches:
```

This change would maintain the performance benefits while eliminating the redundant regex execution.

Other Observations

- **CI Status:** The `review / review` check is currently failing. The tests for Python 3.12, 3.13, and 3.14 are pending.
- **Code Correctness:** The algorithm logic appears sound - the helper methods correctly handle:
  - String state tracking with escape handling
  - Comment skipping (single-line `//` and `#`, multi-line `/* */`)
  - Brace depth management
- **Edge Cases:** The implementation handles edge cases like positions beyond content end (line 589).

Recommendation

Please address the double regex scan issue mentioned above. Once that's fixed and the CI checks pass (particularly the Python 3.13 tests), this PR should be ready to merge.

The performance improvements are well-designed and should significantly reduce timeout issues on large files.
Great catch on the double regex scan at lines 489-492. This is indeed a performance issue that should be fixed. The solution is to materialize the matches once:

```python
keyword_matches = list(combined_pattern.finditer(content))
keyword_positions = [m.start() for m in keyword_matches]
nesting_at = self._precompute_nesting_levels(content, keyword_positions)

for match in keyword_matches:
```

This maintains the O(n + m) optimization while avoiding the redundant regex scan on large inputs.
Code Review - PR #335: Performance Optimization for Delimiter-Based Chunking

I've reviewed the changes in this PR which address the infinite loop issue and improve performance for Python 3.13. Here's my comprehensive analysis:

Summary of Changes

The PR implements a significant performance optimization by precomputing brace-nesting levels in a single O(n) pass instead of repeatedly scanning from position 0 for each keyword match. This transforms the algorithm from O(n * m) to O(n + m) complexity.

Key changes:

Positive Aspects

✅ **Excellent performance improvement** - The O(n * m) → O(n + m) optimization will dramatically reduce processing time for large files with many keywords

✅ **Proper timeout handling** - Passing the governor through to `_find_delimiter_matches`

✅ **Well-documented** - The docstrings clearly explain the optimization rationale and complexity improvements

✅ **Maintains correctness** - The precomputation achieves the same result as the previous per-position scanning approach

Critical Issue: Code Duplication
Performance Optimization Review
Based on my analysis of the code changes in src/codeweaver/engine/chunker/delimiter.py, here's my comprehensive review:
Summary
This PR introduces significant performance improvements to delimiter-based chunking by optimizing nesting level calculations and adding timeout checks. The changes address a critical O(n²) performance issue that caused timeouts on large files, particularly in Python 3.13.
Key Changes
1. Governor-Based Timeout Checks

- Added `governor` parameter to `_find_delimiter_matches()`
- Inserted timeout checks between delimiter matching phases
- **Assessment:** ✅ Good defensive programming to prevent unbounded work

2. Precomputed Nesting Levels (Lines 485-490, 534-592)

- Introduced `_precompute_nesting_levels()` method that computes brace nesting in O(n) instead of O(n*m)
- Replaced per-keyword `_calculate_nesting_level()` calls with O(1) dictionary lookups
- **Assessment:** ✅ Excellent optimization with clear performance benefit

3. Helper Methods

- `_toggle_string()`: Manages string state transitions
- `_skip_nesting_comment()`: Handles comment skipping during parsing
- `_adjust_brace_depth()`: Updates brace depth counters
- **Assessment:** ✅ Good refactoring for code clarity and reuse
Code Quality Analysis
Strengths:
- ✅ Well-documented: Excellent docstring explaining the O(n*m) → O(n+m) improvement
- ✅ Maintains correctness: The optimization doesn't change semantic behavior
- ✅ Clean separation: Helper methods are static where appropriate
- ✅ Consistent style: Follows existing code conventions
Concerns:
1. **Line 606-615: Comment Detection Logic**

   ```python
   if two_chars == "//" or c == "#":
   ```

   ⚠️ Mixing `two_chars` and single char `c` checks feels inconsistent
   Recommendation: Consider `two_chars in ("/", "##")` or similar

2. **Line 575-578: Escape Character Handling**

   ```python
   if c in ('"', "'", "`") and (i == 0 or content[i - 1] != "\\"):
   ```

   ⚠️ Doesn't handle escaped backslashes (`\\"` should still be a boundary)
   Recommendation: Consider handling consecutive backslashes

3. **Line 519: Dictionary Lookup**

   ```python
   nesting_level = nesting_at.get(keyword_pos, 0)
   ```

   - Positions should always be in dict if precomputation worked correctly
   - Recommendation: Consider assertion in debug mode to catch logic errors
Performance Impact
- Before: O(n * m) → quadratic scaling
- After: O(n + m) → linear scaling
- Impact: Massive improvement on large files with many keywords
CI Status
- The review workflow failed due to Gemini API quota exhaustion (HTTP 429), NOT code issues
- Python 3.12/3.13/3.14 tests are still pending
- Action needed: Wait for test results before merging
Recommendations
Before merging:
- ✅ Wait for Python test results to pass
- ⚠️ Add tests for escape sequence edge cases
- ⚠️ Consider adding performance regression tests
Future improvements:
- Extract string/comment parsing to shared utility (DRY)
- Add benchmarking to track performance
- Consider caching precomputed results
Verdict
Conditional Approval ✅ (pending tests)
The code demonstrates solid engineering with a well-thought-out performance optimization. The implementation is clean, well-documented, and maintains correctness.
Once CI tests pass and the escape sequence concern is addressed, this is ready to merge.
Great work on this optimization! The precomputation approach is elegant and significantly improves performance on large files. 🚀
PR Review Summary
This PR addresses Python 3.13 CI timeout issues through performance optimizations in delimiter-based chunking. The changes are well-designed and implement solid algorithmic improvements, but the review comment is valid - additional test coverage would strengthen confidence in the timeout governance behavior.
What This PR Does Well ✅
1. Algorithmic Optimization (Lines 475-531)
The core improvement is excellent. The PR eliminates O(n×m) complexity by precomputing brace-nesting levels in a single O(n+m) pass:
```python
# Before: _calculate_nesting_level called per keyword match (quadratic)
nesting_level = self._calculate_nesting_level(content, keyword_pos)

# After: Single forward pass precomputes all levels (linear)
nesting_at = self._precompute_nesting_levels(content, keyword_positions)
nesting_level = nesting_at.get(keyword_pos, 0)
```

This is a textbook performance optimization that directly addresses the root cause of timeouts on large files.
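To make the before/after concrete, here is a stripped-down standalone comparison (braces only, no string or comment handling; these free functions are illustrative, not the project's methods). It checks that one forward pass reproduces the per-position rescan results.

```python
def naive_nesting(content: str, pos: int) -> int:
    # O(pos) rescan from the start for every query (the "before" shape)
    depth = 0
    for c in content[:pos]:
        if c == "{":
            depth += 1
        elif c == "}":
            depth = max(0, depth - 1)
    return depth

def precompute_nesting(content: str, positions: list[int]) -> dict[int, int]:
    # Single forward pass; record the running depth on passing each offset
    result: dict[int, int] = {}
    targets = sorted(positions)
    idx = 0
    depth = 0
    for i, c in enumerate(content):
        while idx < len(targets) and targets[idx] <= i:
            result[targets[idx]] = depth
            idx += 1
        if c == "{":
            depth += 1
        elif c == "}":
            depth = max(0, depth - 1)
    for p in targets[idx:]:  # positions at or beyond end of content
        result[p] = depth
    return result

src = "fn a() { if x { y } } fn b() { }"
pts = [0, 9, 16, 22]
assert precompute_nesting(src, pts) == {p: naive_nesting(src, p) for p in pts}
```

With m query positions over n characters, the dict comprehension on the right costs O(n * m) while the left side costs O(n + m log m), which is the same asymptotic win the PR claims.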
2. Clean Helper Functions (Lines 533-621)
The implementation is well-factored:
- `_precompute_nesting_levels`: Main optimization logic
- `_toggle_string`: String state tracking
- `_skip_nesting_comment`: Comment handling
- `_adjust_brace_depth`: Brace depth management
Each helper has a single, clear responsibility and improves readability.
3. Timeout Governance Threading (Lines 342-345, 370-373)
Threading the governor through _find_delimiter_matches and adding mid-phase timeout checks is the right approach:
- Prevents unbounded work between explicit and keyword delimiter matching
- Maintains consistent timeout behavior throughout the chunking pipeline
- Uses a keyword-only argument (`*, governor`) for clarity
Areas for Improvement 📝
Test Coverage Gap (Addressing Comment #3069768602)
The review comment is valid and important. While the optimization logic looks correct, the new timeout governance behavior lacks targeted regression tests. Recommended additions:
Test 1: Mid-Phase Timeout Check
```python
def test_delimiter_matching_timeout_between_phases(chunker_settings):
    """Verify check_timeout is called between explicit and keyword delimiter phases."""
    from unittest.mock import Mock

    from codeweaver.engine import DelimiterChunker, ResourceGovernor

    # Mock governor that tracks check_timeout calls
    mock_governor = Mock(spec=ResourceGovernor)
    chunker = DelimiterChunker(settings=chunker_settings)
    content = """
    function foo() { }
    class Bar { }
    def baz(): pass
    """
    # Call _find_delimiter_matches with the mock governor
    chunker._find_delimiter_matches(content, governor=mock_governor)

    # Verify check_timeout was called at least once (between phases)
    assert mock_governor.check_timeout.call_count >= 1
```
Test 2: Timeout Aborts Before Keyword Matching
```python
def test_delimiter_matching_aborts_on_timeout_before_keyword_phase():
    """Verify timeout during explicit phase prevents keyword phase execution."""
    from unittest.mock import Mock

    import pytest

    from codeweaver.engine import ChunkingTimeoutError, DelimiterChunker

    # Governor that times out between phases
    mock_governor = Mock()
    mock_governor.check_timeout.side_effect = ChunkingTimeoutError("Timeout")

    chunker = DelimiterChunker(...)
    content = "class Foo { }"  # Has keyword delimiter

    with pytest.raises(ChunkingTimeoutError):
        chunker._find_delimiter_matches(content, governor=mock_governor)

    # Keyword matching should not have been attempted
    # (can verify via instrumentation or side effects)
```
Code Quality Notes
- Type Hint Consistency (Line 344): `governor: Any | None = None`. Consider using the proper `ResourceGovernor` type instead of `Any` for better type safety, or document why `Any` is necessary.
- Early Exit Optimization (Lines 390-391): the early exit when all positions are recorded is good (`if pos_idx >= len(sorted_positions): break  # All positions recorded`). Consider adding a comment explaining that this prevents unnecessary iteration.
- Documentation (Lines 534-544): the docstring for `_precompute_nesting_levels` excellently explains the performance improvement and complexity reduction. This is great technical documentation.
Performance & Correctness ✅
- String/Comment Handling: properly handles strings, comments, and escape sequences
- Brace Depth Safety: `max(0, depth - 1)` prevents negative depth
- Position Sorting: correctly sorts positions before iteration
- Edge Cases: handles empty position lists and positions beyond content length
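On the escape-sequence point specifically, the subtlety is that a quote preceded by a backslash must not close the string, while a doubled backslash must not escape the following quote. A minimal sketch of string-state tracking with an explicit escape flag (illustrative only; the PR organizes this as per-character helpers such as `_toggle_string`):

```python
def scan_string_state(content: str) -> list[bool]:
    """Return, for each index, whether that character is part of a string literal."""
    inside: list[bool] = []
    quote: str | None = None  # active quote character, or None
    escaped = False
    for ch in content:
        in_str = quote is not None
        if escaped:
            # Previous char was a backslash inside a string: consume this
            # char without interpreting it (handles \" and \\ correctly)
            escaped = False
        elif quote is not None:
            if ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None   # unescaped matching quote closes the string
                in_str = True  # the closing quote itself is part of the literal
        elif ch in ("'", '"'):
            quote = ch  # opening quote
            in_str = True
        inside.append(in_str)
    return inside
```

A nesting scanner can then ignore any brace at an index where this state is true, so braces inside strings never corrupt the depth count.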
Security Considerations ✅
No security concerns. The changes are purely computational optimizations for defensive tooling (code analysis). The timeout governance additions actually improve security by preventing resource exhaustion attacks via maliciously crafted input files.
Recommendation
Approve with minor request: The algorithmic improvements are excellent and should significantly reduce timeout issues in Python 3.13 CI. However, I recommend adding the suggested timeout governance tests before merging to:
- Prevent regression if the governor threading logic changes
- Document the expected timeout behavior
- Ensure CI timeouts remain detectable in future refactoring
The code is production-ready, but the additional test coverage would provide valuable insurance against future breakage.
Code Review Score: 8.5/10
- Excellent performance optimization ⭐⭐⭐⭐⭐
- Clean code structure ⭐⭐⭐⭐
- Good documentation ⭐⭐⭐⭐
- Missing targeted tests for new timeout behavior ⚠️
Suggested Action: Add timeout governance tests, then approve and merge.
Reply to comment #3069768602:
You're absolutely right about the test coverage gap. The timeout governance behavior is critical for preventing CI regressions, especially with Python 3.13.
I've detailed two specific test cases in my main review that would address this:
- Verify mid-phase timeout checks: mock the governor and assert `check_timeout()` is invoked between the explicit and keyword delimiter matching phases
- Verify timeout aborts keyword matching: simulate a timeout exception during the phase transition and confirm the keyword matching phase is never entered
These tests would:
- Prevent future regressions if someone refactors the governor threading
- Document the intended behavior for future maintainers
- Catch issues in CI before they cause production timeouts
The pattern follows the existing test style in `test_governance.py` (lines 39-49), which already uses mocking to verify timeout behavior.
@copilot The PR comments focus on similar issues but conflict with one another in suggested implementation. I'd like you to review them all, assess which are likely to result in the cleanest, most maintainable and optimal outcome, and implement it. Relevant discussion in:
Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/945ba2ae-3268-455f-bcba-0fbd07f80b26 Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Updates applied: consolidated nesting parsing around shared helpers, streamlined keyword hit tracking to avoid a second regex scan while keeping memory lean, aligned _calculate_nesting_level with the shared path, added a governor timeout phase-order test using a recording chunker, and ignored/removed coverage artifacts. Commits: bf5751d, 335334a, 2729b53. No UI changes.
Pull request created by AI Agent
Summary by Sourcery
Improve delimiter-based chunking performance and timeout behavior when scanning large source files.
Enhancements: