feature/parallel runners #3
Conversation
Walkthrough

The changes introduce support for parallel scenario evaluation runs by adding a `parallel_runs` configuration option.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant UI
    participant ConfigScreen
    participant ScenarioRunner
    participant ScenarioEvaluationService
    participant LLMService
    User->>UI: Load App
    UI->>ConfigScreen: Show config (includes parallel_runs slider)
    User->>ConfigScreen: Set parallel_runs and save config
    ConfigScreen->>UI: Update shared state with parallel_runs
    User->>ScenarioRunner: Click "Run" to evaluate scenarios
    ScenarioRunner->>ScenarioRunner: Split scenarios into batches (parallel_runs)
    loop For each batch
        ScenarioRunner->>ScenarioEvaluationService: Start async worker for batch
        ScenarioEvaluationService->>ScenarioRunner: Yield status/chat updates
    end
    ScenarioRunner->>ScenarioEvaluationService: Evaluate prompt injection scenarios asynchronously
    ScenarioEvaluationService->>ScenarioRunner: Yield prompt injection evaluation updates
    ScenarioRunner->>ScenarioRunner: Merge all EvaluationResults
    ScenarioRunner->>LLMService: Generate summary from combined results
    ScenarioRunner->>UI: Update UI with results, summary, and status per run
```
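The diagram describes a fan-out/fan-in pattern: batches are handed to async workers that push updates onto a shared queue, and the runner drains the queue until every worker reports completion. Below is a minimal, self-contained sketch of that pattern; all names (`evaluate_batch`, `worker`, the string statuses) are illustrative assumptions, not the actual `scenario_runner.py` code.

```python
import asyncio

async def evaluate_batch(batch: list[str]) -> list[str]:
    # Stand-in for ScenarioEvaluationService: pretend each scenario takes a moment.
    await asyncio.sleep(0.1)
    return [f"result for {s}" for s in batch]

async def run_parallel(scenarios: list[str], parallel_runs: int) -> list[str]:
    # Split scenarios into roughly equal batches, one per worker.
    n = max(1, min(parallel_runs, len(scenarios)))
    batches = [scenarios[i::n] for i in range(n)]
    queue: asyncio.Queue = asyncio.Queue()

    async def worker(batch: list[str], worker_id: int) -> None:
        results = await evaluate_batch(batch)
        await queue.put((worker_id, "results", results))
        await queue.put((worker_id, "done", None))

    tasks = [asyncio.create_task(worker(b, i)) for i, b in enumerate(batches)]

    combined: list[str] = []
    finished = 0
    while finished < len(tasks):
        worker_id, update_type, data = await queue.get()
        if update_type == "results":
            combined.extend(data)  # merge partial results as they arrive
        elif update_type == "done":
            finished += 1          # stop once every worker has signalled

    await asyncio.gather(*tasks)   # surface any worker exceptions
    return combined

if __name__ == "__main__":
    print(asyncio.run(run_parallel([f"scenario {i}" for i in range(7)], parallel_runs=3)))
```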
Pull Request Overview
This PR introduces parallel scenario evaluation to improve testing speed and UI responsiveness by running multiple evaluation tasks concurrently. Key changes include adding a parallel_runs configuration in the config screen and AgentConfig, refactoring the scenario runner UI and logic to manage multiple concurrent evaluation tasks, and updating the evaluation service to aggregate results from parallel runners.
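For context, the `parallel_runs` setting described above could look roughly like the following in a Pydantic config model. This is a hedged sketch assuming `AgentConfig` is a Pydantic `BaseModel`; it is not a copy of the actual `rogue/models/config.py`.

```python
from pydantic import BaseModel, Field

class AgentConfig(BaseModel):
    # ... existing configuration fields ...

    # Number of scenario evaluations to run concurrently.
    # Defaults to 1 so existing single-run behaviour is preserved;
    # the UI slider caps this at 10 (MAX_PARALLEL_RUNS).
    parallel_runs: int = Field(default=1, ge=1, le=10)
```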
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| rogue/ui/components/scenario_runner.py | Updated to implement parallel evaluation with batched scenarios and multiple UI output groups. |
| rogue/ui/components/config_screen.py | Added a new slider for parallel_runs and updated configuration logic. |
| rogue/ui/app.py | Adjusted to pass the parallel_runs value through the app components. |
| rogue/services/scenario_evaluation_service.py | Updated to aggregate evaluation results using the new combine method. |
| rogue/models/evaluation_result.py | Revised the combine method in EvaluationResults for merging results. |
| rogue/models/config.py | Introduced the parallel_runs field in AgentConfig. |
| parallel_evaluation_plan.md | Added documentation for the new parallel evaluation architecture. |
| .pre-commit-config.yaml | Minor formatting configuration changes for pre-commit hooks. |
Comments suppressed due to low confidence (1)
.pre-commit-config.yaml:15
- [nitpick] Removing the language_version for the Black hook may lead to inconsistent formatting if different environments use varying Python versions. It might be beneficial to specify the version to ensure consistency.
- id: black
Actionable comments posted: 1
♻️ Duplicate comments (1)
rogue/ui/components/scenario_runner.py (1)
158-161: Store and await worker tasks for proper task management.

The current implementation creates tasks but doesn't store or await them, which can lead to poor error propagation and potential task loss.

Apply this fix to properly manage the worker tasks:

```diff
-[
-    asyncio.create_task(worker(batch, i))
-    for i, batch in enumerate(scenario_batches)
-]
+worker_tasks = [
+    asyncio.create_task(worker(batch, i))
+    for i, batch in enumerate(scenario_batches)
+]
```

Consider adding task cleanup after the main processing loop:

```python
# After line 190, add:
# Cancel any remaining tasks in case of early termination
for task in worker_tasks:
    if not task.done():
        task.cancel()
# Await all tasks to ensure proper cleanup
await asyncio.gather(*worker_tasks, return_exceptions=True)
```
🧹 Nitpick comments (2)
.pre-commit-config.yaml (1)
21-24: Optional: tighten Bandit execution

Bandit will run with its default profile (no `-lll`). If you already maintain a customised `bandit.yml` or wish to fail the hook only on high-severity issues, consider pinning the configuration under the hook:

```diff
   - id: bandit
+    args: ["-c", "path/to/bandit.yml", "-lll"]
```

Not critical, but makes the security scan predictable across developers.
rogue/ui/components/config_screen.py (1)
83-90: Add user guidance for parallel runs configuration.

The parallel runs slider is well-implemented, but consider adding explanatory text to help users understand what this setting does and potential resource implications of higher values.

Consider adding guidance similar to other settings:

```diff
 parallel_runs = gr.Slider(
     label="Number of parallel evaluation runs",
     minimum=1,
     maximum=10,
     step=1,
     value=config_data.get("parallel_runs", 1),
 )
+gr.Markdown(
+    "Controls how many scenario evaluations run concurrently. "
+    "Higher values may improve performance but consume more resources."
+)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite
📒 Files selected for processing (7)
- `.pre-commit-config.yaml` (1 hunks)
- `rogue/models/config.py` (1 hunks)
- `rogue/models/evaluation_result.py` (1 hunks)
- `rogue/services/scenario_evaluation_service.py` (1 hunks)
- `rogue/ui/app.py` (3 hunks)
- `rogue/ui/components/config_screen.py` (6 hunks)
- `rogue/ui/components/scenario_runner.py` (2 hunks)
🔇 Additional comments (7)
.pre-commit-config.yaml (1)
17-20: Verify that mypy still picks up your project configuration

The `args: ["--config-file=..."]` stanza was removed. If your codebase keeps its type-checker settings exclusively in `pyproject.toml`/`mypy.ini`, nothing breaks. However, if the config file lives elsewhere (e.g. `configs/mypy.ini`) or you rely on extra CLI flags, the hook will now run with mypy's defaults and may flood CI with false positives.

Would you double-check by running `pre-commit run mypy --all-files` locally and confirm it honours the intended configuration? If it doesn't, re-add the explicit `args` key:

```diff
   - id: mypy
+    args: ["--config-file=path/to/mypy.ini"]
```

rogue/models/config.py (1)
22-22: LGTM! Clean addition of parallel runs configuration.

The new `parallel_runs` field is well-implemented with an appropriate default value of 1, ensuring backward compatibility and safe defaults.

rogue/services/scenario_evaluation_service.py (1)
78-84: Nit: Remove redundant `all_results` assignment

- The `all_results = self._results` line is unnecessary; you can reference `self._results` directly in both the write and the final yield.
- I searched the codebase for any consumers expecting a `"results"` status from `evaluate_scenarios` and found none. Changing the final yield to `"done"` is safe and won't break existing usage.

No further action required.
rogue/ui/app.py (1)
42-42: LGTM! Consistent integration of parallel_runs parameter.

The parallel_runs parameter is properly integrated into the UI state management following the same pattern as other configuration parameters.
Also applies to: 118-118, 148-148
rogue/ui/components/config_screen.py (1)
215-239: LGTM! Proper integration into configuration save/load logic.

The parallel_runs parameter is correctly integrated into the configuration saving, validation, and component state management.
Also applies to: 294-320
rogue/models/evaluation_result.py (1)
24-32: Ensure simplified result handling covers all use cases

Briefly, searches for external calls to the old `combine` class method and for scenario-based uses of `add_result` returned no hits. However, absence of references doesn't guarantee that no parts of the codebase (or tests) rely on recursive merging or deduplication. Please manually verify:

- rogue/models/evaluation_result.py (lines 24–32):
  - `add_result`: confirm no existing logic elsewhere depends on automatic merging or deduplication.
  - `combine`: ensure no downstream modules expect recursive flattening of nested `EvaluationResults`.
- All test suites: verify that duplicate results aren't being introduced or that existing tests cover parallel/duplicate-scenario evaluations.
- Any data-processing pipelines or consumers of `EvaluationResults`: check for assumptions about result uniqueness or structure.
rogue/ui/components/scenario_runner.py (1)

20-26: Well-implemented batch splitting function.

The function correctly handles edge cases and uses numpy's array_split for even distribution of scenarios across batches.
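As a rough illustration of the batching approach discussed here, a function like the following splits a scenario list into at most `n` non-empty batches using `np.array_split`. The function name and exact guard clauses are assumptions rather than the reviewed code.

```python
import numpy as np

def split_scenarios_into_batches(scenarios: list, n: int) -> list[list]:
    # Guard against empty input or a non-positive batch count.
    if not scenarios or n <= 0:
        return []
    # array_split distributes items as evenly as possible; drop empty batches
    # that appear when n exceeds the number of scenarios.
    return [arr.tolist() for arr in np.array_split(scenarios, n) if arr.size > 0]

# Example: 7 scenarios across 3 workers -> batches of sizes 3, 2, 2.
print(split_scenarios_into_batches(list(range(7)), 3))
```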
Co-authored-by: Copilot <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (2)
rogue/ui/components/scenario_runner.py (2)
158-162: LGTM: Proper use of asyncio.gather for concurrent execution.

The implementation correctly uses `asyncio.gather()` to run worker tasks concurrently, which addresses the previous review suggestion.

28-28: Breaking change: Function signature modified.

This is a duplicate of the previous review comment. The return signature was changed from 3 to 2 values, requiring updates to callers.
Also applies to: 220-220
🧹 Nitpick comments (2)
rogue/ui/components/scenario_runner.py (2)
20-25: Handle edge case for empty batch results.

The function correctly validates inputs, but consider the edge case where `np.array_split` might create empty arrays for very small scenario lists with large `n` values.

```diff
-    return [arr.tolist() for arr in np.array_split(scenarios, n) if arr.size > 0]
+    batches = [arr.tolist() for arr in np.array_split(scenarios, n) if arr.size > 0]
+    return batches if batches else [scenarios] if scenarios else []
```
169-191: Add timeout protection for queue processing.

The queue processing loop could potentially hang if a worker fails to send the "done" signal. Consider adding a timeout mechanism.

```diff
 while finished_workers < num_runners:
-    worker_id, update_type, data = await update_queue.get()
+    try:
+        worker_id, update_type, data = await asyncio.wait_for(
+            update_queue.get(), timeout=300  # 5 minute timeout
+        )
+    except asyncio.TimeoutError:
+        logger.error("Timeout waiting for worker updates")
+        break
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite
📒 Files selected for processing (1)
- `rogue/ui/components/scenario_runner.py` (2 hunks)
🔇 Additional comments (3)
rogue/ui/components/scenario_runner.py (3)
2-3: LGTM: Required imports for parallel execution.

The asyncio and numpy imports are appropriately added to support the new parallel execution functionality.

17-18: ✅ MAX_PARALLEL_RUNS aligns with UI slider limit

Verified that the `parallel_runs` slider in `rogue/ui/components/config_screen.py` is configured with `maximum=10`, matching `MAX_PARALLEL_RUNS = 10`. No further changes required.

185-186: EvaluationResults.combine modifies self in-place; no changes required

The `combine` method in `rogue/models/evaluation_result.py` extends `self.results` without returning a new instance, so calling `all_results.combine(data)` correctly merges results in-place.
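To illustrate the in-place merge behaviour described above, a `combine` along these lines extends the existing list rather than building a new object. The `results` field name follows the comment; everything else here is an assumption about the model, not the actual `evaluation_result.py`.

```python
from pydantic import BaseModel, Field

class EvaluationResult(BaseModel):
    # Hypothetical fields for illustration only.
    scenario: str
    passed: bool

class EvaluationResults(BaseModel):
    results: list[EvaluationResult] = Field(default_factory=list)

    def combine(self, other: "EvaluationResults") -> None:
        # Mutates self: append the other container's results, return nothing.
        self.results.extend(other.results)

# Usage: all_results accumulates the per-worker results in place.
all_results = EvaluationResults()
all_results.combine(EvaluationResults(results=[EvaluationResult(scenario="s1", passed=True)]))
print(len(all_results.results))  # 1
```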
Actionable comments posted: 1
♻️ Duplicate comments (4)
rogue/ui/components/scenario_runner.py (4)
37-59: Magic number calculations reduce maintainability.

The indexing logic using `i * 3`, `i * 3 + 1`, etc. makes the code hard to maintain and was flagged in previous reviews.

Consider defining a constant like `COMPONENTS_PER_RUN = 3` and using structured indexing to improve code clarity.
94-98: Repeated magic number calculations are error-prone.

The UI update logic with hardcoded index calculations is repeated throughout the code and was flagged in previous reviews.

Extract this logic into helper functions to improve maintainability and reduce the risk of indexing errors.
219-219: Function signature change affects callers.

The return signature changed from `(scenarios_display, status_box, run_button)` to `(scenarios_display, run_button)`, which was flagged in previous reviews as affecting callers in `rogue/ui/app.py`.

Ensure all callers are updated to handle the new signature.
158-161: Critical: Tasks created but not properly managed.

The list comprehension creates asyncio tasks but doesn't assign them to a variable or await them properly. This can lead to tasks running without proper lifecycle management and potential resource leaks.

Apply this fix to properly manage the worker tasks:

```diff
-[
-    asyncio.create_task(worker(batch, i))
-    for i, batch in enumerate(scenario_batches)
-]
+worker_tasks = [
+    asyncio.create_task(worker(batch, i))
+    for i, batch in enumerate(scenario_batches)
+]
+
+# Start all tasks concurrently
+try:
+    await asyncio.gather(*worker_tasks, return_exceptions=True)
+except Exception as e:
+    logger.error(f"Error in worker tasks: {e}")
+    # Cancel remaining tasks
+    for task in worker_tasks:
+        if not task.done():
+            task.cancel()
```
🧹 Nitpick comments (1)
rogue/ui/components/scenario_runner.py (1)
17-17: Consider making MAX_PARALLEL_RUNS configurable.

While a constant is good for setting an upper limit, consider whether this should be configurable via environment variables or configuration files for different deployment scenarios.
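One way to do that, as a hedged sketch (the environment variable name is an assumption), is to read the cap from the environment with the current constant as the fallback:

```python
import os

# Falls back to the current hard-coded limit when the variable is unset or invalid.
try:
    MAX_PARALLEL_RUNS = max(1, int(os.environ.get("ROGUE_MAX_PARALLEL_RUNS", "10")))
except ValueError:
    MAX_PARALLEL_RUNS = 10
```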
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite
📒 Files selected for processing (1)
- `rogue/ui/components/scenario_runner.py` (2 hunks)
🔇 Additional comments (2)
rogue/ui/components/scenario_runner.py (2)
2-3: LGTM: Required imports for parallel execution.

The addition of `asyncio` and `numpy` imports is appropriate for the new parallel execution functionality.

20-25: LGTM: Well-implemented batch splitting function.

The function correctly handles edge cases (empty list, non-positive n) and uses numpy for efficient array splitting. The filtering of empty arrays ensures no worker gets an empty batch.
```python
visibility_updates[i * 3] = gr.update(visible=True)
yield tuple(visibility_updates)

# 3. --- Define and Run Worker Tasks ---
```
This logic should move to the scenario evaluation service (most of the logic in this file)
let's go for another pr on that, this one is getting a bit too long
* --wip-- [skip ci]
* --wip-- [skip ci]
* --wip-- [skip ci]
* --wip-- [skip ci]
* Some fixes
* mypy fixes

---------

Co-authored-by: Yuval <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (3)
rogue/ui/components/scenario_runner.py (3)
113-116: Address magic number calculations in UI component indexing.

The code still uses magic numbers (`i * 3`, `i * 3 + 1`, `i * 3 + 2`) for component indexing, making it hard to maintain.

```diff
+COMPONENTS_PER_RUN = 3  # group, status_box, chat_display
+GROUP_IDX_OFFSET = 0
+STATUS_IDX_OFFSET = 1
+CHAT_IDX_OFFSET = 2
+
+def get_component_indices(worker_id):
+    base = worker_id * COMPONENTS_PER_RUN
+    return base + GROUP_IDX_OFFSET, base + STATUS_IDX_OFFSET, base + CHAT_IDX_OFFSET

 # Then use throughout:
-initial_updates[i * 3] = gr.update(visible=False)
-initial_updates[i * 3 + 1] = gr.update(value="", visible=True)
-initial_updates[i * 3 + 2] = gr.update(value=None, visible=True)
+group_idx, status_idx, chat_idx = get_component_indices(i)
+initial_updates[group_idx] = gr.update(visible=False)
+initial_updates[status_idx] = gr.update(value="", visible=True)
+initial_updates[chat_idx] = gr.update(value=None, visible=True)
```

Also applies to: 182-182, 188-190, 195-197
167-170: Critical: Worker tasks are not awaited.

The tasks are created but not stored or awaited, causing the code to proceed without waiting for completion.

```diff
-[
-    asyncio.create_task(worker(batch, i))
-    for i, batch in enumerate(scenario_batches)
-]
+worker_tasks = [
+    asyncio.create_task(worker(batch, i))
+    for i, batch in enumerate(scenario_batches)
+]
+# The tasks will be awaited implicitly through the queue processing loop
```
177-198: Add timeout protection to queue processing.

The loop could hang indefinitely if a worker crashes before sending "done".

```diff
 while finished_workers < num_runners:
-    worker_id, update_type, data = await update_queue.get()
+    try:
+        worker_id, update_type, data = await asyncio.wait_for(
+            update_queue.get(), timeout=300  # 5 minutes timeout
+        )
+    except asyncio.TimeoutError:
+        logger.error("Timeout waiting for worker updates")
+        break
```
🧹 Nitpick comments (4)
rogue/services/scenario_evaluation_service.py (1)
20-20: Consider making MAX_SAMPLES configurable.

The hardcoded limit of 10 samples might be too restrictive for comprehensive testing. Consider making this configurable through the service constructor or configuration.
rogue/prompt_injection_evaluator/run_prompt_injection_evaluator.py (3)
54-75: Consider improving data part handling.

When handling data parts, `json.dumps` might produce verbose output. Consider adding a size limit or summary representation for large data objects.

```diff
 elif p.root.kind == "data":
-    text += json.dumps(p.root.data)
+    data_str = json.dumps(p.root.data)
+    if len(data_str) > 1000:
+        text += f"[Data object: {len(data_str)} chars]"
+    else:
+        text += data_str
```
94-94: Address TODO: Use Pydantic model for response format.

Using a Pydantic model would provide better type safety and validation for the LLM response.

Would you like me to generate a Pydantic model for the response format and update the completion call?
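As a purely illustrative sketch of what that could look like, the model and field names below (`PromptInjectionVerdict`, `injection_succeeded`, `reasoning`) are hypothetical and would need to match the evaluator's actual response format:

```python
import json
from pydantic import BaseModel

class PromptInjectionVerdict(BaseModel):
    # Hypothetical fields; align these with the evaluator's actual response format.
    injection_succeeded: bool
    reasoning: str

# The raw LLM output could then be validated instead of parsed ad hoc:
raw = '{"injection_succeeded": false, "reasoning": "The agent refused the injected instruction."}'
verdict = PromptInjectionVerdict(**json.loads(raw))
print(verdict.injection_succeeded)
```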
165-165: Optimize logging with lazy formatting.

Avoid using f-strings in logging statements for better performance. The logger can format the message lazily.

```diff
-logger.info(f"Response: {response}", extra={"response": response})
+logger.info("Response: %s", response, extra={"response": response})
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite
📒 Files selected for processing (9)
- `rogue/__init__.py` (1 hunks)
- `rogue/common/remote_agent_connection.py` (3 hunks)
- `rogue/models/config.py` (2 hunks)
- `rogue/models/prompt_injection.py` (1 hunks)
- `rogue/prompt_injection_evaluator/__init__.py` (1 hunks)
- `rogue/prompt_injection_evaluator/run_prompt_injection_evaluator.py` (1 hunks)
- `rogue/services/llm_service.py` (2 hunks)
- `rogue/services/scenario_evaluation_service.py` (3 hunks)
- `rogue/ui/components/scenario_runner.py` (3 hunks)
✅ Files skipped from review due to trivial changes (3)
- rogue/__init__.py
- rogue/prompt_injection_evaluator/__init__.py
- rogue/models/prompt_injection.py
🚧 Files skipped from review as they are similar to previous changes (2)
- rogue/services/llm_service.py
- rogue/models/config.py
🧰 Additional context used
🧠 Learnings (2)
rogue/ui/components/scenario_runner.py (1)
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/taskmaster.mdc:0-0
Timestamp: 2025-06-23T12:36:58.787Z
Learning: Taskmaster's AI-powered tools (parse_prd, analyze_project_complexity, update_subtask, update_task, update, expand_all, expand_task, add_task) may take up to a minute to complete due to AI processing; always inform users to wait during these operations.
rogue/common/remote_agent_connection.py (1)
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/tech-stack.mdc:0-0
Timestamp: 2025-06-23T12:37:02.487Z
Learning: Use httpx for making HTTP requests.
🔇 Additional comments (3)
rogue/common/remote_agent_connection.py (1)
1-1: LGTM! Good use of explicit type aliases.

Using `TypeAlias` makes the type aliases explicit and improves code clarity. This is the recommended approach for Python 3.10+.

Also applies to: 32-45, 47-48
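For reference, an explicit alias declaration along these lines is what the comment refers to; the alias names and targets here are illustrative, not the ones in `remote_agent_connection.py`:

```python
from collections.abc import Callable
from typing import TypeAlias

# Explicit aliases are understood by type checkers and read better than bare assignments.
Headers: TypeAlias = dict[str, str]
UpdateCallback: TypeAlias = Callable[[str, dict], None]
```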
rogue/prompt_injection_evaluator/run_prompt_injection_evaluator.py (1)
24-51: LGTM! Well-structured evaluation prompt.

The prompt template clearly defines the evaluation criteria and expected output format.
rogue/ui/components/scenario_runner.py (1)
18-38: LGTM! Well-implemented batch splitting logic.

The function correctly handles edge cases including empty lists, remainder distribution, and avoids empty batches.
Actionable comments posted: 0
🧹 Nitpick comments (1)
rogue/evaluator_agent/run_evaluator_agent.py (1)
92-92: LGTM! Clean replacement of local function with centralized utility.

The use of `get_auth_header` correctly replaces the local header construction logic with the centralized implementation.

For consistency, consider aligning the type annotation for the `auth_credentials` parameter with the imported function:

```diff
-    auth_credentials: str | None,
+    auth_credentials: Optional[str],
```

Both `str | None` and `Optional[str]` are functionally equivalent, but using consistent typing across related functions improves readability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite
📒 Files selected for processing (2)
- `rogue/evaluator_agent/run_evaluator_agent.py` (2 hunks)
- `rogue/models/config.py` (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- rogue/models/config.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
rogue/evaluator_agent/run_evaluator_agent.py (1)
rogue/models/config.py (2)
- `AuthType` (7-11)
- `get_auth_header` (40-52)
🔇 Additional comments (1)
rogue/evaluator_agent/run_evaluator_agent.py (1)
18-18: LGTM! Good refactoring to centralize authentication logic.

The import of `get_auth_header` aligns with the centralization of authentication header construction logic.
Summary by CodeRabbit
New Features
Refactor
Style
Chores
Enhancements