adding resample benchmark objects #226
base: main
Conversation
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
| Category | Issue | Fix Detected |
|---|---|---|
| | Missing Task Name Error Handling ▹ view | |
| | Hardcoded Path Exposure ▹ view | |
| | Inefficient DataFrame Loading and Processing ▹ view | |
| | Threshold Value Validation Missing ▹ view | |
Files scanned
| File Path | Reviewed |
|---|---|
| src/agentlab/experiments/custom_benchmark.py | ✅ |
Explore our documentation to understand the languages and file types we support and the files we ignore.
Need a new review? Comment `/korbit-review` on this PR and I'll review your latest changes.

Korbit Guide: Usage and Customization
Interacting with Korbit
- You can manually ask Korbit to review your PR using the `/korbit-review` command in a comment at the root of your PR.
- You can ask Korbit to generate a new PR description using the `/korbit-generate-pr-description` command in any comment on your PR.
- Too many Korbit comments? I can resolve all my comment threads if you use the `/korbit-resolve` command in any comment on your PR.
- Chat with Korbit on issues we post by tagging @korbit-ai in your reply.
- Help train Korbit to improve your reviews by giving a 👍 or 👎 on the comments Korbit posts.

Customizing Korbit
- Check out our docs on how you can make Korbit work best for you and your team.
- Customize Korbit for your organization through the Korbit Console.
Current Korbit Configuration
General Settings

| Setting | Value |
|---|---|
| Review Schedule | Automatic excluding drafts |
| Max Issue Count | 10 |
| Automatic PR Descriptions | ✅ |

Issue Categories

| Category | Enabled |
|---|---|
| Documentation | ✅ |
| Logging | ✅ |
| Error Handling | ✅ |
| Readability | ✅ |
| Design | ✅ |
| Performance | ✅ |
| Security | ✅ |
| Functionality | ✅ |

Feedback and Support
Note
Korbit Pro is free for open source projects 🎉
Looking to add Korbit to your team? Get started with a free 2 week trial here
    def evaluate(self, study: Study, env_args_list):
        result_df = load_result_df(study.dir)
        return dict(result_df.groupby("env.task_name")["cum_reward"].std())
Missing Task Name Error Handling 
What is the issue?
The evaluate method in HighVarianceBenchmark may fail if a task_name in env_args_list doesn't exist in the result_df from the study.
Why this matters
This will cause a KeyError when trying to access non-existent task names in the select method, potentially crashing the benchmark creation.
Suggested change ∙ Feature Preview
Add error handling to safely handle missing task names:
    def evaluate(self, study: Study, env_args_list):
        result_df = load_result_df(study.dir)
        std_dict = dict(result_df.groupby("env.task_name")["cum_reward"].std())
        # Return 0 variance for missing tasks to exclude them
        return {task.task_name: std_dict.get(task.task_name, 0) for task in env_args_list}
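Note that the `.get(..., 0)` default only excludes missing tasks if the selection step drops tasks at or below the threshold. A minimal sketch of that counterpart, assuming `HighVarianceBenchmark` keeps tasks whose reward std exceeds `self.threshold` (the `select` method itself is not shown in this hunk):

```python
# Hypothetical select counterpart: keep only the EnvArgs whose task reward std,
# as returned by evaluate, exceeds the benchmark threshold.
def select(self, values, env_args_list):
    return [ea for ea in env_args_list if values.get(ea.task_name, 0) > self.threshold]
```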
    if __name__ == "__main__":
        exp_dir = Path("/home/t/agentlab_results/2025-02-26_10-15-04_genericagent-gpt-4o-mini-2024-07-18-on-miniwob-tiny-test")
Hardcoded Path Exposure 
What is the issue?
Hardcoded file path in the main script block that may expose sensitive directory structure information.
Why this matters
Exposing internal directory structures can help attackers map the system layout and potentially identify access points for exploitation.
Suggested change ∙ Feature Preview
Move the path to a configuration file or environment variable:
from os import getenv
from dotenv import load_dotenv
load_dotenv()
exp_dir = Path(getenv('EXPERIMENT_DIR'))
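One refinement to this sketch (the `EXPERIMENT_DIR` name follows the suggestion above and is an assumption, not something defined in the repo): `getenv` returns `None` when the variable is unset, and `Path(None)` raises a bare `TypeError`, so it is worth failing fast with an actionable message:

```python
import os
from pathlib import Path

# Assumed variable name; fail fast if it is missing instead of letting Path(None) blow up.
exp_dir_str = os.getenv("EXPERIMENT_DIR")
if exp_dir_str is None:
    raise RuntimeError("EXPERIMENT_DIR is not set; point it at your agentlab results directory.")
exp_dir = Path(exp_dir_str)
```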
        result_df = load_result_df(study.dir)
        return dict(result_df.groupby("env.task_name")["cum_reward"].std())
Inefficient DataFrame Loading and Processing 
What is the issue?
Loading and processing the entire result DataFrame for each evaluation is inefficient, especially when dealing with large datasets.
Why this matters
This approach requires loading the complete dataset into memory and performing groupby operations each time evaluate() is called, which can be memory-intensive and slow for large experiment results.
Suggested change ∙ Feature Preview
Cache the processed results or pass pre-computed statistics to avoid reloading and recomputing. Consider implementing as:
    def __init__(self, exp_dir: Path, threshold: float = 0.2):
        self._cached_stats = None
        super().__init__(exp_dir=exp_dir, threshold=threshold)

    def _compute_stats(self, study: Study):
        if self._cached_stats is None:
            result_df = load_result_df(study.dir)
            self._cached_stats = dict(result_df.groupby("env.task_name")["cum_reward"].std())
        return self._cached_stats

    def evaluate(self, study: Study, env_args_list):
        return self._compute_stats(study)
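A caveat with this sketch: the cache is keyed to the benchmark instance rather than to the study, so calling `evaluate` with a different study would return stale numbers. A minimal variant, reusing the names above (still an assumption, not code from the PR), keys the cache by the study directory:

```python
# Hypothetical variant: key the cache by study.dir so a different study reloads the results.
def _compute_stats(self, study: Study):
    if getattr(self, "_cached_key", None) != study.dir:
        result_df = load_result_df(study.dir)
        self._cached_stats = dict(result_df.groupby("env.task_name")["cum_reward"].std())
        self._cached_key = study.dir
    return self._cached_stats
```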
@dataclass
class HighVarianceBenchmark(ResampleBenchmark):
    threshold: float = 0.2
Threshold Value Validation Missing 
What is the issue?
The hardcoded threshold value of 0.2 for variance selection lacks validation to ensure it's a reasonable value.
Why this matters
Invalid threshold values (negative or extremely high) could lead to unintended task filtering behavior.
Suggested change ∙ Feature Preview
Add threshold validation in `__post_init__`:
    def __post_init__(self):
        if not 0 <= self.threshold <= 1:
            raise ValueError(f"Threshold must be between 0 and 1, got {self.threshold}")
        super().__post_init__()
        super().__post_init__()

    @abstractmethod
    def evaluate(self, study, env_args_list):
simplify this
class StochasticHighVarianceBenchmark(ResampleBenchmark):
    regulation_threshold: float = 0.1
    total_seeds = 600
    min_seeds = 2
    random_seed = 42

    def evaluate(self, study: Study, env_args_list):
        result_df = load_result_df(study.dir)
        var = result_df.groupby("env.task_name")["cum_reward"].var()
        probs = dict((var + self.regulation_threshold) / (var + self.regulation_threshold).sum())
        return probs

    def select(self, values, env_args_list: list[EnvArgs]):
        selected_env_args = []
        max_steps = env_args_list[0].max_steps
        for task_name, p in values.items():
            # ceil to avoid missing seeds
            n_seeds = np.random.RandomState(self.random_seed).poisson(p * self.total_seeds)
            n_seeds = max(n_seeds, self.min_seeds)
            for seed in np.random.RandomState(self.random_seed).randint(0, 2**32, n_seeds):
                selected_env_args.append(
                    EnvArgs(
                        task_name=task_name,
                        task_seed=int(seed),
                        max_steps=max_steps,
                        headless=True,
                        record_video=False,
                        wait_for_user_message=False,
                        viewport=None,
                        slow_mo=None,
                        storage_state=None,
                        task_kwargs=None,
                    )
                )
        return selected_env_args
`probs` is more like a proportion of the benchmark allocated to this task. `regulation_threshold` and `min_seeds` kind of act the same way, in a soft and a hard version respectively. It could be interesting to play around with the regulation threshold. `total_seeds` gives control over the size of the benchmark.
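To make the allocation concrete, here is a small self-contained sketch with made-up variances (task names and numbers are illustrative, not taken from this PR): `regulation_threshold` softens the proportions so low-variance tasks still get some budget, `total_seeds` sets the expected seed count per task, and `min_seeds` then acts as a hard floor on the Poisson draw.

```python
import pandas as pd

# Illustrative per-task reward variances; the real ones come from
# result_df.groupby("env.task_name")["cum_reward"].var()
var = pd.Series({"click-button": 0.00, "enter-text": 0.05, "use-spinner": 0.25})

regulation_threshold = 0.1
total_seeds = 600

# Same formula as in the diff: regularized proportions that sum to 1.
probs = (var + regulation_threshold) / (var + regulation_threshold).sum()
print(probs.round(3))                 # roughly 0.167, 0.250, 0.583
print((probs * total_seeds).round())  # expected seeds ~100, 150, 350 before the Poisson draw
```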
Description by Korbit AI

What change is being made?

Add resample benchmark objects `ResampleBenchmark`, `AllTasksBenchmark`, and `HighVarianceBenchmark` to the `custom_benchmark.py` file.

Why are these changes being made?

These changes introduce a new set of benchmark classes to support different evaluation strategies within the existing experimental framework. The `ResampleBenchmark` class provides a base implementation for resampling experiments, while `AllTasksBenchmark` and `HighVarianceBenchmark` implement specific evaluation and selection strategies for use cases that require comprehensive task inclusion and high variance filtering, respectively. This extensibility allows tailored analysis in experimental results handling.
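As a rough sketch of how the pieces fit together (an assumption based only on the `evaluate`/`select` signatures shown in the diff, not an API documented by this PR), a resampled run could be derived from a finished study like this:

```python
# Hypothetical helper, not part of the PR: turn a finished study into a resampled
# list of EnvArgs using any ResampleBenchmark subclass (e.g. HighVarianceBenchmark).
def resample_from_study(benchmark, study, env_args_list):
    values = benchmark.evaluate(study, env_args_list)  # per-task statistic, e.g. reward std
    return benchmark.select(values, env_args_list)     # EnvArgs selected for the next run
```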