adding resample benchmark objects #226


Draft · wants to merge 2 commits into main

Conversation

@TLSDC (Collaborator) commented Feb 26, 2025

Description by Korbit AI

What change is being made?

Add resample benchmark objects ResampleBenchmark, AllTasksBenchmark, and HighVarianceBenchmark to the custom_benchmark.py file.

Why are these changes being made?

These changes introduce a new set of benchmark classes to support different evaluation strategies within the existing experimental framework. ResampleBenchmark provides a base implementation for resampling experiments, while AllTasksBenchmark and HighVarianceBenchmark implement specific evaluation and selection strategies for use cases that require comprehensive task inclusion and high-variance filtering, respectively. This extensibility allows tailored analysis when handling experimental results.
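
For orientation, here is a minimal sketch of the shared interface these classes appear to implement. The evaluate/select methods and the dataclass structure are taken from the snippets quoted in the review below; the ABC base and the resample helper are assumptions added only to show how the two methods fit together, not part of the PR.

    from abc import ABC, abstractmethod
    from dataclasses import dataclass

    @dataclass
    class ResampleBenchmark(ABC):
        """Sketch: build a new benchmark by re-scoring tasks from a finished Study."""

        @abstractmethod
        def evaluate(self, study, env_args_list):
            """Return a per-task score, e.g. {task_name: value}, computed from the study's results."""

        @abstractmethod
        def select(self, values, env_args_list):
            """Return the list of EnvArgs to run next, based on the per-task scores."""

        # Hypothetical convenience wrapper (not in the PR), shown only to
        # illustrate how evaluate() and select() compose.
        def resample(self, study, env_args_list):
            return self.select(self.evaluate(study, env_args_list), env_args_list)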

Is this description stale? Ask me to generate a new description by commenting /korbit-generate-pr-description

@korbit-ai bot left a comment

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
  • Functionality — Missing Task Name Error Handling ▹ view
  • Security — Hardcoded Path Exposure ▹ view
  • Performance — Inefficient DataFrame Loading and Processing ▹ view
  • Readability — Threshold Value Validation Missing ▹ view

Files scanned:
  • src/agentlab/experiments/custom_benchmark.py


Comment on lines +78 to +80
def evaluate(self, study: Study, env_args_list):
    result_df = load_result_df(study.dir)
    return dict(result_df.groupby("env.task_name")["cum_reward"].std())

Missing Task Name Error Handling (category: Functionality)

What is the issue?

The evaluate method in HighVarianceBenchmark may fail if a task_name in env_args_list doesn't exist in the result_df from the study.

Why this matters

This will cause a KeyError when trying to access non-existent task names in the select method, potentially crashing the benchmark creation.

Suggested change

Add error handling to safely handle missing task names:

def evaluate(self, study: Study, env_args_list):
    result_df = load_result_df(study.dir)
    std_dict = dict(result_df.groupby("env.task_name")["cum_reward"].std())
    # Return 0 variance for missing tasks to exclude them
    return {task.task_name: std_dict.get(task.task_name, 0) for task in env_args_list}




if __name__ == "__main__":
    exp_dir = Path("/home/t/agentlab_results/2025-02-26_10-15-04_genericagent-gpt-4o-mini-2024-07-18-on-miniwob-tiny-test")

Hardcoded Path Exposure (category: Security)

What is the issue?

Hardcoded file path in the main script block that may expose sensitive directory structure information.

Why this matters

Exposing internal directory structures can help attackers map the system layout and potentially identify access points for exploitation.

Suggested change

Move the path to a configuration file or environment variable:

from os import getenv
from dotenv import load_dotenv

load_dotenv()
exp_dir = Path(getenv('EXPERIMENT_DIR'))
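
One caveat with the snippet above: if EXPERIMENT_DIR is unset, getenv returns None and Path(None) raises a TypeError. A hedged variant with an explicit fallback (the default path below is illustrative, not from the PR):

    import os
    from pathlib import Path

    # Fall back to a relative default when EXPERIMENT_DIR is not set (placeholder default)
    exp_dir = Path(os.getenv("EXPERIMENT_DIR", "./agentlab_results"))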


Comment on lines +79 to +80
    result_df = load_result_df(study.dir)
    return dict(result_df.groupby("env.task_name")["cum_reward"].std())

Inefficient DataFrame Loading and Processing (category: Performance)

What is the issue?

Loading and processing the entire result DataFrame for each evaluation is inefficient, especially when dealing with large datasets.

Why this matters

This approach requires loading the complete dataset into memory and performing groupby operations each time evaluate() is called, which can be memory-intensive and slow for large experiment results.

Suggested change

Cache the processed results or pass pre-computed statistics to avoid reloading and recomputing. Consider implementing as:

def __init__(self, exp_dir: Path, threshold: float = 0.2):
    self._cached_stats = None
    super().__init__(exp_dir=exp_dir, threshold=threshold)

def _compute_stats(self, study: Study):
    if self._cached_stats is None:
        result_df = load_result_df(study.dir)
        self._cached_stats = dict(result_df.groupby("env.task_name")["cum_reward"].std())
    return self._cached_stats

def evaluate(self, study: Study, env_args_list):
    return self._compute_stats(study)
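
Since HighVarianceBenchmark is declared with @dataclass in the diff, an alternative to hand-writing __init__ is to keep the cache in a non-init field. A minimal sketch under that assumption (it also assumes one Study per benchmark instance, so the cache never needs invalidation):

    from dataclasses import dataclass, field

    @dataclass
    class HighVarianceBenchmark(ResampleBenchmark):
        threshold: float = 0.2
        # Cache excluded from the generated __init__ and repr
        _cached_stats: dict | None = field(default=None, init=False, repr=False)

        def evaluate(self, study: Study, env_args_list):
            if self._cached_stats is None:
                result_df = load_result_df(study.dir)
                self._cached_stats = dict(result_df.groupby("env.task_name")["cum_reward"].std())
            return self._cached_stats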



@dataclass
class HighVarianceBenchmark(ResampleBenchmark):
    threshold: float = 0.2

Threshold Value Validation Missing (category: Readability)

What is the issue?

The hardcoded threshold value of 0.2 for variance selection lacks validation to ensure it's a reasonable value.

Why this matters

Invalid threshold values (negative or extremely high) could lead to unintended task filtering behavior.

Suggested change

Add threshold validation in __post_init__:

def __post_init__(self):
    if not 0 <= self.threshold <= 1:
        raise ValueError(f"Threshold must be between 0 and 1, got {self.threshold}")
    super().__post_init__()
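
With that check in place, an out-of-range value fails at construction time (illustrative call; the exp_dir argument is taken from the earlier caching suggestion, not from the diff itself):

    HighVarianceBenchmark(exp_dir=Path("results/"), threshold=1.5)
    # ValueError: Threshold must be between 0 and 1, got 1.5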


@TLSDC marked this pull request as draft on February 26, 2025 16:18
        super().__post_init__()

    @abstractmethod
    def evaluate(self, study, env_args_list):
@TLSDC (Collaborator, Author) commented:

simplify this

Comment on lines +93 to +127
class StochasticHighVarianceBenchmark(ResampleBenchmark):
    regulation_threshold: float = 0.1
    total_seeds = 600
    min_seeds = 2
    random_seed = 42

    def evaluate(self, study: Study, env_args_list):
        result_df = load_result_df(study.dir)
        var = result_df.groupby("env.task_name")["cum_reward"].var()
        probs = dict((var + self.regulation_threshold) / (var + self.regulation_threshold).sum())
        return probs

    def select(self, values, env_args_list: list[EnvArgs]):
        selected_env_args = []
        max_steps = env_args_list[0].max_steps
        for task_name, p in values.items():
            # ceil to avoid missing seeds
            n_seeds = np.random.RandomState(self.random_seed).poisson(p * self.total_seeds)
            n_seeds = max(n_seeds, self.min_seeds)
            for seed in np.random.RandomState(self.random_seed).randint(0, 2**32, n_seeds):
                selected_env_args.append(
                    EnvArgs(
                        task_name=task_name,
                        task_seed=int(seed),
                        max_steps=max_steps,
                        headless=True,
                        record_video=False,
                        wait_for_user_message=False,
                        viewport=None,
                        slow_mo=None,
                        storage_state=None,
                        task_kwargs=None,
                    )
                )
        return selected_env_args
@TLSDC (Collaborator, Author) commented:

probs is more like the proportion of the benchmark allocated to each task.

regulation_threshold and min_seeds act in a similar way, as a soft and a hard floor respectively. It could be interesting to play around with the regulation threshold.

total_seeds gives control over the size of the benchmark.
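
In other words, each task's share p is (var + regulation_threshold) normalized over all tasks, and its seed count is drawn from Poisson(p * total_seeds) with min_seeds as a hard floor. A small numeric sketch of the expected allocation, using made-up variances (not from the PR):

    import numpy as np

    # Hypothetical per-task reward variances (illustrative only)
    var = {"click-button": 0.00, "enter-text": 0.09, "use-slider": 0.25}
    regulation_threshold, total_seeds, min_seeds = 0.1, 600, 2

    total = sum(v + regulation_threshold for v in var.values())
    for task, v in var.items():
        p = (v + regulation_threshold) / total
        expected = max(int(np.round(p * total_seeds)), min_seeds)
        print(task, round(p, 3), expected)
    # click-button 0.156 94   <- zero-variance tasks still get seeds via the soft floor
    # enter-text   0.297 178
    # use-slider   0.547 328

The select() in the diff draws n_seeds stochastically from the Poisson rather than taking the expectation, so actual counts vary around these numbers.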
