adding resample benchmark objects #226


Draft · wants to merge 2 commits into main

Conversation

@TLSDC (Collaborator) commented Feb 26, 2025

Description by Korbit AI

What change is being made?

Add resample benchmark objects ResampleBenchmark, AllTasksBenchmark, and HighVarianceBenchmark to the custom_benchmark.py file.

Why are these changes being made?

These changes introduce a new set of benchmark classes to support different evaluation strategies within the existing experimental framework. ResampleBenchmark provides a base implementation for resampling experiments, while AllTasksBenchmark and HighVarianceBenchmark implement specific evaluation and selection strategies for use cases that require comprehensive task inclusion and high-variance filtering, respectively. This extensibility allows tailored analysis when handling experimental results.
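
For orientation, here is a minimal sketch of the shared interface these classes appear to implement. The evaluate/select methods and the dataclass structure are taken from the snippets quoted in the review below; the ABC base and the resample helper are assumptions added only to show how the two methods fit together, not part of the PR.

    from abc import ABC, abstractmethod
    from dataclasses import dataclass

    @dataclass
    class ResampleBenchmark(ABC):
        """Sketch: build a new benchmark by re-scoring tasks from a finished Study."""

        @abstractmethod
        def evaluate(self, study, env_args_list):
            """Return a per-task score, e.g. {task_name: value}, computed from the study's results."""

        @abstractmethod
        def select(self, values, env_args_list):
            """Return the list of EnvArgs to run next, based on the per-task scores."""

        # Hypothetical convenience wrapper (not in the PR), shown only to
        # illustrate how evaluate() and select() compose.
        def resample(self, study, env_args_list):
            return self.select(self.evaluate(study, env_args_list), env_args_list)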

Is this description stale? Ask me to generate a new description by commenting /korbit-generate-pr-description

@korbit-ai bot left a comment

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
  • Functionality — Missing Task Name Error Handling ▹ view
  • Security — Hardcoded Path Exposure ▹ view
  • Performance — Inefficient DataFrame Loading and Processing ▹ view
  • Readability — Threshold Value Validation Missing ▹ view

Files scanned:
  • src/agentlab/experiments/custom_benchmark.py


Comment on lines +78 to +80
def evaluate(self, study: Study, env_args_list):
    result_df = load_result_df(study.dir)
    return dict(result_df.groupby("env.task_name")["cum_reward"].std())

Missing Task Name Error Handling (category: Functionality)

What is the issue?

The evaluate method in HighVarianceBenchmark may fail if a task_name in env_args_list doesn't exist in the result_df from the study.

Why this matters

This will cause a KeyError when trying to access non-existent task names in the select method, potentially crashing the benchmark creation.

Suggested change

Add error handling to safely handle missing task names:

def evaluate(self, study: Study, env_args_list):
    result_df = load_result_df(study.dir)
    std_dict = dict(result_df.groupby("env.task_name")["cum_reward"].std())
    # Return 0 variance for missing tasks to exclude them
    return {task.task_name: std_dict.get(task.task_name, 0) for task in env_args_list}




if __name__ == "__main__":
    exp_dir = Path("/home/t/agentlab_results/2025-02-26_10-15-04_genericagent-gpt-4o-mini-2024-07-18-on-miniwob-tiny-test")

Hardcoded Path Exposure (category: Security)

What is the issue?

Hardcoded file path in the main script block that may expose sensitive directory structure information.

Why this matters

Exposing internal directory structures can help attackers map the system layout and potentially identify access points for exploitation.

Suggested change

Move the path to a configuration file or environment variable:

from os import getenv
from dotenv import load_dotenv

load_dotenv()
exp_dir = Path(getenv('EXPERIMENT_DIR'))
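
One caveat with the snippet above: if EXPERIMENT_DIR is unset, getenv returns None and Path(None) raises a TypeError. A hedged variant with an explicit fallback (the default path below is illustrative, not from the PR):

    import os
    from pathlib import Path

    # Fall back to a relative default when EXPERIMENT_DIR is not set (placeholder default)
    exp_dir = Path(os.getenv("EXPERIMENT_DIR", "./agentlab_results"))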


Comment on lines +79 to +80
    result_df = load_result_df(study.dir)
    return dict(result_df.groupby("env.task_name")["cum_reward"].std())

Inefficient DataFrame Loading and Processing (category: Performance)

What is the issue?

Loading and processing the entire result DataFrame for each evaluation is inefficient, especially when dealing with large datasets.

Why this matters

This approach requires loading the complete dataset into memory and performing groupby operations each time evaluate() is called, which can be memory-intensive and slow for large experiment results.

Suggested change

Cache the processed results or pass pre-computed statistics to avoid reloading and recomputing. Consider implementing as:

def __init__(self, exp_dir: Path, threshold: float = 0.2):
    self._cached_stats = None
    super().__init__(exp_dir=exp_dir, threshold=threshold)

def _compute_stats(self, study: Study):
    if self._cached_stats is None:
        result_df = load_result_df(study.dir)
        self._cached_stats = dict(result_df.groupby("env.task_name")["cum_reward"].std())
    return self._cached_stats

def evaluate(self, study: Study, env_args_list):
    return self._compute_stats(study)
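
Since HighVarianceBenchmark is declared with @dataclass in the diff, an alternative to hand-writing __init__ is to keep the cache in a non-init field. A minimal sketch under that assumption (it also assumes one Study per benchmark instance, so the cache never needs invalidation):

    from dataclasses import dataclass, field

    @dataclass
    class HighVarianceBenchmark(ResampleBenchmark):
        threshold: float = 0.2
        # Cache excluded from the generated __init__ and repr
        _cached_stats: dict | None = field(default=None, init=False, repr=False)

        def evaluate(self, study: Study, env_args_list):
            if self._cached_stats is None:
                result_df = load_result_df(study.dir)
                self._cached_stats = dict(result_df.groupby("env.task_name")["cum_reward"].std())
            return self._cached_stats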



@dataclass
class HighVarianceBenchmark(ResampleBenchmark):
    threshold: float = 0.2

Threshold Value Validation Missing (category: Readability)

What is the issue?

The hardcoded threshold value of 0.2 for variance selection lacks validation to ensure it's a reasonable value.

Why this matters

Invalid threshold values (negative or extremely high) could lead to unintended task filtering behavior.

Suggested change

Add threshold validation in __post_init__:

def __post_init__(self):
    if not 0 <= self.threshold <= 1:
        raise ValueError(f"Threshold must be between 0 and 1, got {self.threshold}")
    super().__post_init__()
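
With that check in place, an out-of-range value fails at construction time (illustrative call; the exp_dir argument is taken from the earlier caching suggestion, not from the diff itself):

    HighVarianceBenchmark(exp_dir=Path("results/"), threshold=1.5)
    # ValueError: Threshold must be between 0 and 1, got 1.5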


@TLSDC marked this pull request as draft on February 26, 2025 16:18
        super().__post_init__()

    @abstractmethod
    def evaluate(self, study, env_args_list):
@TLSDC (Collaborator, Author) commented:

simplify this

Comment on lines +93 to +127
class StochasticHighVarianceBenchmark(ResampleBenchmark):
    regulation_threshold: float = 0.1
    total_seeds = 600
    min_seeds = 2
    random_seed = 42

    def evaluate(self, study: Study, env_args_list):
        result_df = load_result_df(study.dir)
        var = result_df.groupby("env.task_name")["cum_reward"].var()
        probs = dict((var + self.regulation_threshold) / (var + self.regulation_threshold).sum())
        return probs

    def select(self, values, env_args_list: list[EnvArgs]):
        selected_env_args = []
        max_steps = env_args_list[0].max_steps
        for task_name, p in values.items():
            # ceil to avoid missing seeds
            n_seeds = np.random.RandomState(self.random_seed).poisson(p * self.total_seeds)
            n_seeds = max(n_seeds, self.min_seeds)
            for seed in np.random.RandomState(self.random_seed).randint(0, 2**32, n_seeds):
                selected_env_args.append(
                    EnvArgs(
                        task_name=task_name,
                        task_seed=int(seed),
                        max_steps=max_steps,
                        headless=True,
                        record_video=False,
                        wait_for_user_message=False,
                        viewport=None,
                        slow_mo=None,
                        storage_state=None,
                        task_kwargs=None,
                    )
                )
        return selected_env_args
@TLSDC (Collaborator, Author) commented:

probs is more like the proportion of the benchmark allocated to each task.

regulation_threshold and min_seeds act in a similar way, as a soft and a hard floor respectively. It could be interesting to play around with the regulation threshold.

total_seeds gives control over the size of the benchmark.
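
In other words, each task's share p is (var + regulation_threshold) normalized over all tasks, and its seed count is drawn from Poisson(p * total_seeds) with min_seeds as a hard floor. A small numeric sketch of the expected allocation, using made-up variances (not from the PR):

    import numpy as np

    # Hypothetical per-task reward variances (illustrative only)
    var = {"click-button": 0.00, "enter-text": 0.09, "use-slider": 0.25}
    regulation_threshold, total_seeds, min_seeds = 0.1, 600, 2

    total = sum(v + regulation_threshold for v in var.values())
    for task, v in var.items():
        p = (v + regulation_threshold) / total
        expected = max(int(np.round(p * total_seeds)), min_seeds)
        print(task, round(p, 3), expected)
    # click-button 0.156 94   <- zero-variance tasks still get seeds via the soft floor
    # enter-text   0.297 178
    # use-slider   0.547 328

The select() in the diff draws n_seeds stochastically from the Poisson rather than taking the expectation, so actual counts vary around these numbers.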
