fix: manipulation bench fixes #653


Open

wants to merge 2 commits into base: development

Conversation

jmatejcz
Contributor

@jmatejcz jmatejcz commented Jul 25, 2025

Purpose

Fixing a couple of small issues that were occurring during execution:

  • Some scenarios didn't have a proper level assigned
  • Errors were caught in a way that was not informative and led to further execution
  • Fixed the response returned by MoveObjectFromToTool

Proposed Changes

  • Now all scenarios have a proper level assignment in the predefined scenarios
  • Errors are now caught in the test_models function instead of in run_next. Previously, when an unexpected exception occurred, it affected only the single scenario execution, and the next scenario would run as planned. That was the problem, because the cause was usually outside the scenario's scope and would reoccur in every scenario until the end, for example a CUDA OOM. In such a case time was wasted, and the average scores were impaired by the 0.0 scores that resulted from the error.
    Now, when an unexpected error occurs, the whole benchmark stops, preserving the scores achieved up to that moment.

Issues

Testing

Just run the manipulation benchmark:

python src/rai_bench/rai_bench/examples/benchmarking_models.py

You can check the logs to see that levels are assigned properly.

@jmatejcz jmatejcz changed the base branch from main to development July 25, 2025 08:26
@jmatejcz jmatejcz marked this pull request as ready for review July 25, 2025 08:38
@Juliaj
Collaborator

Juliaj commented Jul 27, 2025

@jmatejcz, I checked out this pull request and followed the steps from the rai_bench README to run the benchmark test. I pulled qwen2.5:7b.

Here is some output from the console after I ran the testing command above. Are these expected? Let me know if I missed any setup steps.

Expected tool call name should be 'get_ros2_message_interface', but got get_ros2_topics_names_and_types
Expected tool call name should be 'get_ros2_message_interface', but got publish_ros2_message
Expected tool call name should be 'get_ros2_message_interface', but got publish_ros2_message
Expected tool call name 'call_ros2_service', but got 'get_ros2_services_names_and_types'.
...
Field path 'parameters.0.name' not found in the message.
Expected service '/robot_state_publisher/set_parameters_atomically', but got 'None'.

@jmatejcz
Contributor Author

> @jmatejcz, I checked out this pull request and followed the steps from the rai_bench README to run the benchmark test. I pulled qwen2.5:7b.
>
> Here is some output from the console after I ran the testing command above. Are these expected? Let me know if I missed any setup steps.
>
> Expected tool call name should be 'get_ros2_message_interface', but got get_ros2_topics_names_and_types
> Expected tool call name should be 'get_ros2_message_interface', but got publish_ros2_message
> Expected tool call name should be 'get_ros2_message_interface', but got publish_ros2_message
> Expected tool call name 'call_ros2_service', but got 'get_ros2_services_names_and_types'.
> ...
> Field path 'parameters.0.name' not found in the message.
> Expected service '/robot_state_publisher/set_parameters_atomically', but got 'None'.

Yes, these are logs from the validation of tool calls. If you would like to see the logs and results of the benchmark, go to the results folder, which should be under src/rai_bench/rai_bench/experiments/tool_calling/ if you didn't pass a custom out_path.
And please don't follow the README in rai_bench, as it is outdated on this branch and on development. Head over to the docs ->
https://github.com/RobotecAI/rai/blob/jm/fix/manipulaiton-bench-issues/docs/tutorials/benchmarking.md

Docs will be updated once this branch is merged: https://github.com/RobotecAI/rai/tree/jm/docs/bench-docs-update

@jmatejcz jmatejcz force-pushed the jm/fix/manipulaiton-bench-issues branch from ef72470 to a11c4f6 Compare July 28, 2025 11:59
Collaborator

@Juliaj Juliaj left a comment

Thanks for all the effort, this is a very large PR! I took a pass and changes look good. I left some comments.

@@ -53,12 +31,12 @@ If your goal is creating custom tasks and scenarios, visit [Creating Custom Task
This benchmark does not require any additional setup besides the main one [Basic Setup](../setup/install.md), just run:

```bash
python src/rai_bench/rai_bench/examples/tool_calling_agent.py --model-name <model-name> --vendor <vendor> --extra-tool-calls <5> --task-types <basic> --out-dir <out_dir>
python src/rai_bench/rai_bench/examples/tool_calling_agent.py --model-name <qwen2.5:7b> --vendor <ollama> --extra-tool-calls <0 5> --task-types basic --n-shots <0 2> --prompt-detail <brief descriptive> --complexities <easy medium hard> --out-dir <out_dir>
```

Collaborator

Consider using the previous generic placeholder values for --model-name and --vendor (e.g. <model-name>, <vendor>) instead of the current specific ones (<qwen2.5:7b>, <ollama>).

Contributor Author

Yes, I changed that in another PR regarding the docs: #665

!!! note

This Benchmark is significantly faster, but still if just trying out, we recommend choosing just one task-type.
This Benchmark is significantly faster, but still, if just trying out, we recommend choosing just one parameter per flag as every combination on params will create more tasks.
Collaborator

Alternatively, we can say

This benchmark is significantly faster, but for initial testing, we recommend selecting only one parameter per flag since each parameter combination creates additional tasks.

o3de_config_path="src/rai_bench/rai_bench/manipulation_o3de/predefined/configs/o3de_config.yaml", # path to your o3de config
levels=[ # define what difficulty of tasks to include in benchmark
"trivial",
],
repeats=1, # how many times to repeat
)
tool_conf = ToolCallingAgentBenchmarkConfig(
extra_tool_calls=5, # how many extra tool calls allowed to still pass
extra_tool_calls=[0], # how many extra tool calls allowed to still pass
Collaborator

For my understanding, what was the reasoning behind changing this value from 5 to 0?

repeats=1,
)

out_dir = "src/rai_bench/rai_bench/experiments"
test_models(
model_names=model_names,
vendors=vendors,
benchmark_configs=[man_conf, tool_conf],
benchmark_configs=[tool_conf],
Collaborator

Is it intended to remove man_conf from the configs? If so, we can remove man_conf entirely, since it doesn't seem to be used anywhere.

bench_logger.critical(f"BENCHMARK RUN FAILED: {e}")
error_msg = traceback.format_exc()
bench_logger.critical(error_msg)
print(error_msg)
Collaborator

Regarding "Now when unexpected error occurs the whole benchmark will stop, preserving the scores that were achieved till this moment" - the exception handling here only catches and logs, then continues, correct? I'm trying to understand how the new changes will actually stop the benchmark run.

Contributor Author

Because the error was previously caught here: https://github.com/RobotecAI/rai/pull/653/files#diff-76d8a11a893998856be6b462a5f01a0f7c0d3a0013fb979e286fce9fc3618247L311

which led to continuing with the other scenarios.

@@ -34,6 +34,8 @@
extra_tool_calls=args.extra_tool_calls,
Collaborator

args.extra_tool_calls is an int, which needs to be converted to a list, because extra_tool_calls is expected to be a list.
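The mismatch could be handled at the parser itself. This is a hypothetical sketch (the flag name comes from the discussion, the parser setup is assumed) showing how argparse can collect the values as a list directly, plus a defensive normalization for the scalar case:

```python
# Hypothetical sketch: extra_tool_calls is expected to be a list of
# ints, but a scalar CLI argument parses as a plain int.
import argparse

parser = argparse.ArgumentParser()
# nargs="+" makes argparse collect one or more ints into a list,
# avoiding the int-vs-list mismatch at the source.
parser.add_argument("--extra-tool-calls", type=int, nargs="+", default=[0])

args = parser.parse_args(["--extra-tool-calls", "5"])

# Defensive normalization, in case the argument was defined without nargs:
extra_tool_calls = (
    args.extra_tool_calls
    if isinstance(args.extra_tool_calls, list)
    else [args.extra_tool_calls]
)
print(extra_tool_calls)  # [5]
```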

@jmatejcz jmatejcz force-pushed the jm/fix/manipulaiton-bench-issues branch from a11c4f6 to 6d37658 Compare August 8, 2025 07:09
@jmatejcz
Contributor Author

jmatejcz commented Aug 8, 2025

> Thanks for all the effort, this is a very large PR! I took a pass and changes look good. I left some comments.

I'm sorry for the confusion; the commits with large changes were from other branches and appeared as new changes after the squash-and-merge of other PRs. Now I have rebased and the PR is quite a bit smaller ;p

@jmatejcz
Contributor Author

jmatejcz commented Aug 8, 2025

The level assignment fix was already merged into development in another PR, so it is not included in this PR.

Collaborator

@Juliaj Juliaj left a comment

LGTM.

@jmatejcz jmatejcz requested a review from boczekbartek August 11, 2025 15:55