fix: manipulaiton bench fixes #653
Conversation
@jmatejcz, I checked out this pull request and followed the steps from the rai_bench README to run the benchmark test. Here is some output from the console after I ran the testing command above. Is this expected? Let me know if I missed any setup steps.
Yes, these are logs produced during validation of tool calls. If you would like to see the logs and results of the benchmark, go to the results folder. Docs will be updated once this branch is merged: https://github.com/RobotecAI/rai/tree/jm/docs/bench-docs-update
Force-pushed ef72470 → a11c4f6
Force-pushed 64f401d → 13244ef
Thanks for all the effort; this is a very large PR! I took a pass and the changes look good. I left some comments.
```diff
-python src/rai_bench/rai_bench/examples/tool_calling_agent.py --model-name <model-name> --vendor <vendor> --extra-tool-calls <5> --task-types <basic> --out-dir <out_dir>
+python src/rai_bench/rai_bench/examples/tool_calling_agent.py --model-name <qwen2.5:7b> --vendor <ollama> --extra-tool-calls <0 5> --task-types basic --n-shots <0 2> --prompt-detail <brief descriptive> --complexities <easy medium hard> --out-dir <out_dir>
```
Consider using the previous generic values for `--model-name` and `--vendor` instead of the current specific ones (`<qwen2.5:7b>`, `<ollama>`).
Yes, I changed that in the other PR regarding docs: #665
```diff
 !!! note
-This Benchmark is significantly faster, but still if just trying out, we recommend choosing just one task-type.
+This Benchmark is significantly faster, but still, if just trying out, we recommend choosing just one parameter per flag as every combination on params will create more tasks.
```
Alternatively, we can say: "This benchmark is significantly faster, but for initial testing, we recommend selecting only one parameter per flag since each parameter combination creates additional tasks."
```diff
 )
 tool_conf = ToolCallingAgentBenchmarkConfig(
-    extra_tool_calls=5,  # how many extra tool calls allowed to still pass
+    extra_tool_calls=[0],  # how many extra tool calls allowed to still pass
```
For my understanding, what was the reasoning behind changing this value from 5 to 0?
```diff
     model_names=model_names,
     vendors=vendors,
-    benchmark_configs=[man_conf, tool_conf],
+    benchmark_configs=[tool_conf],
```
Is it intended to remove `man_conf` from the configs? If so, we can remove `man_conf` entirely, since it doesn't seem to be used anywhere.
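For context, here is a rough sketch of how the excerpts above might fit together; the import path and the exact `test_models` call signature are assumptions based on the snippets in this PR, not verified against the repository.

```python
# Hypothetical wiring of the configuration shown in the diffs above.
# The import path below is an assumption.
from rai_bench import ToolCallingAgentBenchmarkConfig, test_models

model_names = ["qwen2.5:7b"]
vendors = ["ollama"]

tool_conf = ToolCallingAgentBenchmarkConfig(
    extra_tool_calls=[0],  # how many extra tool calls allowed to still pass
)

# man_conf is no longer passed here, which is why the question above asks
# whether it can be removed entirely.
test_models(
    model_names=model_names,
    vendors=vendors,
    benchmark_configs=[tool_conf],
)
```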
bench_logger.critical(f"BENCHMARK RUN FAILED: {e}") | ||
error_msg = traceback.format_exc() | ||
bench_logger.critical(error_msg) | ||
print(error_msg) |
Regarding "Now when unexpected error occurs the whole benchmark will stop, preserving the scores that were achieved till this moment" - the exception handling here only catches, logs, and continues, correct? I'm trying to understand how the new changes will actually stop the benchmark run.
Because previously the error was caught here: https://github.com/RobotecAI/rai/pull/653/files#diff-76d8a11a893998856be6b462a5f01a0f7c0d3a0013fb979e286fce9fc3618247L311, which led to continuing with the other scenarios.
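To illustrate the intended control flow, here is a minimal sketch assuming `test_models` loops over scenarios and receives a standard `logging.Logger`; the signatures are illustrative rather than the actual rai_bench code, and the logging lines are taken from the excerpt above.

```python
import logging
import traceback


def run_next(scenario):
    # Stand-in for the benchmark's per-scenario runner; the real one lives in
    # rai_bench and previously caught unexpected exceptions itself.
    print(f"running scenario: {scenario}")


def test_models(scenarios, bench_logger: logging.Logger):
    # Unexpected exceptions now propagate out of run_next, so a failure in any
    # scenario aborts the remaining ones instead of silently continuing.
    try:
        for scenario in scenarios:
            run_next(scenario)
    except Exception as e:
        bench_logger.critical(f"BENCHMARK RUN FAILED: {e}")
        error_msg = traceback.format_exc()
        bench_logger.critical(error_msg)
        print(error_msg)
        # Scores saved before the failure are preserved; the run stops here.
```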
```python
bench_logger = define_benchmark_logger(out_dir=experiment_dir)

tasks = get_tasks(
    extra_tool_calls=args.extra_tool_calls,
```
`args.extra_tool_calls` is an int, which needs to be converted to a list because `extra_tool_calls` is expected to be a list.
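One possible way to address this, shown as a sketch rather than the PR's actual fix (the `nargs="+"` approach and the wrapping below are assumptions):

```python
import argparse

parser = argparse.ArgumentParser()
# Accept one or more integers so args.extra_tool_calls is already a list.
parser.add_argument("--extra-tool-calls", type=int, nargs="+", default=[0])
args = parser.parse_args(["--extra-tool-calls", "0", "5"])
print(args.extra_tool_calls)  # [0, 5]

# Alternatively, if the flag still parses a single int, wrap it before use:
extra_tool_calls = (
    args.extra_tool_calls
    if isinstance(args.extra_tool_calls, list)
    else [args.extra_tool_calls]
)
```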
Force-pushed a11c4f6 → 6d37658
I'm sorry for the confusion; the commits with large changes were from other branches and appeared as new changes after the squash-and-merge of other PRs. Now I have rebased and the PR is quite a bit smaller ;p
Level assignment was already merged into development in another PR, so it is not included in this PR.
LGTM.
feat: tool calling benchmark unified across types and prompts variety… (#620)
feat: basic tasks extension (#644)
feat: tool calling custom interfaces tasks extension (#636)
feat: tool calling spatial reasoning tasks extension (#637)
refactor: remove navigation tasks (#638)
refactor: o3de config (#630)
refactor(`nav2_toolkit`): remove unused `action_client` (#670)
fix: manipulaiton bench fixes (#653)
docs: rai simbench docs update (#665)
feat: planning task and megamind agent (#679)
feat: megamind context providers (#687)
feat: tool calling bench - manipulation tasks extenstion (#656)
chore: resolving conflicts (#690)

Co-authored-by: Jakub Matejczyk <[email protected]>
Co-authored-by: Julia Jia <[email protected]>
Co-authored-by: Magdalena Kotynia <[email protected]>
Co-authored-by: Pawel Kotowski <[email protected]>
Co-authored-by: Brian Tuan <[email protected]>
Co-authored-by: jmatejcz <[email protected]>
Purpose
Fixing a couple of small issues that were occurring during execution:
Proposed Changes
Exceptions are now handled in the `test_models` function instead of in `run_next`. Before this change, an unexpected exception affected only the single scenario execution, and the next scenario would run as planned. That was the problem: usually the cause was outside the scope of a single scenario and would reoccur in every scenario until the end, for example a CUDA OOM. In such cases time was wasted and the average scores were impaired by the 0.0 scores that resulted from the error. Now, when an unexpected error occurs, the whole benchmark stops, preserving the scores that were achieved up to that moment.
Issues
Testing
Just run the manipulation benchmark; you can check the logs to see that levels are assigned properly.
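As a rough sketch of that check, assuming results and logs end up under the `--out-dir` directory discussed earlier (the exact file layout and log names are assumptions):

```bash
# List whatever the benchmark wrote to the output directory and search the
# logs for level-assignment messages (directory layout is an assumption).
ls -R <out_dir>/
grep -ri "level" <out_dir>/
```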