fix: manipulaiton bench fixes #653
Conversation
@jmatejcz, I checked out this pull request and followed the steps from the rai_bench README to run the benchmark test. Here is some output from the console after I ran the testing command above. Is this expected? Let me know if I missed any setup steps.
Yes, these are logs produced during validation of tool calls. If you would like to see the logs and results of the benchmark, go to the results folder. Docs will be updated once this branch is merged: https://github.com/RobotecAI/rai/tree/jm/docs/bench-docs-update
Force-pushed ef72470 → a11c4f6
Force-pushed 64f401d → 13244ef
Thanks for all the effort; this is a very large PR! I took a pass and the changes look good. I left some comments.
```diff
-python src/rai_bench/rai_bench/examples/tool_calling_agent.py --model-name <model-name> --vendor <vendor> --extra-tool-calls <5> --task-types <basic> --out-dir <out_dir>
+python src/rai_bench/rai_bench/examples/tool_calling_agent.py --model-name <qwen2.5:7b> --vendor <ollama> --extra-tool-calls <0 5> --task-types basic --n-shots <0 2> --prompt-detail <brief descriptive> --complexities <easy medium hard> --out-dir <out_dir>
```
Consider using the previous generic values for `--model-name` and `--vendor` instead of the current specific ones (`<qwen2.5:7b>`, `<ollama>`).
Yes, I changed that in the other PR regarding docs: #665
```diff
 !!! note
-This Benchmark is significantly faster, but still if just trying out, we recommend choosing just one task-type.
+This Benchmark is significantly faster, but still, if just trying out, we recommend choosing just one parameter per flag as every combination on params will create more tasks.
```
Alternatively, we can say: "This benchmark is significantly faster, but for initial testing, we recommend selecting only one parameter per flag since each parameter combination creates additional tasks."
```diff
 )
 tool_conf = ToolCallingAgentBenchmarkConfig(
-    extra_tool_calls=5,  # how many extra tool calls allowed to still pass
+    extra_tool_calls=[0],  # how many extra tool calls allowed to still pass
```
For my understanding, what was the reasoning behind changing this value from 5 to 0?
```diff
     model_names=model_names,
     vendors=vendors,
-    benchmark_configs=[man_conf, tool_conf],
+    benchmark_configs=[tool_conf],
```
Is it intended to remove `man_conf` from the configs? If so, we can remove `man_conf` entirely, since it doesn't seem to be used anywhere.
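For context, here is a rough sketch of how the excerpts above might fit together; the import path and the exact `test_models` call signature are assumptions based on the snippets in this PR, not verified against the repository.

```python
# Hypothetical wiring of the configuration shown in the diffs above.
# The import path below is an assumption.
from rai_bench import ToolCallingAgentBenchmarkConfig, test_models

model_names = ["qwen2.5:7b"]
vendors = ["ollama"]

tool_conf = ToolCallingAgentBenchmarkConfig(
    extra_tool_calls=[0],  # how many extra tool calls allowed to still pass
)

# man_conf is no longer passed here, which is why the question above asks
# whether it can be removed entirely.
test_models(
    model_names=model_names,
    vendors=vendors,
    benchmark_configs=[tool_conf],
)
```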
bench_logger.critical(f"BENCHMARK RUN FAILED: {e}") | ||
error_msg = traceback.format_exc() | ||
bench_logger.critical(error_msg) | ||
print(error_msg) |
Regarding "Now when unexpected error occurs the whole benchmark will stop, preserving the scores that were achieved till this moment" - the exception handling here only catches, logs, and continues, correct? I'm trying to understand how the new changes will actually stop the benchmark run.
Because previously the error was caught here: https://github.com/RobotecAI/rai/pull/653/files#diff-76d8a11a893998856be6b462a5f01a0f7c0d3a0013fb979e286fce9fc3618247L311, which led to continuing with the other scenarios.
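To illustrate the intended control flow, here is a minimal sketch assuming `test_models` loops over scenarios and receives a standard `logging.Logger`; the signatures are illustrative rather than the actual rai_bench code, and the logging lines are taken from the excerpt above.

```python
import logging
import traceback


def run_next(scenario):
    # Stand-in for the benchmark's per-scenario runner; the real one lives in
    # rai_bench and previously caught unexpected exceptions itself.
    print(f"running scenario: {scenario}")


def test_models(scenarios, bench_logger: logging.Logger):
    # Unexpected exceptions now propagate out of run_next, so a failure in any
    # scenario aborts the remaining ones instead of silently continuing.
    try:
        for scenario in scenarios:
            run_next(scenario)
    except Exception as e:
        bench_logger.critical(f"BENCHMARK RUN FAILED: {e}")
        error_msg = traceback.format_exc()
        bench_logger.critical(error_msg)
        print(error_msg)
        # Scores saved before the failure are preserved; the run stops here.
```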
```python
bench_logger = define_benchmark_logger(out_dir=experiment_dir)

tasks = get_tasks(
    extra_tool_calls=args.extra_tool_calls,
```
`args.extra_tool_calls` is an int, which needs to be converted to a list because `extra_tool_calls` is expected to be a list.
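One possible way to address this, shown as a sketch rather than the PR's actual fix (the `nargs="+"` approach and the wrapping below are assumptions):

```python
import argparse

parser = argparse.ArgumentParser()
# Accept one or more integers so args.extra_tool_calls is already a list.
parser.add_argument("--extra-tool-calls", type=int, nargs="+", default=[0])
args = parser.parse_args(["--extra-tool-calls", "0", "5"])
print(args.extra_tool_calls)  # [0, 5]

# Alternatively, if the flag still parses a single int, wrap it before use:
extra_tool_calls = (
    args.extra_tool_calls
    if isinstance(args.extra_tool_calls, list)
    else [args.extra_tool_calls]
)
```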
Force-pushed a11c4f6 → 6d37658
I'm sorry for the confusion; the commits with large changes were from other branches and appeared as new changes after the squash-and-merge of other PRs. Now I have rebased and the PR is quite a bit smaller ;p
Level assignment was already merged into development in another PR, so it is not included in this PR.
LGTM.
feat: tool calling benchmark unified across types and prompts variety… (#620)
feat: basic tasks extension (#644)
feat: tool calling custom interfaces tasks extension (#636)
feat: tool calling spatial reasoning tasks extension (#637)
refactor: remove navigation tasks (#638)
refactor: o3de config (#630)
refactor(`nav2_toolkit`): remove unused `action_client` (#670)
fix: manipulaiton bench fixes (#653)
docs: rai simbench docs update (#665)
feat: planning task and megamind agent (#679)
feat: megamind context providers (#687)
feat: tool calling bench - manipulation tasks extenstion (#656)
chore: resolving conflicts (#690)

Co-authored-by: Jakub Matejczyk <[email protected]>
Co-authored-by: Julia Jia <[email protected]>
Co-authored-by: Magdalena Kotynia <[email protected]>
Co-authored-by: Pawel Kotowski <[email protected]>
Co-authored-by: Brian Tuan <[email protected]>
Co-authored-by: jmatejcz <[email protected]>
Purpose
Fixing a couple of small issues that were occurring during execution:
Proposed Changes
Exceptions are now handled in the `test_models` function instead of in `run_next`. Before this change, an unexpected exception affected only the single scenario execution, and the next scenario would run as planned. That was the problem: usually the cause was outside the scope of a single scenario and would reoccur in every scenario until the end, for example a CUDA OOM. In such cases time was wasted and the average scores were impaired by the 0.0 scores that resulted from the error. Now, when an unexpected error occurs, the whole benchmark stops, preserving the scores that were achieved up to that moment.
Issues
Testing
Just run the manipulation benchmark; you can check the logs to see that levels are assigned properly.
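As a rough sketch of that check, assuming results and logs end up under the `--out-dir` directory discussed earlier (the exact file layout and log names are assumptions):

```bash
# List whatever the benchmark wrote to the output directory and search the
# logs for level-assignment messages (directory layout is an assumption).
ls -R <out_dir>/
grep -ri "level" <out_dir>/
```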