fix: manipulation bench fixes #653
base: development
Conversation
@jmatejcz, I checked out this pull request and followed the steps from the rai_bench README to run the benchmark test. I pulled [...]. Here is some output from the console after I ran the testing command above. Are these expected? Let me know if I missed any setup steps.
Yes, these are logs produced during validation of tool calls. If you would like to see the logs and results of the benchmark, go to the results folder, which should be under [...]. Docs will be updated once this branch is merged: https://github.com/RobotecAI/rai/tree/jm/docs/bench-docs-update
Force-pushed from ef72470 to a11c4f6
Force-pushed from 64f401d to 13244ef
Thanks for all the effort, this is a very large PR! I took a pass and the changes look good. I left some comments.
docs/tutorials/benchmarking.md
Outdated
@@ -53,12 +31,12 @@ If your goal is creating custom tasks and scenarios, visit [Creating Custom Task
This benchmark does not require any additional setup besides the main one [Basic Setup](../setup/install.md), just run:

```bash
- python src/rai_bench/rai_bench/examples/tool_calling_agent.py --model-name <model-name> --vendor <vendor> --extra-tool-calls <5> --task-types <basic> --out-dir <out_dir>
+ python src/rai_bench/rai_bench/examples/tool_calling_agent.py --model-name <qwen2.5:7b> --vendor <ollama> --extra-tool-calls <0 5> --task-types basic --n-shots <0 2> --prompt-detail <brief descriptive> --complexities <easy medium hard> --out-dir <out_dir>
```
Consider using the previous generic values for --model-name <qwen2.5:7b> --vendor <ollama>
instead of the current specific ones.
Yes, I changed that in another PR regarding the docs: #665
docs/tutorials/benchmarking.md
Outdated
!!! note

-    This Benchmark is significantly faster, but still if just trying out, we recommend choosing just one task-type.
+    This Benchmark is significantly faster, but still, if just trying out, we recommend choosing just one parameter per flag as every combination on params will create more tasks.
Alternatively, we could say:
This benchmark is significantly faster, but for initial testing, we recommend selecting only one parameter per flag since each parameter combination creates additional tasks.
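To make that note concrete, here is a rough sketch in plain Python (hypothetical values mirroring the command above, not rai_bench code) of why passing several values per flag multiplies the number of generated tasks: the benchmark effectively builds one task variant per combination of parameters.

```python
from itertools import product

# Hypothetical flag values, mirroring the example command above.
extra_tool_calls = [0, 5]
n_shots = [0, 2]
prompt_detail = ["brief", "descriptive"]
complexities = ["easy", "medium", "hard"]

# One task variant per combination of parameters (per task type).
combinations = list(product(extra_tool_calls, n_shots, prompt_detail, complexities))
print(len(combinations))  # 2 * 2 * 2 * 3 = 24 variants
```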
    o3de_config_path="src/rai_bench/rai_bench/manipulation_o3de/predefined/configs/o3de_config.yaml",  # path to your o3de config
    levels=[  # define what difficulty of tasks to include in benchmark
        "trivial",
    ],
    repeats=1,  # how many times to repeat
)
tool_conf = ToolCallingAgentBenchmarkConfig(
-   extra_tool_calls=5,  # how many extra tool calls allowed to still pass
+   extra_tool_calls=[0],  # how many extra tool calls allowed to still pass
For my understanding, what was the reasoning behind changing this value from 5 to 0?
    repeats=1,
)

out_dir = "src/rai_bench/rai_bench/experiments"
test_models(
    model_names=model_names,
    vendors=vendors,
-   benchmark_configs=[man_conf, tool_conf],
+   benchmark_configs=[tool_conf],
Is it intended to remove mani_conf from the configs? If so, we can remove mani_conf entirely, since it doesn't seem to be used anywhere.
bench_logger.critical(f"BENCHMARK RUN FAILED: {e}")
error_msg = traceback.format_exc()
bench_logger.critical(error_msg)
print(error_msg)
Regarding "Now when unexpected error occurs the whole benchmark will stop, preserving the scores that were achieved till this moment" - the exception handling here only catches, logs, and continues, correct? I'm trying to understand how the new changes will actually stop the benchmark run.
Because the error was caught here: https://github.com/RobotecAI/rai/pull/653/files#diff-76d8a11a893998856be6b462a5f01a0f7c0d3a0013fb979e286fce9fc3618247L311, which led to continuing with the other scenarios.
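For readers following along, here is a minimal sketch of the difference; the function and parameter names are illustrative only, not the actual rai_bench internals. Previously the try/except sat inside the per-scenario loop, so an error such as a CUDA OOM was logged and the loop moved on; with the handling lifted into test_models, the first unexpected error ends the loop while the scores collected so far are still saved.

```python
import traceback

# Illustrative sketch only; names do not match rai_bench internals.
def test_models(scenarios, run_scenario, save_results, bench_logger):
    results = []
    try:
        for scenario in scenarios:
            # No inner try/except anymore: an unexpected error (e.g. CUDA OOM)
            # propagates here instead of being swallowed once per scenario.
            results.append(run_scenario(scenario))
    except Exception as e:
        bench_logger.critical(f"BENCHMARK RUN FAILED: {e}")
        bench_logger.critical(traceback.format_exc())
    finally:
        # Scores achieved before the failure are still written out.
        save_results(results)
```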
@@ -34,6 +34,8 @@
    extra_tool_calls=args.extra_tool_calls,
args.extra_tool_calls is an int, which needs to be converted to a list, because extra_tool_calls is expected to be a list.
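One way this could be handled on the CLI side (a sketch assuming argparse, not necessarily how the PR does it) is to accept one or more integers directly, so the parsed attribute is already a list:

```python
import argparse

parser = argparse.ArgumentParser()
# nargs="+" makes argparse collect one or more ints into a list.
parser.add_argument("--extra-tool-calls", type=int, nargs="+", default=[0])

args = parser.parse_args(["--extra-tool-calls", "5"])
extra_tool_calls = args.extra_tool_calls  # already a list: [5]
print(extra_tool_calls)
```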
Force-pushed from a11c4f6 to 6d37658
I'm sorry for the confusion, the commits with large changes were from other branches and appeared as new changes after the squash and merge of other PRs. Now I have rebased and the PR is quite a bit smaller ;p
The level-assignment change was already merged to development in another PR, so it is not included in this PR.
LGTM.
Purpose
Fixing a couple of small issues that were occurring during execution:
Proposed Changes
Exception handling was moved to the test_models function instead of run_next. Before that, when an unexpected exception occurred, it would only affect the single scenario execution, and the next scenario would execute as planned. That was the problem, because usually the cause was outside the scenario scope and would reoccur in every scenario until the end, for example a CUDA OOM. In such a case the time was wasted and the average scores were impaired by the 0.0 scores that resulted from the error. Now, when an unexpected error occurs, the whole benchmark will stop, preserving the scores that were achieved up to that moment.
Issues
Testing
Just run the manipulation benchmark; you can check out the logs to see that levels are assigned properly.