From 85546d99891dca7782c02514fb26a848a9e7075e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pawe=C5=82=20Kotowski?=
Date: Mon, 18 Aug 2025 16:02:59 +0200
Subject: [PATCH 1/5] docs: improve rai_bench readme

---
 src/rai_bench/README.md | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/src/rai_bench/README.md b/src/rai_bench/README.md
index b9a720aa3..706c68bae 100644
--- a/src/rai_bench/README.md
+++ b/src/rai_bench/README.md
@@ -12,7 +12,7 @@ The Manipulation O3DE Benchmark [manipulation_o3de_benchmark_module](./rai_bench
 - **PlaceObjectAtCoordTask**
 - **RotateObjectTask** (currently not applicable due to limitations in the ManipulatorMoveTo tool)

-The result of a task is a value between 0 and 1, calculated like initially_misplaced_now_correct / initially_misplaced. This score is calculated at the end of each scenario.
+The result of a task is a value between 0 and 1, calculated as `initially_misplaced_now_correct / initially_misplaced`. This score is calculated at the end of each scenario.

 ### Frame Components

@@ -92,7 +92,7 @@ python src/rai_bench/rai_bench/examples/manipulation_o3de/main.py --model-name l
 ```

 > [!NOTE]
-> For now benchmark runs all available scenarios (~160). See [Examples](#example-usege)
+> For now, the benchmark runs all available scenarios (~160). See the [Examples](#example-usage)
 > section for details.

 ### Development

@@ -100,7 +100,7 @@ When creating new task or changing existing ones, make sure to add unit tests fo
 score calculation in [rai_bench_tests](../../tests/rai_bench/manipulation_o3de/tasks/).
 This applies also when you are adding or changing the helper methods in `Task` or `ManipulationTask`.

-The number of scenarios can be easily extened without writing new tasks, by increasing number of variants of the same task and adding more simulation configs but it won't improve variety of scenarios as much as creating new tasks.
+The number of scenarios can be easily extended without writing new tasks, by increasing the number of variants of the same task and adding more simulation configs, but this won't improve the variety of scenarios as much as creating new tasks.

 ## Tool Calling Agent Benchmark

@@ -109,15 +109,16 @@ The Tool Calling Agent Benchmark is the benchmark for LangChain tool calling age

 ### Frame Components

 - [Tool Calling Agent Benchmark](rai_bench//tool_calling_agent/benchmark.py) - Benchmark for LangChain tool calling agents
-- [Scores tracing](rai_bench/tool_calling_agent_bench/scores_tracing.py) - Component handling sending scores to tracing backends
+- [Scores tracing](rai_bench/results_processing/langfuse_scores_tracing.py) - Component that sends scores to tracing backends
 - [Interfaces](rai_bench//tool_calling_agent/interfaces.py) - Interfaces for validation classes - Task, Validator, SubTask

-  For detailed description of validation visit -> [Validation](.//rai_bench/docs/tool_calling_agent_benchmark.md)
-[tool_calling_agent_test_bench.py](rai_bench/examples/tool_calling_agent/main.py) - Script providing benchmark on tasks based on the ROS2 tools usage.
+  For a detailed description of validation, see [Validation](./rai_bench/docs/tool_calling_agent_benchmark.md)
+
+[tool_calling_agent_test_bench.py](./rai_bench/examples/tool_calling_agent/main.py) - Script that runs the benchmark on tasks based on ROS2 tool usage.
### Example Usage

-Validators can be constructed from any SubTasks, Tasks can be validated by any numer of Validators, which makes whole validation process incredibly versital.
+`Validators` can be constructed from any `SubTasks`, and `Tasks` can be validated by any number of `Validators`, which makes the whole validation process incredibly versatile.

 ```python
 # subtasks
@@ -144,7 +145,7 @@ GetROS2RGBCameraTask(validators=[topics_ord_val, color_image_ord_val]),

 ### Running

-To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/tracing.md) document.
+To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/setup/tracing.md) document.

 To run the benchmark:

@@ -169,7 +170,7 @@ The VLM Benchmark is a benchmark for VLM models. It includes a set of tasks cont

 ### Running

-To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/tracing.md) document.
+To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/setup/tracing.md) document.

 To run the benchmark:

@@ -216,7 +217,7 @@ When you run a test via:

 python src/rai_bench/rai_bench/examples/test_models.py
 ```

-results will be saved to separate folder in [results](./rai_bench/experiments/), with prefix `run_`
+results will be saved to a separate folder in [experiments](./rai_bench/experiments/), with the prefix `run_`

 To visualise the results run:

From 82cf9fe84db580267f530a83c249096fac400ced Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pawe=C5=82=20Kotowski?=
Date: Mon, 18 Aug 2025 16:08:58 +0200
Subject: [PATCH 2/5] docs: fix typo

---
 src/rai_bench/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/rai_bench/README.md b/src/rai_bench/README.md
index 706c68bae..c17ecbc8f 100644
--- a/src/rai_bench/README.md
+++ b/src/rai_bench/README.md
@@ -182,7 +182,7 @@ python src/rai_bench/rai_bench/examples/vlm_benchmark.py --model-name gemma3:4b

 ## Testing Models

-To test multiple models, different benchamrks or couple repeats in one go - use script [test_models](./rai_bench/examples/test_models.py)
+To test multiple models, different benchmarks, or several repeats in one go, use the script [test_models](./rai_bench/examples/test_models.py)

 Modify these params:

From 30ba4b0c2c5880141bdfbdaa5bfc30f90d614bb3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pawe=C5=82=20Kotowski?=
Date: Mon, 18 Aug 2025 16:16:23 +0200
Subject: [PATCH 3/5] docs: improve formatting in rai_bench README by adding backticks around class names

---
 src/rai_bench/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/rai_bench/README.md b/src/rai_bench/README.md
index c17ecbc8f..739a228ea 100644
--- a/src/rai_bench/README.md
+++ b/src/rai_bench/README.md
@@ -10,7 +10,7 @@ The Manipulation O3DE Benchmark [manipulation_o3de_benchmark_module](./rai_bench
 - **GroupObjectsTask**
 - **BuildCubeTowerTask**
 - **PlaceObjectAtCoordTask**
-- **RotateObjectTask** (currently not applicable due to limitations in the ManipulatorMoveTo tool)
+- **RotateObjectTask** (currently not applicable due to limitations in the `ManipulatorMoveTo` tool)

 The result of a task is a value between 0 and 1, calculated as `initially_misplaced_now_correct / initially_misplaced`. This score is calculated at the end of each scenario.
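+
+For illustration only, the scoring rule above amounts to the following (a minimal sketch - the function name and the convention for a scenario with nothing misplaced are assumptions, not the benchmark's actual code):
+
+```python
+def task_score(initially_misplaced: int, initially_misplaced_now_correct: int) -> float:
+    """Fraction of initially misplaced objects that ended up correctly placed."""
+    if initially_misplaced == 0:
+        # Assumed convention: nothing was misplaced to begin with, so count the task as solved.
+        return 1.0
+    return initially_misplaced_now_correct / initially_misplaced
+```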
@@ -110,7 +110,7 @@ The Tool Calling Agent Benchmark is the benchmark for LangChain tool calling age

 - [Tool Calling Agent Benchmark](rai_bench//tool_calling_agent/benchmark.py) - Benchmark for LangChain tool calling agents
 - [Scores tracing](rai_bench/results_processing/langfuse_scores_tracing.py) - Component that sends scores to tracing backends
-- [Interfaces](rai_bench//tool_calling_agent/interfaces.py) - Interfaces for validation classes - Task, Validator, SubTask
+- [Interfaces](rai_bench//tool_calling_agent/interfaces.py) - Interfaces for validation classes - `Task`, `Validator`, `SubTask`

  For a detailed description of validation, see [Validation](./rai_bench/docs/tool_calling_agent_benchmark.md)

From c0c4e4e2fc0d268ccf1abc2fbbce41ae2f7a63fb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pawe=C5=82=20Kotowski?=
Date: Mon, 18 Aug 2025 17:06:32 +0200
Subject: [PATCH 4/5] docs: fix spaces

---
 docs/simulation_and_benchmarking/rai_bench.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/simulation_and_benchmarking/rai_bench.md b/docs/simulation_and_benchmarking/rai_bench.md
index ea3ff34f1..09258c9f2 100644
--- a/docs/simulation_and_benchmarking/rai_bench.md
+++ b/docs/simulation_and_benchmarking/rai_bench.md
@@ -96,9 +96,9 @@ Evaluates agent performance independently from any simulation, based only on too
 The `SubTask` class is used to validate just one tool call. Following classes are available:

 - `CheckArgsToolCallSubTask` - verify if a certain tool was called with expected arguments
-- `CheckTopicFieldsToolCallSubTask` - verify if a message published to ROS 2topic was of proper type and included expected fields
-- `CheckServiceFieldsToolCallSubTask` - verify if a message published to ROS 2service was of proper type and included expected fields
-- `CheckActionFieldsToolCallSubTask` - verify if a message published to ROS 2action was of proper type and included expected fields
+- `CheckTopicFieldsToolCallSubTask` - verify if a message published to a ROS2 topic was of the proper type and included the expected fields
+- `CheckServiceFieldsToolCallSubTask` - verify if a message sent to a ROS2 service was of the proper type and included the expected fields
+- `CheckActionFieldsToolCallSubTask` - verify if a message sent to a ROS2 action was of the proper type and included the expected fields
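+
+For illustration, configuring one of these subtasks could look roughly like this (a hypothetical sketch - the tool name and constructor arguments are illustrative, see the interfaces module for the real signatures):
+
+```python
+# Hypothetical configuration; the tool name and parameters below are placeholders.
+topic_args_check = CheckArgsToolCallSubTask(
+    expected_tool_name="get_ros2_topics_names_and_types",
+    expected_args={},
+)
+```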
 ### Validator

From 6988ba07a72c6745f5fda781ae32933b7caf37ca Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pawe=C5=82=20Kotowski?=
Date: Mon, 18 Aug 2025 17:40:40 +0200
Subject: [PATCH 5/5] fix example yaml

---
 docs/tutorials/benchmarking.md | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/docs/tutorials/benchmarking.md b/docs/tutorials/benchmarking.md
index db2a56cd6..f46cfcb24 100644
--- a/docs/tutorials/benchmarking.md
+++ b/docs/tutorials/benchmarking.md
@@ -22,21 +22,21 @@ If your goal is creating custom tasks and scenarios, visit [Creating Custom Task
 level: RoboticManipulationBenchmark
 robotic_stack_command: ros2 launch examples/manipulation-demo-no-binary.launch.py
 required_simulation_ros2_interfaces:
-    services:
-    - /spawn_entity
-    - /delete_entity
-    topics:
-    - /color_image5
-    - /depth_image5
-    - /color_camera_info5
-    actions: []
+    services:
+        - /spawn_entity
+        - /delete_entity
+    topics:
+        - /color_image5
+        - /depth_image5
+        - /color_camera_info5
+    actions: []
 required_robotic_ros2_interfaces:
-    services:
-    - /grounding_dino_classify
-    - /grounded_sam_segment
-    - /manipulator_move_to
-    topics: []
-    actions: []
+    services:
+        - /grounding_dino_classify
+        - /grounded_sam_segment
+        - /manipulator_move_to
+    topics: []
+    actions: []
 ```

- Run the benchmark with: