docs: improve rai_bench readme (#674)

pawel-kotowski · web-flow · commit fbf056f5e7a0 · 2025-08-19T11:29:27.000+02:00
diff --git a/docs/simulation_and_benchmarking/rai_bench.md b/docs/simulation_and_benchmarking/rai_bench.md
@@ -96,9 +96,9 @@ Evaluates agent performance independently from any simulation, based only on too
 The `SubTask` class is used to validate just one tool call. Following classes are available:
 
 -   `CheckArgsToolCallSubTask` - verify if a certain tool was called with expected arguments
--   `CheckTopicFieldsToolCallSubTask` - verify if a message published to ROS 2topic was of proper type and included expected fields
--   `CheckServiceFieldsToolCallSubTask` - verify if a message published to ROS 2service was of proper type and included expected fields
--   `CheckActionFieldsToolCallSubTask` - verify if a message published to ROS 2action was of proper type and included expected fields
+-   `CheckTopicFieldsToolCallSubTask` - verify if a message published to a ROS2 topic was of proper type and included expected fields
+-   `CheckServiceFieldsToolCallSubTask` - verify if a message published to a ROS2 service was of proper type and included expected fields
+-   `CheckActionFieldsToolCallSubTask` - verify if a message published to a ROS2 action was of proper type and included expected fields
 
 ### Validator
 
diff --git a/docs/tutorials/benchmarking.md b/docs/tutorials/benchmarking.md
@@ -22,21 +22,21 @@ If your goal is creating custom tasks and scenarios, visit [Creating Custom Task
     level: RoboticManipulationBenchmark
     robotic_stack_command: ros2 launch examples/manipulation-demo-no-binary.launch.py
     required_simulation_ros2_interfaces:
-    services:
-        - /spawn_entity
-        - /delete_entity
-    topics:
-        - /color_image5
-        - /depth_image5
-        - /color_camera_info5
-    actions: []
+        services:
+            - /spawn_entity
+            - /delete_entity
+        topics:
+            - /color_image5
+            - /depth_image5
+            - /color_camera_info5
+        actions: []
     required_robotic_ros2_interfaces:
-    services:
-        - /grounding_dino_classify
-        - /grounded_sam_segment
-        - /manipulator_move_to
-    topics: []
-    actions: []
+        services:
+            - /grounding_dino_classify
+            - /grounded_sam_segment
+            - /manipulator_move_to
+        topics: []
+        actions: []
     ```
 -   Run the benchmark with:
 
diff --git a/src/rai_bench/README.md b/src/rai_bench/README.md
@@ -10,9 +10,9 @@ The Manipulation O3DE Benchmark [manipulation_o3de_benchmark_module](./rai_bench
 -   **GroupObjectsTask**
 -   **BuildCubeTowerTask**
 -   **PlaceObjectAtCoordTask**
--   **RotateObjectTask** (currently not applicable due to limitations in the ManipulatorMoveTo tool)
+-   **RotateObjectTask** (currently not applicable due to limitations in the `ManipulatorMoveTo` tool)
 
-The result of a task is a value between 0 and 1, calculated like initially_misplaced_now_correct / initially_misplaced. This score is calculated at the end of each scenario.
+The result of a task is a value between 0 and 1, calculated as `initially_misplaced_now_correct / initially_misplaced`. This score is calculated at the end of each scenario.
 
 ### Frame Components
 
@@ -92,15 +92,15 @@ python src/rai_bench/rai_bench/examples/manipulation_o3de/main.py --model-name l
 ```
 
 > [!NOTE]
-> For now benchmark runs all available scenarios (~160). See [Examples](#example-usege)
+> For now, the benchmark runs all available scenarios (~160). See the [Examples](#example-usage)
 > section for details.
 
 ### Development
 
 When creating new task or changing existing ones, make sure to add unit tests for score calculation in [rai_bench_tests](../../tests/rai_bench/manipulation_o3de/tasks/).
 This applies also when you are adding or changing the helper methods in `Task` or `ManipulationTask`.
 
-The number of scenarios can be easily extened without writing new tasks, by increasing number of variants of the same task and adding more simulation configs but it won't improve variety of scenarios as much as creating new tasks.
+The number of scenarios can be easily extended without writing new tasks, by increasing the number of variants of the same task and adding more simulation configs, but this won't improve the variety of scenarios as much as creating new tasks.
 
 ## Tool Calling Agent Benchmark
 
@@ -109,15 +109,16 @@ The Tool Calling Agent Benchmark is the benchmark for LangChain tool calling age
 ### Frame Components
 
 -   [Tool Calling Agent Benchmark](rai_bench//tool_calling_agent/benchmark.py) - Benchmark for LangChain tool calling agents
--   [Scores tracing](rai_bench/tool_calling_agent_bench/scores_tracing.py) - Component handling sending scores to tracing backends
--   [Interfaces](rai_bench//tool_calling_agent/interfaces.py) - Interfaces for validation classes - Task, Validator, SubTask
-    For detailed description of validation visit -> [Validation](.//rai_bench/docs/tool_calling_agent_benchmark.md)
+-   [Scores tracing](rai_bench/results_processing/langfuse_scores_tracing.py) - Component handling sending scores to tracing backends
+-   [Interfaces](rai_bench/tool_calling_agent/interfaces.py) - Interfaces for validation classes - `Task`, `Validator`, `SubTask`
 
-[tool_calling_agent_test_bench.py](rai_bench/examples/tool_calling_agent/main.py) - Script providing benchmark on tasks based on the ROS2 tools usage.
+    For a detailed description of validation, see [Validation](./rai_bench/docs/tool_calling_agent_benchmark.md)
+
+[tool_calling_agent_test_bench.py](./rai_bench/examples/tool_calling_agent/main.py) - Script that runs the benchmark on tasks based on ROS2 tool usage.
 
 ### Example Usage
 
-Validators can be constructed from any SubTasks, Tasks can be validated by any numer of Validators, which makes whole validation process incredibly versital.
+`Validators` can be constructed from any `SubTasks`, and `Tasks` can be validated by any number of `Validators`, which makes the whole validation process highly versatile.
 
 ```python
 # subtasks
@@ -144,7 +145,7 @@ GetROS2RGBCameraTask(validators=[topics_ord_val, color_image_ord_val]),
 
 ### Running
 
-To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/tracing.md) document.
+To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/setup/tracing.md) document.
 
 To run the benchmark:
 
@@ -169,7 +170,7 @@ The VLM Benchmark is a benchmark for VLM models. It includes a set of tasks cont
 
 ### Running
 
-To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/tracing.md) document.
+To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/setup/tracing.md) document.
 
 To run the benchmark:
 
@@ -181,7 +182,7 @@ python src/rai_bench/rai_bench/examples/vlm_benchmark.py --model-name gemma3:4b
 
 ## Testing Models
 
-To test multiple models, different benchamrks or couple repeats in one go - use script [test_models](./rai_bench/examples/test_models.py)
+To test multiple models, different benchmarks, or several repeats in one go, use the [test_models](./rai_bench/examples/test_models.py) script.
 
 Modify these params:
 
@@ -216,7 +217,7 @@ When you run a test via:
 python src/rai_bench/rai_bench/examples/test_models.py
 ```
 
-results will be saved to separate folder in [results](./rai_bench/experiments/), with prefix `run_`
+results will be saved to a separate folder in [experiments](./rai_bench/experiments/), with the prefix `run_`
 
 To visualise the results run: