Commit fbf056f

docs: improve rai_bench readme (#674)
1 parent f71f6c1 commit fbf056f

3 files changed: +31 -30 lines changed

docs/simulation_and_benchmarking/rai_bench.md

Lines changed: 3 additions & 3 deletions
@@ -96,9 +96,9 @@ Evaluates agent performance independently from any simulation, based only on too
 The `SubTask` class is used to validate just one tool call. Following classes are available:

 - `CheckArgsToolCallSubTask` - verify if a certain tool was called with expected arguments
-- `CheckTopicFieldsToolCallSubTask` - verify if a message published to ROS 2topic was of proper type and included expected fields
-- `CheckServiceFieldsToolCallSubTask` - verify if a message published to ROS 2service was of proper type and included expected fields
-- `CheckActionFieldsToolCallSubTask` - verify if a message published to ROS 2action was of proper type and included expected fields
+- `CheckTopicFieldsToolCallSubTask` - verify if a message published to ROS2 topic was of proper type and included expected fields
+- `CheckServiceFieldsToolCallSubTask` - verify if a message published to ROS2 service was of proper type and included expected fields
+- `CheckActionFieldsToolCallSubTask` - verify if a message published to ROS2 action was of proper type and included expected fields

 ### Validator

docs/tutorials/benchmarking.md

Lines changed: 14 additions & 14 deletions
@@ -22,21 +22,21 @@ If your goal is creating custom tasks and scenarios, visit [Creating Custom Task
 level: RoboticManipulationBenchmark
 robotic_stack_command: ros2 launch examples/manipulation-demo-no-binary.launch.py
 required_simulation_ros2_interfaces:
-services:
-- /spawn_entity
-- /delete_entity
-topics:
-- /color_image5
-- /depth_image5
-- /color_camera_info5
-actions: []
+services:
+- /spawn_entity
+- /delete_entity
+topics:
+- /color_image5
+- /depth_image5
+- /color_camera_info5
+actions: []
 required_robotic_ros2_interfaces:
-services:
-- /grounding_dino_classify
-- /grounded_sam_segment
-- /manipulator_move_to
-topics: []
-actions: []
+services:
+- /grounding_dino_classify
+- /grounded_sam_segment
+- /manipulator_move_to
+topics: []
+actions: []
 ```
 - Run the benchmark with:

src/rai_bench/README.md

Lines changed: 14 additions & 13 deletions
@@ -10,9 +10,9 @@ The Manipulation O3DE Benchmark [manipulation_o3de_benchmark_module](./rai_bench
 - **GroupObjectsTask**
 - **BuildCubeTowerTask**
 - **PlaceObjectAtCoordTask**
-- **RotateObjectTask** (currently not applicable due to limitations in the ManipulatorMoveTo tool)
+- **RotateObjectTask** (currently not applicable due to limitations in the `ManipulatorMoveTo` tool)

-The result of a task is a value between 0 and 1, calculated like initially_misplaced_now_correct / initially_misplaced. This score is calculated at the end of each scenario.
+The result of a task is a value between 0 and 1, calculated like `initially_misplaced_now_correct / initially_misplaced`. This score is calculated at the end of each scenario.

 ### Frame Components

@@ -92,15 +92,15 @@ python src/rai_bench/rai_bench/examples/manipulation_o3de/main.py --model-name l
 ```

 > [!NOTE]
-> For now benchmark runs all available scenarios (~160). See [Examples](#example-usege)
+> For now benchmark runs all available scenarios (~160). See [Examples](#example-usage)
 > section for details.

 ### Development

 When creating new task or changing existing ones, make sure to add unit tests for score calculation in [rai_bench_tests](../../tests/rai_bench/manipulation_o3de/tasks/).
 This applies also when you are adding or changing the helper methods in `Task` or `ManipulationTask`.

-The number of scenarios can be easily extened without writing new tasks, by increasing number of variants of the same task and adding more simulation configs but it won't improve variety of scenarios as much as creating new tasks.
+The number of scenarios can be easily extended without writing new tasks, by increasing number of variants of the same task and adding more simulation configs but it won't improve variety of scenarios as much as creating new tasks.

 ## Tool Calling Agent Benchmark

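The README hunks above define the task score as `initially_misplaced_now_correct / initially_misplaced` and ask for unit tests on score calculation. A minimal sketch of that formula with test-style checks might look like the following. The function name `task_score`, its argument validation, and the choice to return 1.0 when nothing was initially misplaced are assumptions made for illustration; they are not taken from `rai_bench`.

```python
def task_score(initially_misplaced: int, initially_misplaced_now_correct: int) -> float:
    """Hypothetical helper: fraction of initially misplaced objects that ended up correct."""
    if initially_misplaced < 0 or initially_misplaced_now_correct < 0:
        raise ValueError("counts must be non-negative")
    if initially_misplaced_now_correct > initially_misplaced:
        raise ValueError("cannot correct more objects than were misplaced")
    if initially_misplaced == 0:
        # Assumption: a scenario with nothing to fix counts as a perfect score.
        return 1.0
    return initially_misplaced_now_correct / initially_misplaced


# Test-style checks of the kind the Development section asks for.
assert task_score(4, 3) == 0.75
assert task_score(4, 0) == 0.0
assert task_score(4, 4) == 1.0
```

The guards keep the result inside the documented range of 0 to 1; how the real tasks treat the empty case would need to be checked against the repository.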
@@ -109,15 +109,16 @@ The Tool Calling Agent Benchmark is the benchmark for LangChain tool calling age
 ### Frame Components

 - [Tool Calling Agent Benchmark](rai_bench//tool_calling_agent/benchmark.py) - Benchmark for LangChain tool calling agents
-- [Scores tracing](rai_bench/tool_calling_agent_bench/scores_tracing.py) - Component handling sending scores to tracing backends
-- [Interfaces](rai_bench//tool_calling_agent/interfaces.py) - Interfaces for validation classes - Task, Validator, SubTask
-For detailed description of validation visit -> [Validation](.//rai_bench/docs/tool_calling_agent_benchmark.md)
+- [Scores tracing](rai_bench/results_processing/langfuse_scores_tracing.py) - Component handling sending scores to tracing backends
+- [Interfaces](rai_bench//tool_calling_agent/interfaces.py) - Interfaces for validation classes - `Task`, `Validator`, `SubTask`

-[tool_calling_agent_test_bench.py](rai_bench/examples/tool_calling_agent/main.py) - Script providing benchmark on tasks based on the ROS2 tools usage.
+For detailed description of validation visit -> [Validation](./rai_bench/docs/tool_calling_agent_benchmark.md)
+
+[tool_calling_agent_test_bench.py](./rai_bench/examples/tool_calling_agent/main.py) - Script providing benchmark on tasks based on the ROS2 tools usage.

 ### Example Usage

-Validators can be constructed from any SubTasks, Tasks can be validated by any numer of Validators, which makes whole validation process incredibly versital.
+`Validators` can be constructed from any `SubTasks`, `Tasks` can be validated by any number of `Validators`, which makes whole validation process incredibly versatile.

 ```python
 # subtasks
@@ -144,7 +145,7 @@ GetROS2RGBCameraTask(validators=[topics_ord_val, color_image_ord_val]),

 ### Running

-To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/tracing.md) document.
+To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/setup/tracing.md) document.

 To run the benchmark:

@@ -169,7 +170,7 @@ The VLM Benchmark is a benchmark for VLM models. It includes a set of tasks cont

 ### Running

-To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/tracing.md) document.
+To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/setup/tracing.md) document.

 To run the benchmark:

@@ -181,7 +182,7 @@ python src/rai_bench/rai_bench/examples/vlm_benchmark.py --model-name gemma3:4b

 ## Testing Models

-To test multiple models, different benchamrks or couple repeats in one go - use script [test_models](./rai_bench/examples/test_models.py)
+To test multiple models, different benchmarks or couple repeats in one go - use script [test_models](./rai_bench/examples/test_models.py)

 Modify these params:

@@ -216,7 +217,7 @@ When you run a test via:
 python src/rai_bench/rai_bench/examples/test_models.py
 ```

-results will be saved to separate folder in [results](./rai_bench/experiments/), with prefix `run_`
+results will be saved to separate folder in [experiments](./rai_bench/experiments/), with prefix `run_`

 To visualise the results run:
