docs: add data generation tutorial for synthesized data pipeline #238
Open: yvvonie wants to merge 7 commits into DexForce:main from yvvonie:GYY_Tutorial_Add
+190
−0
Commits:
23e4d75 docs: add data generation tutorial for synthesized data pipeline (yvvonie)
aa3d2ed Merge main (yvvonie)
e42e49f docs: update data generation tutorial paths (yvvonie)
df75ead Merge branch 'main' of https://github.com/DexForce/EmbodiChain into d… (yvvonie)
fab9843 Merge branch 'main' into GYY_Tutorial_Add (yvvonie)
d828c60 docs: update data generation tutorial (yvvonie)
ed454b4 Merge docs/data-generation-tutorial into GYY_Tutorial_Add, keeping th… (yvvonie)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,189 @@ | ||
| .. _tutorial_data_generation: | ||
|
|
||
| Data Generation | ||
| =============== | ||
|
|
||
| .. currentmodule:: embodichain.lab.gym | ||
|
|
||
| This tutorial shows how to generate synthetic expert demonstration datasets using EmbodiChain's built-in environment rollout and dataset manager. You will learn how to configure LeRobot recording in ``gym_config.json``, how ``run_env.py`` builds an environment from configuration files, and how completed episodes are automatically saved to disk. | ||
|
|
||
| Overview | ||
| ~~~~~~~~ | ||
|
|
||
| EmbodiChain provides a built-in data generation workflow for imitation-learning and manipulation tasks: | ||
|
|
||
| - **Gym Configuration**: Describes the scene, robot, sensors, randomization events, observations, dataset recorder, and rollout settings. | ||
| - **Action Configuration**: Describes the task-specific expert action graph for tasks that use the action bank. | ||
| - **Environment Rollout**: Builds the environment directly from configuration files and executes offline generation. | ||
| - **Expert Policy**: Each task provides ``create_demo_action_list()`` or another scripted-policy entry point that generates expert actions. | |
| - **Dataset Manager**: Records observation-action pairs during ``env.step()``. | ||
| - **LeRobotRecorder**: Converts completed episodes into LeRobot-compatible datasets, with optional video export. | ||
|
|
||
| What This Tutorial Records | ||
| -------------------------- | ||
|
|
||
| This page documents the full path from task configuration to saved dataset: | ||
|
|
||
| 1. Prepare a task ``gym_config.json``. | ||
| 2. Prepare an ``action_config.json`` if the task uses the action bank. | ||
| 3. Launch the environment rollout with ``run-env``. | ||
| 4. Let the dataset manager automatically save completed episodes. | ||
|
|
||
| Example Task | ||
| ------------ | ||
|
|
||
| As a concrete example, this tutorial uses a real action-bank task shipped in the repository: | ||
|
|
||
| - ``configs/gym/pour_water/gym_config.json`` defines the simulation scene and dataset recording behavior. | ||
| - ``configs/gym/pour_water/action_config.json`` defines the action-bank graph used to solve the task. | ||
|
|
||
| The Code | ||
| ~~~~~~~~ | ||
|
|
||
| The tutorial corresponds to the ``run_env.py`` script in ``embodichain/lab/scripts``. | ||
|
|
||
| .. dropdown:: Code for run_env.py | |
|    :icon: code | |
|
|
||
|    .. literalinclude:: ../../../embodichain/lab/scripts/run_env.py | |
|       :language: python | |
|       :linenos: | |
|
|
||
|
|
||
| The Code Explained | ||
| ~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| The rollout script builds the environment from configuration, generates expert trajectories, executes them step by step, and relies on the dataset manager to auto-save valid episodes. | ||
|
|
||
| Step 1: Prepare the Task Configuration | ||
| -------------------------------------- | ||
|
|
||
| The first input to the pipeline is the task ``gym_config.json``. In the example below, the same file contains rollout settings, scene randomization, observations, dataset recording, and robot or sensor definitions. | ||
|
|
||
| The rollout settings include the episode count: | ||
|
|
||
| .. literalinclude:: ../../../configs/gym/pour_water/gym_config.json | |
|    :language: json | |
|    :lines: 2-4 | |
|
|
||
| The dataset-related part looks like this: | ||
|
|
||
| .. literalinclude:: ../../../configs/gym/pour_water/gym_config.json | |
|    :language: json | |
|    :lines: 261-281 | |
|
|
||
| Important parameters are: | ||
|
|
||
| - **max_episodes**: Number of rollout episodes generated by ``run_env.py``. | ||
| - **max_episode_steps**: Maximum number of environment steps per episode. | ||
| - **dataset.lerobot.params.robot_meta**: Robot metadata such as robot type and control frequency. | ||
| - **dataset.lerobot.params.instruction**: Task language instruction stored together with the dataset. | ||
| - **dataset.lerobot.params.extra**: Additional metadata such as scene type and task description. | ||
| - **dataset.lerobot.params.use_videos**: Whether camera observations should be stored as videos. | ||
| - **env.control_parts**: Controlled robot parts in the environment. | ||
|
|
||
|
|
||
| In the current implementation, ``LeRobotRecorder`` stores robot state and action features such as ``observation.qpos``, ``observation.qvel``, ``observation.qf``, ``action``, and camera images when sensors are present. | ||
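The dotted parameter names above can be read as key paths into the parsed JSON. The following is a minimal sketch of that lookup, not code from the repository: the sample config is a hypothetical, heavily trimmed stand-in in which only the key names come from the list above, while the nesting of ``max_episodes`` and all values are assumptions.

```python
import json

# Hypothetical stand-in for gym_config.json. Only the key names are taken from
# the parameter list in the tutorial; the layout and every value are assumed.
SAMPLE_CONFIG = json.loads("""
{
  "max_episodes": 10,
  "max_episode_steps": 600,
  "env": {"control_parts": ["left_arm", "right_arm"]},
  "dataset": {
    "lerobot": {
      "params": {
        "robot_meta": {"robot_type": "example_robot", "control_freq": 25},
        "instruction": "Pour water into the cup.",
        "extra": {"scene_type": "example_scene"},
        "use_videos": true
      }
    }
  }
}
""")

DOCUMENTED_PATHS = (
    "max_episodes",
    "max_episode_steps",
    "dataset.lerobot.params.robot_meta",
    "dataset.lerobot.params.instruction",
    "dataset.lerobot.params.extra",
    "dataset.lerobot.params.use_videos",
    "env.control_parts",
)


def get_path(config, dotted_path, default=None):
    """Resolve a dotted key path such as 'dataset.lerobot.params.use_videos'."""
    node = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def summarize(config):
    """Collect the documented parameters; missing ones are reported as None."""
    return {path: get_path(config, path) for path in DOCUMENTED_PATHS}


if __name__ == "__main__":
    for path, value in summarize(SAMPLE_CONFIG).items():
        print(f"{path}: {value}")
```

Running a check like this before a long generation job is a cheap way to catch a missing recorder section early, since a missing path shows up as ``None`` instead of failing midway through a rollout.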
|
|
||
| Step 2: Prepare the Action Configuration | ||
| ---------------------------------------- | ||
|
|
||
| For tasks that use the action bank, the second input is ``action_config.json``. This file defines the expert action graph consumed by ``create_demo_action_list()``. In the example below, the file is organized around ``scope``, ``node``, ``edge``, and ``sync``. | ||
|
|
||
| .. dropdown:: Action bank structure in the example task Pour_Water | |
|    :icon: code | |
|
|
||
|    **Scope Configuration** | |
|
|
||
|    .. literalinclude:: ../../../configs/gym/pour_water/action_config.json | |
|       :language: json | |
|       :lines: 2-57 | |
|
|
||
|    **Node Configuration** | |
|
|
||
|    .. literalinclude:: ../../../configs/gym/pour_water/action_config.json | |
|       :language: json | |
|       :lines: 96-177 | |
|
|
||
|    **Edge Configuration** | |
|
|
||
|    .. literalinclude:: ../../../configs/gym/pour_water/action_config.json | |
|       :language: json | |
|       :lines: 763-790 | |
|
|
||
|    **Synchronization** | |
|
|
||
|    .. literalinclude:: ../../../configs/gym/pour_water/action_config.json | |
|       :language: json | |
|       :lines: 906-932 | |
|
|
||
| This structure defines the expert rollout as follows: | ||
|
|
||
| - **Scope**: Defines controllable sub-graphs such as ``right_arm``, ``left_arm``, ``right_eef``, and ``left_eef``. | ||
| - **Node**: Defines key poses, targets computed from object affordances, and IK-generated joint targets. | ||
| - **Edge**: Defines executable transitions between nodes, including duration and execution function. | ||
| - **Sync**: Defines execution order rules between independently configured sub-actions. | ||
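To make the four roles concrete, here is a small illustrative model of such a graph. It is a sketch under stated assumptions, not the real ``action_config.json`` schema: the ``Edge`` class, the node and edge names, and the representation of sync rules as "edge X waits for edges Y" are all hypothetical. It checks that edges only reference known scopes and nodes, and derives one execution order that respects the sync rules (``graphlib`` requires Python 3.9 or newer).

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Edge:
    """Hypothetical transition between two nodes of one scope."""
    scope: str     # sub-graph that executes the transition, e.g. "right_arm"
    src: str       # name of the start node
    dst: str       # name of the target node
    duration: int  # number of control steps for the transition


def validate_graph(nodes_by_scope, edges):
    """Return a list of problems; an empty list means the graph is consistent."""
    problems = []
    for name, edge in edges.items():
        known = nodes_by_scope.get(edge.scope)
        if known is None:
            problems.append(f"{name}: unknown scope {edge.scope!r}")
            continue
        for node in (edge.src, edge.dst):
            if node not in known:
                problems.append(
                    f"{name}: unknown node {node!r} in scope {edge.scope!r}"
                )
        if edge.duration <= 0:
            problems.append(f"{name}: duration must be positive")
    return problems


def execution_order(edges, sync):
    """Order edge names so that every edge comes after its prerequisites."""
    graph = {name: set(sync.get(name, ())) for name in edges}
    return list(TopologicalSorter(graph).static_order())


# Illustrative data only; these names do not come from the Pour_Water config.
NODES = {
    "right_arm": {"home", "pre_grasp", "grasp"},
    "left_arm": {"home", "pre_grasp"},
}
EDGES = {
    "right_reach": Edge("right_arm", "home", "pre_grasp", 30),
    "right_grasp": Edge("right_arm", "pre_grasp", "grasp", 20),
    "left_reach": Edge("left_arm", "home", "pre_grasp", 30),
}
# "left_reach" waits for "right_grasp", which waits for "right_reach".
SYNC = {"left_reach": {"right_grasp"}, "right_grasp": {"right_reach"}}

if __name__ == "__main__":
    print(validate_graph(NODES, EDGES))
    print(execution_order(EDGES, SYNC))
```

A topological sort is used here only because it is the simplest way to express "ordering rules between independent sub-actions"; how EmbodiChain actually schedules synchronized edges is not described in this tutorial.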
|
|
||
| Note: The action bank is not the only way to generate demonstrations. Depending on the task design, trajectories can also be produced by other scripted generation methods. | |
|
|
||
| Step 3: Launch the Environment Rollout | ||
| -------------------------------------- | ||
|
|
||
| The rollout script parses command-line arguments, loads ``gym_config.json`` and ``action_config.json``, converts them into environment configuration objects, creates the environment instance, and then runs offline rollout for ``max_episodes`` episodes: | ||
|
|
||
| .. literalinclude:: ../../../embodichain/lab/scripts/run_env.py | |
|    :language: python | |
|    :start-at: def cli(): | |
|    :end-at: main(args, env, gym_config) | |
|
|
||
| Each rollout internally calls ``create_demo_action_list()``, validates the returned sequence, executes actions with ``env.step(action)``, and discards invalid rollouts by resetting with ``save_data=False``. | ||
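The control flow described above can be sketched with a stub environment. This is not the code of ``run_env.py``: ``StubEnv``, ``is_valid``, and ``generate`` are hypothetical names, the ``reset(save_data=...)`` signature is assumed from the sentence above, and the sketch further assumes that an episode is saved at reset time and that invalid attempts count toward ``max_episodes``.

```python
class StubEnv:
    """Stand-in for a real environment; nothing here is EmbodiChain API."""

    def __init__(self, plans):
        self._plans = iter(plans)   # one scripted plan per rollout attempt
        self.saved_episodes = 0
        self._steps_in_episode = 0

    def create_demo_action_list(self):
        # A real task would plan expert actions for the current scene.
        return next(self._plans)

    def step(self, action):
        # A real dataset manager would record an observation-action pair here.
        self._steps_in_episode += 1

    def reset(self, save_data=True):
        # Assumption: the buffered episode is written out only when
        # save_data is true and at least one step was recorded.
        if save_data and self._steps_in_episode > 0:
            self.saved_episodes += 1
        self._steps_in_episode = 0


def is_valid(actions):
    """Minimal validity check: the expert must return a non-empty sequence."""
    return actions is not None and len(actions) > 0


def generate(env, max_episodes):
    """Run up to max_episodes attempts and return how many were kept."""
    kept = 0
    for _ in range(max_episodes):
        actions = env.create_demo_action_list()
        if not is_valid(actions):
            env.reset(save_data=False)  # discard the failed attempt
            continue
        for action in actions:
            env.step(action)
        env.reset(save_data=True)       # finish and keep the episode
        kept += 1
    return kept


if __name__ == "__main__":
    env = StubEnv([[0.1, 0.2], None, [0.3]])
    print(generate(env, max_episodes=3), env.saved_episodes)
```

The point of the sketch is the branch on validity: a failed plan never reaches ``env.step()``, so nothing from that attempt can end up in the dataset.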
|
|
||
| The recommended CLI entrypoint is: | ||
|
|
||
| .. code-block:: bash | |
|
|
||
|    python -m embodichain run-env \ | |
|        --gym_config configs/gym/pour_water/gym_config.json \ | |
|        --action_config configs/gym/pour_water/action_config.json \ | |
|        --headless | |
|
|
||
| For interactive inspection, replace ``--headless`` with ``--preview``. | |
| Preview mode opens the environment in an interactive debugging session. It is meant for inspection only and does not save datasets. | |
|
|
||
|
|
||
| Useful CLI arguments: | ||
|
|
||
| - ``--gym_config``: Path to the task JSON configuration. | |
| - ``--action_config``: Path to the action-bank configuration. | |
| - ``--num_envs``: Number of environments to run in parallel. | |
| - ``--device``: Simulation device, such as ``cpu`` or ``cuda``. | |
| - ``--headless``: Run without GUI for faster generation. | |
| - ``--enable_rt``: Enable ray tracing for higher-quality visual observations. | |
| - ``--preview``: Launch the environment in interactive preview mode. | |
| - ``--filter_dataset_saving``: Disable dataset saving for debugging. | |
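When generation is launched from another script, for example to sweep over several tasks, the command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of EmbodiChain. It uses only flags listed above and assumes that ``--num_envs`` and ``--device`` take a space-separated value, which this tutorial does not show explicitly.

```python
import shlex


def build_run_env_command(gym_config, action_config=None, preview=False,
                          num_envs=None, device=None):
    """Assemble the argument list for `python -m embodichain run-env`."""
    cmd = ["python", "-m", "embodichain", "run-env", "--gym_config", gym_config]
    if action_config:
        cmd += ["--action_config", action_config]
    if num_envs is not None:
        # Assumed value syntax: "--num_envs 4".
        cmd += ["--num_envs", str(num_envs)]
    if device:
        # Assumed value syntax: "--device cuda".
        cmd += ["--device", device]
    # Mirrors the tutorial: preview replaces headless rather than adding to it.
    cmd.append("--preview" if preview else "--headless")
    return cmd


if __name__ == "__main__":
    cmd = build_run_env_command(
        "configs/gym/pour_water/gym_config.json",
        action_config="configs/gym/pour_water/action_config.json",
    )
    # Print a copy-pasteable command; pass `cmd` to subprocess.run to execute.
    print(shlex.join(cmd))
```

Building a list instead of a single string avoids shell-quoting problems if a config path contains spaces.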
|
|
||
| For the complete CLI argument list, see :doc:`CLI Reference </guides/cli>`. | ||
|
|
||
| Outputs | ||
| ~~~~~~~ | ||
|
|
||
| After successful execution, completed episodes are saved under the configured dataset root. If no explicit save path is provided and ``EMBODICHAIN_DATASET_ROOT`` is not set, ``LeRobotRecorder`` uses ``~/.cache/embodichain_datasets`` as the default dataset root. | |
|
|
||
| A LeRobot dataset typically contains: | |
|
|
||
| - **data/**: Recorded action and state data. | ||
| - **videos/**: Camera observations saved as videos when ``use_videos=True``. | ||
| - **meta/**: Dataset metadata such as task information and robot description. | ||
|
|
||
| Dataset folders are automatically numbered, which makes it easy to run repeated generations without overwriting previous results. | ||
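The fallback rule for the dataset root and the expected folder layout can be sketched as follows. This is an illustration, not the ``LeRobotRecorder`` implementation: the function names are hypothetical, and the precedence of an explicit path over ``EMBODICHAIN_DATASET_ROOT`` is an assumption, since the text above only states what happens when neither is given.

```python
import os
from pathlib import Path

DEFAULT_ROOT = Path("~/.cache/embodichain_datasets")
EXPECTED_SUBDIRS = ("data", "videos", "meta")


def resolve_dataset_root(explicit_path=None, environ=None):
    """Pick the dataset root: explicit path, then env variable, then default."""
    environ = os.environ if environ is None else environ
    if explicit_path:
        return Path(explicit_path).expanduser()
    env_root = environ.get("EMBODICHAIN_DATASET_ROOT")
    if env_root:
        return Path(env_root).expanduser()
    return DEFAULT_ROOT.expanduser()


def inspect_dataset(dataset_dir):
    """Report which of the expected sub-folders exist in one dataset folder."""
    dataset_dir = Path(dataset_dir)
    return {name: (dataset_dir / name).is_dir() for name in EXPECTED_SUBDIRS}


if __name__ == "__main__":
    root = resolve_dataset_root()
    print("dataset root:", root)
    if root.is_dir():
        # Each child folder is assumed to be one generated dataset.
        for child in sorted(p for p in root.iterdir() if p.is_dir()):
            print(child.name, inspect_dataset(child))
```

Note that ``videos/`` is only expected when ``use_videos`` is enabled, so a ``False`` entry for it is not necessarily an error.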
|
|
||
| In a practical workflow, the output of this stage is the synthesized dataset itself. Later training scripts typically consume these saved LeRobot episodes instead of regenerating trajectories each time. | ||
|
|
||
| Best Practices | ||
| ~~~~~~~~~~~~~~ | ||
|
|
||
| - **Keep the config pair together**: Version ``gym_config.json`` and ``action_config.json`` together for action-bank tasks. | ||
| - **Use valid scripted policies**: Make sure ``create_demo_action_list()`` returns executable trajectories for the current scene. | ||
| - **Run headless for throughput**: Pass ``--headless`` to disable the GUI when generating large datasets. | |
| - **Debug without writing datasets**: Use ``--preview`` and ``--filter_dataset_saving`` to inspect task logic without saving anything. | |
| - **Discard invalid rollouts**: Keep the default validation logic so failed trajectories are not saved. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,5 +17,6 @@ Tutorials | |
| gizmo | ||
| basic_env | ||
| modular_env | ||
| data_generation | ||
|
Contributor
Move above rl section
||
| rl | ||
|
|
||
Using `Environment Rollout` would be better.