diff --git a/README.md b/README.md index 34634aeaf..aa8284e2b 100644 --- a/README.md +++ b/README.md @@ -157,6 +157,12 @@ If you would like to try on your computer: |------------:|:------------:|:------------:|:------------:|:------------:| | [rvc](/audio_processing/rvc/) | [Retrieval-based-Voice-Conversion-WebUI](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) | Pytorch | 1.2.12 and later | [JP](https://medium.com/axinc/rvc-ai%E3%82%92%E4%BD%BF%E7%94%A8%E3%81%97%E3%81%9F%E3%83%9C%E3%82%A4%E3%82%B9%E3%83%81%E3%82%A7%E3%83%B3%E3%82%B8%E3%83%A3%E3%83%BC-64a813c7a0c4) | +## Autonomous Driving + +| Model | Reference | Exported From | Supported Ailia Version | Blog | +|------------:|:------------:|:------------:|:------------:|:------------:| +|[uniad](/autonomous_driving/uniad/) | [UniAD: Unified Driving](https://github.com/OpenDriveLab/UniAD) | Pytorch | 1.6.1 and later | | + ## Background removal | | Model | Reference | Exported From | Supported Ailia Version | Blog | diff --git a/autonomous_driving/uniad/LICENSE b/autonomous_driving/uniad/LICENSE new file mode 100644 index 000000000..261eeb9e9 --- /dev/null +++ b/autonomous_driving/uniad/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. 
For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. 
For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. 
If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. 
You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. 
Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. 
+ + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/autonomous_driving/uniad/README.md b/autonomous_driving/uniad/README.md new file mode 100644 index 000000000..3ccb4d22c --- /dev/null +++ b/autonomous_driving/uniad/README.md @@ -0,0 +1,161 @@ +# UniAD + +Unified autonomous driving inference and visualization pipeline on NuScenes. Uses ONNX Runtime or ailia to run BEV encoder, tracking, motion, occupancy, and planning heads; streams per-frame results to HDF5, then visualizes to images and video. + +## Requirements + +### Dependencies + +Install the required Python packages: + +```bash +pip3 install h5py +pip3 install casadi==3.6.7 +pip3 install nuscenes-devkit==1.1.11 +pip3 install pyquaternion==0.9.9 +``` + +### Dataset Setup + +Download the nuScenes V1.0 full dataset, CAN bus extension from [HERE](https://www.nuscenes.org/download), then place the dataset files under `data/nuscenes/`. 
+
+Download nuScenes, CAN_bus and Map extensions:
+
+```shell
+cd uniad
+mkdir data
+cd data
+mkdir nuscenes && cd nuscenes
+# Download nuScenes V1.0 full dataset data directly to (or soft link to) uniad/data/nuscenes/
+# Download CAN_bus directly to (or soft link to) uniad/data/nuscenes/
+# Download the Map expansion directly to (or soft link to) uniad/data/nuscenes/maps/
+```
+
+Prepare UniAD data info (Optional):
+
+```shell
+cd uniad/data
+mkdir infos && cd infos
+wget https://github.com/OpenDriveLab/UniAD/releases/download/v1.0/nuscenes_infos_temporal_train.pkl  # train_infos
+wget https://github.com/OpenDriveLab/UniAD/releases/download/v1.0/nuscenes_infos_temporal_val.pkl  # val_infos
+```
+
+Directory structure:
+
+```
+uniad
+├── data/
+│   ├── nuscenes/
+│   │   ├── can_bus/
+│   │   ├── maps/
+│   │   ├── samples/
+│   │   ├── sweeps/
+│   │   ├── v1.0-test/
+│   │   ├── v1.0-trainval/
+│   ├── infos/
+│   │   ├── nuscenes_infos_temporal_train.pkl
+│   │   ├── nuscenes_infos_temporal_val.pkl
+```
+
+## Usage
+
+The onnx and prototxt files are downloaded automatically on the first run; an internet connection is required for the initial download.
+
+To run with the default settings,
+```bash
+python3 uniad.py
+```
+
+This will process scenes `scene-0102` and `scene-0103` from `v1.0-trainval`, then visualize `scene-0103`.
+
+### Workflow
+
+The typical workflow consists of two steps:
+
+1. Run model inference on the specified scenes and save the results to an HDF5 file
+2. 
Load HDF5 results and generate visualization images/videos + +### Step 1: Run Inference + +Run model inference and save results to HDF5: + +```bash +python3 uniad.py --results_file all_frames.h5 \ + --data_root data/nuscenes/ \ + --version v1.0-trainval \ + --scenes scene-0102 --scenes scene-0103 +``` + +Parameters: + +- `--data_root`: Path to NuScenes dataset root (default: `data/nuscenes/`) +- `--version`: Dataset version: `v1.0-trainval`, `v1.0-test`, or `v1.0-mini` (default: `v1.0-trainval`) +- `--scenes`: Scene name(s) to process, repeatable (default: `scene-0102`, `scene-0103`) +- `--results_file`: Output HDF5 file path (default: `all_frames.h5`) + +Process multiple scenes: + +```bash +python3 uniad.py --results_file my_results.h5 \ + --scenes scene-0655 --scenes scene-0796 --scenes scene-0916 +``` + +### Step 2: Visualize Results + +Load HDF5 file and generate visualizations: + +```bash +python3 uniad.py --visualize all_frames.h5 \ + --data_root data/nuscenes/ \ + --version v1.0-trainval \ + --vis_scenes scene-0103 +``` + +This skips model inference and loads results directly from HDF5. Output includes: +- Visualization images (JPG) in `vis_output/` directory +- Video file as `output.avi` + +Parameters: + +- `--visualize`: Path to HDF5 file (required) +- `--vis_scenes`: Scene(s) to visualize, repeatable (default: all scenes in HDF5) +- `--data_root`: Path to NuScenes dataset root (default: `data/nuscenes/`) +- `--version`: Dataset version (default: `v1.0-trainval`) + +Note: Frame rate is set to 4 fps and downsample factor is 2 (hardcoded). 
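The saved results can also be inspected directly with `h5py`. A minimal sketch: it writes an illustrative stand-in file (`example_frames.h5`, with placeholder attribute values and dataset shapes, not output from a real run) using the frame-group layout described under "HDF5 Output Format", then reads it back:

```python
import h5py
import numpy as np

# Write a tiny stand-in file with the frame_XXXXXX group layout
# (token/scene_token/command attributes plus per-frame datasets).
# All values and shapes here are illustrative placeholders.
with h5py.File("example_frames.h5", "w") as f:
    g = f.create_group("frame_000000")
    g.attrs["token"] = "sample-token"
    g.attrs["scene_token"] = "scene-token"
    g.attrs["command"] = 2
    g.create_dataset("boxes_3d", data=np.zeros((5, 9), dtype=np.float32))
    g.create_dataset("scores_3d", data=np.ones((5,), dtype=np.float32))

# Walk the frames in order and print a short summary per frame.
with h5py.File("example_frames.h5", "r") as f:
    for name in sorted(f.keys()):
        frame = f[name]
        print(name, frame.attrs["scene_token"], frame["boxes_3d"].shape)
```

A file produced by `uniad.py` can be walked the same way by substituting `all_frames.h5`.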
+ +Visualize specific scene: + +```bash +python3 uniad.py --visualize all_frames.h5 --vis_scenes scene-0102 +``` + +Visualize all scenes: + +```bash +python3 uniad.py --visualize all_frames.h5 +``` + +### Available Scenes + +v1.0-trainval (850 scenes): +- Training: `scene-0001` to `scene-0700` (700 scenes) +- Validation: `scene-0701` to `scene-0850` (150 scenes) +- Recommended for testing: `scene-0102`, `scene-0103`, `scene-0655`, `scene-0796`, `scene-0916` + +v1.0-test (150 scenes): +- `scene-0001` to `scene-0150` + +v1.0-mini (10 scenes, for quick testing): +- `scene-0061`, `scene-0103`, `scene-0553`, `scene-0655`, `scene-0757`, `scene-0796`, `scene-0916`, `scene-1077`, `scene-1094`, `scene-1100` + +## HDF5 Output Format + +The HDF5 file contains per-frame results organized as sequential groups: + +- Groups: `frame_000000`, `frame_000001`, ... +- Attributes: `token`, `scene_token`, `command` +- Datasets: `boxes_3d`, `scores_3d`, `labels_3d`, `track_ids`, `traj`, `traj_scores`, `sdc_boxes_3d`, `sdc_scores_3d`, `planning_traj` + +## Reference + +- [UniAD: Unified Driving](https://github.com/OpenDriveLab/UniAD) diff --git a/autonomous_driving/uniad/collision_optimization.py b/autonomous_driving/uniad/collision_optimization.py new file mode 100644 index 000000000..6a7926405 --- /dev/null +++ b/autonomous_driving/uniad/collision_optimization.py @@ -0,0 +1,118 @@ +# ---------------------------------------------------------------------------------# +# UniAD: Planning-oriented Autonomous Driving (https://arxiv.org/abs/2212.10156) # +# Source code: https://github.com/OpenDriveLab/UniAD # +# Copyright (c) OpenDriveLab. All rights reserved. 
#
+# ---------------------------------------------------------------------------------#
+
+from typing import Sequence, Tuple
+
+import numpy as np
+import numpy.typing as npt
+from casadi import DM, Opti, OptiSol, exp, sumsqr, vertcat
+
+Pose = Tuple[float, float, float]  # (x, y, yaw)
+
+
+class CollisionNonlinearOptimizer:
+    """
+    Optimize the planned trajectory against the predicted occupancy.
+    Solved with direct multiple-shooting.
+    Modified from https://github.com/motional/nuplan-devkit
+    :param trajectory_len: trajectory length
+    :param dt: timestep (sec)
+    """
+
+    def __init__(
+        self, trajectory_len: int, dt: float, sigma, alpha_collision, obj_pixel_pos
+    ):
+        """
+        :param trajectory_len: the length of the trajectory to be optimized.
+        :param dt: the time interval between trajectory points.
+        :param sigma: width of the Gaussian collision penalty.
+        :param alpha_collision: weight of the collision penalty.
+        :param obj_pixel_pos: per-timestep list of occupied (x, y) positions.
+        """
+        self.dt = dt
+        self.trajectory_len = trajectory_len
+        self.current_index = 0
+        self.sigma = sigma
+        self.alpha_collision = alpha_collision
+        self.obj_pixel_pos = obj_pixel_pos
+        # Use an array of dts to stay compatible with situations where dt varies across time steps.
+        self._dts: npt.NDArray[np.float32] = np.asarray([[dt] * trajectory_len])
+        self._init_optimization()
+
+    def _init_optimization(self) -> None:
+        """
+        Initialize related variables and constraints for optimization.
+        """
+        self.nx = 2  # state dim
+
+        self._optimizer = Opti()  # Optimization problem
+        self._create_decision_variables()
+        self._create_parameters()
+        self._set_objective()
+
+        # Set default solver options (quiet)
+        self._optimizer.solver(
+            "ipopt", {"ipopt.print_level": 0, "print_time": 0, "ipopt.sb": "yes"}
+        )
+
+    def set_reference_trajectory(self, reference_trajectory: Sequence[Pose]) -> None:
+        """
+        Set the reference trajectory that the smoother is trying to loosely track.
+        :param reference_trajectory: N x 2 reference, where the second dim is (x, y).
+        """
+        self._optimizer.set_value(self.ref_traj, DM(reference_trajectory).T)
+        self._set_initial_guess(reference_trajectory)
+
+    def solve(self) -> OptiSol:
+        """
+        Solve the optimization problem. Assumes the reference trajectory was already set.
+        :return: Casadi optimization solution
+        """
+        return self._optimizer.solve()
+
+    def _create_decision_variables(self) -> None:
+        """
+        Define the decision variables for the trajectory optimization.
+        """
+        # State trajectory (x, y)
+        self.state = self._optimizer.variable(self.nx, self.trajectory_len)
+        self.position_x = self.state[0, :]
+        self.position_y = self.state[1, :]
+
+    def _create_parameters(self) -> None:
+        """
+        Define the reference trajectory parameter for the trajectory optimization.
+        """
+        self.ref_traj = self._optimizer.parameter(2, self.trajectory_len)  # (x, y)
+
+    def _set_objective(self) -> None:
+        """Set the objective function.
Use care when modifying these weights.""" + # Follow reference, minimize control rates and absolute inputs + alpha_xy = 1.0 + cost_stage = alpha_xy * sumsqr( + self.ref_traj[:2, :] - vertcat(self.position_x, self.position_y) + ) + + alpha_collision = self.alpha_collision + + cost_collision = 0 + normalizer = 1 / (2.507 * self.sigma) + # TODO: vectorize this + for t in range(len(self.obj_pixel_pos)): + x, y = self.position_x[t], self.position_y[t] + for i in range(len(self.obj_pixel_pos[t])): + col_x, col_y = self.obj_pixel_pos[t][i] + cost_collision += ( + alpha_collision + * normalizer + * exp(-((x - col_x) ** 2 + (y - col_y) ** 2) / 2 / self.sigma**2) + ) + self._optimizer.minimize(cost_stage + cost_collision) + + def _set_initial_guess(self, reference_trajectory: Sequence[Pose]) -> None: + """Set a warm-start for the solver based on the reference trajectory.""" + # Initialize state guess based on reference + self._optimizer.set_initial( + self.state[:2, :], DM(reference_trajectory).T + ) # (x, y, yaw) diff --git a/autonomous_driving/uniad/lidar_box3d.py b/autonomous_driving/uniad/lidar_box3d.py new file mode 100644 index 000000000..4c3a37e0c --- /dev/null +++ b/autonomous_driving/uniad/lidar_box3d.py @@ -0,0 +1,153 @@ +# Copyright (c) OpenMMLab. All rights reserved. +from abc import abstractmethod + +import numpy as np + + +class BaseInstance3DBoxes(object): + """Base class for 3D Boxes. + + Note: + The box is bottom centered, i.e. the relative position of origin in + the box is (0.5, 0.5, 0). + + Args: + tensor (torch.Tensor | np.ndarray | list): a N x box_dim matrix. + box_dim (int): Number of the dimension of a box. + Each row is (x, y, z, x_size, y_size, z_size, yaw). + Defaults to 7. + with_yaw (bool): Whether the box is with yaw rotation. + If False, the value of yaw will be set to 0 as minmax boxes. + Defaults to True. + origin (tuple[float], optional): Relative position of the box origin. + Defaults to (0.5, 0.5, 0). 
Boxes will be converted to the
+            (0.5, 0.5, 0) mode.
+
+    Attributes:
+        tensor (np.ndarray): Float matrix of N x box_dim.
+        box_dim (int): Integer indicating the dimension of a box.
+            Each row is (x, y, z, x_size, y_size, z_size, yaw, ...).
+        with_yaw (bool): Whether the boxes have a yaw angle. If False, yaw is
+            fixed to 0 (minmax boxes).
+    """
+
+    def __init__(self, tensor, box_dim=7, with_yaw=True, origin=(0.5, 0.5, 0)):
+        tensor = np.asarray(tensor, dtype=np.float32)
+        if tensor.size == 0:
+            tensor = tensor.reshape((0, box_dim)).astype(np.float32)
+        # assert tensor.ndim == 2 and tensor.shape[-1] == box_dim, tensor.shape
+
+        if tensor.shape[-1] == 6:
+            # If the dimension of boxes is 6, we expand box_dim by padding
+            # 0 as a fake yaw and set with_yaw to False.
+            assert box_dim == 6
+            fake_rot = np.zeros((tensor.shape[0], 1), dtype=tensor.dtype)
+            tensor = np.concatenate((tensor, fake_rot), axis=-1)
+            self.box_dim = box_dim + 1
+            self.with_yaw = False
+        else:
+            self.box_dim = box_dim
+            self.with_yaw = with_yaw
+        self.tensor = tensor.copy()
+
+        if origin != (0.5, 0.5, 0):
+            dst = np.array((0.5, 0.5, 0))
+            src = np.array(origin)
+            self.tensor[:, :3] += self.tensor[:, 3:6] * (dst - src)
+
+    @property
+    def volume(self):
+        """np.ndarray: A vector with the volume of each box."""
+        return self.tensor[:, 3] * self.tensor[:, 4] * self.tensor[:, 5]
+
+    @property
+    def dims(self):
+        """np.ndarray: Size dimensions of each box in shape (N, 3)."""
+        return self.tensor[:, 3:6]
+
+    @property
+    def yaw(self):
+        """np.ndarray: A vector with the yaw of each box in shape (N, )."""
+        return self.tensor[:, 6]
+
+    @property
+    def bottom_center(self):
+        """np.ndarray: A tensor with the bottom center of each box in shape (N, 3)."""
+        return self.tensor[:, :3]
+
+    @abstractmethod
+    def rotate(self, angle, points=None):
+        """Rotate boxes, and optionally points, with the given angle or
+        rotation matrix.
+
+        Args:
+            angle (float | np.ndarray): Rotation angle or rotation matrix.
+            points (np.ndarray, optional): Points to rotate. Defaults to None.
+        """
+        pass
+
+    def translate(self, trans_vector):
+        """Translate boxes with the given translation vector.
+
+        Args:
+            trans_vector (np.ndarray): Translation vector of size (1, 3).
+        """
+        if not isinstance(trans_vector, np.ndarray):
+            trans_vector = np.array(trans_vector)
+        self.tensor[:, :3] += trans_vector
+
+    def clone(self):
+        """Clone the Boxes.
+
+        Returns:
+            :obj:`BaseInstance3DBoxes`: Box object with the same properties
+                as self.
+        """
+        original_type = type(self)
+        return original_type(
+            self.tensor.copy(), box_dim=self.box_dim, with_yaw=self.with_yaw
+        )
+
+
+class LiDARInstance3DBoxes(BaseInstance3DBoxes):
+    """3D boxes of instances in LIDAR coordinates.
+
+    Coordinates in LiDAR:
+
+    .. code-block:: none
+
+                        up z    x front (yaw=0)
+                           ^   ^
+                           |  /
+                           | /
+        (yaw=0.5*pi) left y <------ 0
+
+    The relative coordinate of the bottom center in a LiDAR box is
+    (0.5, 0.5, 0), and the yaw is around the z axis, thus the rotation
+    axis is 2. The yaw is 0 at the positive direction of the x axis, and
+    increases from the positive direction of x to the positive direction of y.
+
+    A refactor is ongoing to make the three coordinate systems
+    easier to understand and convert between each other.
+
+    Attributes:
+        tensor (np.ndarray): Float matrix of N x box_dim.
+        box_dim (int): Integer indicating the dimension of a box.
+            Each row is (x, y, z, x_size, y_size, z_size, yaw, ...).
+        with_yaw (bool): Whether the boxes have a yaw angle. If False, yaw is
+            fixed to 0 (minmax boxes).
+
+ """ + + YAW_AXIS = 2 + + @property + def gravity_center(self): + """np.array: A tensor with center of each box in shape (N, 3).""" + bottom_center = self.bottom_center + gravity_center = np.zeros_like(bottom_center) + gravity_center[:, :2] = bottom_center[:, :2] + gravity_center[:, 2] = bottom_center[:, 2] + self.tensor[:, 5] * 0.5 + return gravity_center diff --git a/autonomous_driving/uniad/nuscenes_dataset.py b/autonomous_driving/uniad/nuscenes_dataset.py new file mode 100644 index 000000000..3cf580dfb --- /dev/null +++ b/autonomous_driving/uniad/nuscenes_dataset.py @@ -0,0 +1,719 @@ +import os +import pickle + +import cv2 +import numpy as np +import tqdm +from nuscenes import NuScenes +from nuscenes.can_bus.can_bus_api import NuScenesCanBus +from nuscenes.eval.common.utils import Quaternion, quaternion_yaw +from nuscenes.prediction import PredictHelper +from nuscenes.utils import splits +from torch.utils.data import Dataset + +# isort: off +from lidar_box3d import LiDARInstance3DBoxes + + +def obtain_sensor2top( + nusc, sensor_token, l2e_t, l2e_r_mat, e2g_t, e2g_r_mat, sensor_type="lidar" +): + """Obtain the info with RT matric from general sensor to Top LiDAR. + + Args: + nusc (class): Dataset class in the nuScenes dataset. + sensor_token (str): Sample data token corresponding to the + specific sensor type. + l2e_t (np.ndarray): Translation from lidar to ego in shape (1, 3). + l2e_r_mat (np.ndarray): Rotation matrix from lidar to ego + in shape (3, 3). + e2g_t (np.ndarray): Translation from ego to global in shape (1, 3). + e2g_r_mat (np.ndarray): Rotation matrix from ego to global + in shape (3, 3). + sensor_type (str): Sensor to calibrate. Default: 'lidar'. + + Returns: + sweep (dict): Sweep information after transformation. 
+ """ + sd_rec = nusc.get("sample_data", sensor_token) + cs_record = nusc.get("calibrated_sensor", sd_rec["calibrated_sensor_token"]) + pose_record = nusc.get("ego_pose", sd_rec["ego_pose_token"]) + data_path = str(nusc.get_sample_data_path(sd_rec["token"])) + if nusc.dataroot in data_path: + data_path = data_path.split(f"{nusc.dataroot}")[-1] # relative path + sweep = { + "data_path": data_path, + "type": sensor_type, + "sample_data_token": sd_rec["token"], + "sensor2ego_translation": cs_record["translation"], + "sensor2ego_rotation": cs_record["rotation"], + "ego2global_translation": pose_record["translation"], + "ego2global_rotation": pose_record["rotation"], + "timestamp": sd_rec["timestamp"], + } + + l2e_r_s = sweep["sensor2ego_rotation"] + l2e_t_s = sweep["sensor2ego_translation"] + e2g_r_s = sweep["ego2global_rotation"] + e2g_t_s = sweep["ego2global_translation"] + + # obtain the RT from sensor to Top LiDAR + # sweep->ego->global->ego'->lidar + l2e_r_s_mat = Quaternion(l2e_r_s).rotation_matrix + e2g_r_s_mat = Quaternion(e2g_r_s).rotation_matrix + R = (l2e_r_s_mat.T @ e2g_r_s_mat.T) @ ( + np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T + ) + T = (l2e_t_s @ e2g_r_s_mat.T + e2g_t_s) @ ( + np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T + ) + T -= ( + e2g_t @ (np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T) + + l2e_t @ np.linalg.inv(l2e_r_mat).T + ) + sweep["sensor2lidar_rotation"] = R.T # points @ R.T + T + sweep["sensor2lidar_translation"] = T + return sweep + + +def get_can_bus_info(nusc, nusc_can_bus, sample): + scene_name = nusc.get("scene", sample["scene_token"])["name"] + sample_timestamp = sample["timestamp"] + try: + pose_list = nusc_can_bus.get_messages(scene_name, "pose") + except: + return np.zeros(18) # server scenes do not have can bus information. 
+ can_bus = [] + # during each scene, the first timestamp of can_bus may be large than the first sample's timestamp + last_pose = pose_list[0] + for i, pose in enumerate(pose_list): + if pose["utime"] > sample_timestamp: + break + last_pose = pose + _ = last_pose.pop("utime") # useless + pos = last_pose.pop("pos") + rotation = last_pose.pop("orientation") + can_bus.extend(pos) + can_bus.extend(rotation) + for key in last_pose.keys(): + can_bus.extend(pose[key]) # 16 elements + can_bus.extend([0.0, 0.0]) + return np.array(can_bus) + + +def get_future_traj_info(nusc, sample, predict_steps=16): + sample_token = sample["token"] + ann_tokens = np.array(sample["anns"]) + sd_rec = nusc.get("sample", sample_token) + fut_traj_all = [] + fut_traj_valid_mask_all = [] + _, boxes, _ = nusc.get_sample_data( + sd_rec["data"]["LIDAR_TOP"], selected_anntokens=ann_tokens + ) + predict_helper = PredictHelper(nusc) + for i, ann_token in enumerate(ann_tokens): + box = boxes[i] + instance_token = nusc.get("sample_annotation", ann_token)["instance_token"] + fut_traj_local = predict_helper.get_future_for_agent( + instance_token, + sample_token, + seconds=predict_steps // 2, + in_agent_frame=True, + ) + + fut_traj = np.zeros((predict_steps, 2)) + fut_traj_valid_mask = np.zeros((predict_steps, 2)) + if fut_traj_local.shape[0] > 0: + fut_traj_scence_centric = fut_traj_local + fut_traj[: fut_traj_scence_centric.shape[0], :] = fut_traj_scence_centric + fut_traj_valid_mask[: fut_traj_scence_centric.shape[0], :] = 1 + fut_traj_all.append(fut_traj) + fut_traj_valid_mask_all.append(fut_traj_valid_mask) + if len(ann_tokens) > 0: + fut_traj_all = np.stack(fut_traj_all, axis=0) + fut_traj_valid_mask_all = np.stack(fut_traj_valid_mask_all, axis=0) + else: + fut_traj_all = np.zeros((0, predict_steps, 2)) + fut_traj_valid_mask_all = np.zeros((0, predict_steps, 2)) + return fut_traj_all, fut_traj_valid_mask_all + + +class NuScenesTraj: + def __init__( + self, + nusc, + ): + self.nusc = nusc + 
self.prepare_sdc_vel_info() + self.planning_steps = 6 + self.with_velocity = True + + def get_vel_transform_mats(self, sample): + sd_rec = self.nusc.get("sample_data", sample["data"]["LIDAR_TOP"]) + cs_record = self.nusc.get( + "calibrated_sensor", sd_rec["calibrated_sensor_token"] + ) + pose_record = self.nusc.get("ego_pose", sd_rec["ego_pose_token"]) + + l2e_r = cs_record["rotation"] + l2e_t = cs_record["translation"] + e2g_r = pose_record["rotation"] + e2g_t = pose_record["translation"] + l2e_r_mat = Quaternion(l2e_r).rotation_matrix + e2g_r_mat = Quaternion(e2g_r).rotation_matrix + + return l2e_r_mat, e2g_r_mat + + def get_vel_and_time(self, sample): + lidar_token = sample["data"]["LIDAR_TOP"] + lidar_top = self.nusc.get("sample_data", lidar_token) + pose = self.nusc.get("ego_pose", lidar_top["ego_pose_token"]) + xyz = pose["translation"] + timestamp = sample["timestamp"] + return xyz, timestamp + + def prepare_sdc_vel_info(self): + # generate sdc velocity info for all samples + # Note that these velocity values are converted from + # global frame to lidar frame + # as aligned with bbox gts + + self.sdc_vel_info = {} + for scene in self.nusc.scene: + scene_token = scene["token"] + + # we cannot infer vel for the last sample, therefore we skip it + last_sample_token = scene["last_sample_token"] + sample_token = scene["first_sample_token"] + sample = self.nusc.get("sample", sample_token) + xyz, time = self.get_vel_and_time(sample) + while sample["token"] != last_sample_token: + next_sample_token = sample["next"] + next_sample = self.nusc.get("sample", next_sample_token) + next_xyz, next_time = self.get_vel_and_time(next_sample) + dc = np.array(next_xyz) - np.array(xyz) + dt = (next_time - time) / 1e6 + vel = dc / dt + + # global frame to lidar frame + l2e_r_mat, e2g_r_mat = self.get_vel_transform_mats(sample) + vel = vel @ np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T + vel = vel[:2] + + self.sdc_vel_info[sample["token"]] = vel + xyz, time = next_xyz, 
next_time
+                sample = next_sample
+
+            # set last sample's vel equal to the second-last sample's
+            last_sample = self.nusc.get("sample", last_sample_token)
+            second_last_sample_token = last_sample["prev"]
+            self.sdc_vel_info[last_sample_token] = self.sdc_vel_info[
+                second_last_sample_token
+            ]
+
+    def generate_sdc_info(self, sdc_vel):
+        """Generate SDC (Self-Driving Car) bounding box info.
+
+        Args:
+            sdc_vel: SDC velocity [vx, vy]
+
+        Returns:
+            LiDARInstance3DBoxes: SDC bounding box
+        """
+
+        # sdc dim from https://forum.nuscenes.org/t/dimensions-of-the-ego-vehicle-used-to-gather-data/550
+        # TODO(box3d): we have changed yaw to mmdet3d 1.0.0rc6 format, wlh->lwh -pi->0.5pi
+        pseudo_sdc_bbox = np.array([0.0, 0.0, 0.0, 4.08, 1.73, 1.56, 0.5 * np.pi])
+        if self.with_velocity:
+            pseudo_sdc_bbox = np.concatenate([pseudo_sdc_bbox, sdc_vel], axis=-1)
+        gt_bboxes_3d = np.array([pseudo_sdc_bbox]).astype(np.float32)
+
+        # the nuscenes box center is [0.5, 0.5, 0.5], we change it to be
+        # the same as KITTI (0.5, 0.5, 0)
+        gt_bboxes_3d = LiDARInstance3DBoxes(
+            gt_bboxes_3d, box_dim=gt_bboxes_3d.shape[-1], origin=(0.5, 0.5, 0.5)
+        )
+
+        return gt_bboxes_3d
+
+    def get_l2g_transform(self, sample):
+        sd_rec = self.nusc.get("sample_data", sample["data"]["LIDAR_TOP"])
+        cs_record = self.nusc.get(
+            "calibrated_sensor", sd_rec["calibrated_sensor_token"]
+        )
+        pose_record = self.nusc.get("ego_pose", sd_rec["ego_pose_token"])
+
+        l2e_r = cs_record["rotation"]
+        l2e_t = np.array(cs_record["translation"])
+        e2g_r = pose_record["rotation"]
+        e2g_t = np.array(pose_record["translation"])
+        l2e_r_mat = Quaternion(l2e_r).rotation_matrix
+        e2g_r_mat = Quaternion(e2g_r).rotation_matrix
+
+        return l2e_r_mat, l2e_t, e2g_r_mat, e2g_t
+
+    def get_sdc_planning_label(self, sample_token):
+        sd_rec = self.nusc.get("sample", sample_token)
+        l2e_r_mat_init, l2e_t_init, e2g_r_mat_init, e2g_t_init = self.get_l2g_transform(
+            sd_rec
+        )
+
+        planning = []
+        for _ in range(self.planning_steps):
+            next_annotation_token
= sd_rec["next"] + if next_annotation_token == "": + break + sd_rec = self.nusc.get("sample", next_annotation_token) + l2e_r_mat_curr, l2e_t_curr, e2g_r_mat_curr, e2g_t_curr = ( + self.get_l2g_transform(sd_rec) + ) # (lidar to global at current frame) + + # bbox of sdc under current lidar frame + next_bbox3d = self.generate_sdc_info( + self.sdc_vel_info[next_annotation_token] + ) + + # to bbox under curr ego frame + next_bbox3d.rotate(l2e_r_mat_curr.T) + next_bbox3d.translate(l2e_t_curr) + + # to bbox under world frame + next_bbox3d.rotate(e2g_r_mat_curr.T) + next_bbox3d.translate(e2g_t_curr) + + # to bbox under initial ego frame, first inverse translate, then inverse rotate + next_bbox3d.translate(-e2g_t_init) + m1 = np.linalg.inv(e2g_r_mat_init) + next_bbox3d.rotate(m1.T) + + # to bbox under curr ego frame, first inverse translate, then inverse rotate + next_bbox3d.translate(-l2e_t_init) + m2 = np.linalg.inv(l2e_r_mat_init) + next_bbox3d.rotate(m2.T) + + planning.append(next_bbox3d) + + planning_all = np.zeros((1, self.planning_steps, 3)) + planning_mask_all = np.zeros((1, self.planning_steps, 2)) + n_valid_timestep = len(planning) + if n_valid_timestep > 0: + planning = [p.tensor.squeeze(0) for p in planning] + planning = np.stack(planning, axis=0) # (valid_t, 9) + planning = planning[:, [0, 1, 6]] # (x, y, yaw) + planning_all[:, :n_valid_timestep, :] = planning + planning_mask_all[:, :n_valid_timestep, :] = 1 + + mask = planning_mask_all[0].any(axis=1) + if mask.sum() == 0: + command = 2 #'FORWARD' + elif planning_all[0, mask][-1][0] >= 2: + command = 0 #'RIGHT' + elif planning_all[0, mask][-1][0] <= -2: + command = 1 #'LEFT' + else: + command = 2 #'FORWARD' + + return planning_all, planning_mask_all, command + + +class NuScenesDataset(Dataset): + NameMapping = { + "movable_object.barrier": "barrier", + "vehicle.bicycle": "bicycle", + "vehicle.bus.bendy": "bus", + "vehicle.bus.rigid": "bus", + "vehicle.car": "car", + "vehicle.construction": 
"construction_vehicle", + "vehicle.motorcycle": "motorcycle", + "human.pedestrian.adult": "pedestrian", + "human.pedestrian.child": "pedestrian", + "human.pedestrian.construction_worker": "pedestrian", + "human.pedestrian.police_officer": "pedestrian", + "movable_object.trafficcone": "traffic_cone", + "vehicle.trailer": "trailer", + "vehicle.truck": "truck", + } + + CLASSES = ( + "car", + "truck", + "construction_vehicle", + "bus", + "trailer", + "barrier", + "motorcycle", + "bicycle", + "pedestrian", + "traffic_cone", + ) + + def __init__( + self, + data_root=None, + version=None, + ann_file=None, + test_scenes=None, + ): + ### Custom3DDataset + + if data_root is None: + self.data_root = "./" + elif data_root.endswith("/"): + self.data_root = data_root + else: + self.data_root = data_root + "/" + + if ann_file is None: + self.version = version or "v1.0-trainval" + self.data_infos = self.nuscenes_data_prep( + self.version, self.data_root, test_scenes=test_scenes + ) + else: + self.data_infos = self.load_annotations(ann_file) + self.nusc = NuScenes( + version=self.version, dataroot=self.data_root, verbose=True + ) + + # filter scenes if test_scenes is provided + if test_scenes is not None: + val_scenes = set() + for s in self.nusc.scene: + if s["name"] not in test_scenes: + continue + val_scenes.add(s["token"]) + + self.data_infos = [ + s for s in self.data_infos if s["scene_token"] in val_scenes + ] + + self.use_camera = True + + ### NuScenesE2EDataset + self.traj_api = NuScenesTraj(self.nusc) + + def __len__(self): + return len(self.data_infos) + + def nuscenes_data_prep(self, version, data_root, test_scenes=None, max_sweeps=10): + self.nusc = NuScenes(version=version, dataroot=data_root, verbose=True) + nusc_can_bus = NuScenesCanBus(dataroot=data_root) + + val_scenes = set() + for s in self.nusc.scene: + if test_scenes and s["name"] not in test_scenes: + continue + val_scenes.add(s["token"]) + + samples = [x for x in self.nusc.sample if x["scene_token"] in 
val_scenes] + + val_nusc_infos = [] + frame_idx = 0 + for sample in tqdm.tqdm(samples, desc="Preparing data"): + lidar_token = sample["data"]["LIDAR_TOP"] + sd_rec = self.nusc.get("sample_data", sample["data"]["LIDAR_TOP"]) + cs_record = self.nusc.get( + "calibrated_sensor", sd_rec["calibrated_sensor_token"] + ) + pose_record = self.nusc.get("ego_pose", sd_rec["ego_pose_token"]) + lidar_path, boxes, _ = self.nusc.get_sample_data(lidar_token) + + can_bus = get_can_bus_info(self.nusc, nusc_can_bus, sample) + info = { + "token": sample["token"], + "can_bus": can_bus, + "sweeps": [], + "cams": dict(), + "scene_token": sample["scene_token"], # temporal related info + "lidar2ego_translation": cs_record["translation"], + "lidar2ego_rotation": cs_record["rotation"], + "ego2global_translation": pose_record["translation"], + "ego2global_rotation": pose_record["rotation"], + "timestamp": sample["timestamp"], + } + + if sample["next"] == "": + frame_idx = 0 + else: + frame_idx += 1 + + l2e_r = info["lidar2ego_rotation"] + l2e_t = info["lidar2ego_translation"] + e2g_r = info["ego2global_rotation"] + e2g_t = info["ego2global_translation"] + l2e_r_mat = Quaternion(l2e_r).rotation_matrix + e2g_r_mat = Quaternion(e2g_r).rotation_matrix + + # obtain 6 image's information per frame + camera_types = [ + "CAM_FRONT", + "CAM_FRONT_RIGHT", + "CAM_FRONT_LEFT", + "CAM_BACK", + "CAM_BACK_LEFT", + "CAM_BACK_RIGHT", + ] + for cam in camera_types: + cam_token = sample["data"][cam] + cam_path, _, cam_intrinsic = self.nusc.get_sample_data(cam_token) + cam_info = obtain_sensor2top( + self.nusc, cam_token, l2e_t, l2e_r_mat, e2g_t, e2g_r_mat, cam + ) + cam_info.update(cam_intrinsic=cam_intrinsic) + info["cams"].update({cam: cam_info}) + + # obtain sweeps for a single key-frame + sd_rec = self.nusc.get("sample_data", sample["data"]["LIDAR_TOP"]) + sweeps = [] + while len(sweeps) < max_sweeps: + if not sd_rec["prev"] == "": + sweep = obtain_sensor2top( + self.nusc, + sd_rec["prev"], + l2e_t, + 
l2e_r_mat, + e2g_t, + e2g_r_mat, + "lidar", + ) + sweeps.append(sweep) + sd_rec = self.nusc.get("sample_data", sd_rec["prev"]) + else: + break + info["sweeps"] = sweeps + # obtain annotation + annotations = [ + self.nusc.get("sample_annotation", token) for token in sample["anns"] + ] + locs = np.array([b.center for b in boxes]).reshape(-1, 3) + dims = np.array([b.wlh for b in boxes]).reshape(-1, 3) + rots = np.array([b.orientation.yaw_pitch_roll[0] for b in boxes]).reshape( + -1, 1 + ) + velocity = np.array( + [self.nusc.box_velocity(token)[:2] for token in sample["anns"]] + ) + valid_flag = np.array( + [ + (anno["num_lidar_pts"] + anno["num_radar_pts"]) > 0 + for anno in annotations + ], + dtype=bool, + ).reshape(-1) + instance_inds = [ + self.nusc.getind("instance", ann["instance_token"]) + for ann in annotations + ] + future_traj_all, future_traj_valid_mask_all = get_future_traj_info( + self.nusc, sample + ) + instance_tokens = [ + ann["instance_token"] for ann in annotations + ] # dtype(' 0.1 + traj_points = view_points(total_xy.T, camera_intrinsic, normalize=True)[ + :2, : + ] + traj_points = traj_points[:2, in_range_mask] + if box.is_sdc: + if render_sdc: + self.axes[i // 3, i % 3].scatter( + traj_points[0], traj_points[1], color=(1, 0.5, 0), s=150 + ) + else: + continue + else: + tr_id = tr_id_list[j] + if tr_id is None: + tr_id = 0 + c = color_mapping[tr_id % len(color_mapping)] + self.axes[i // 3, i % 3].scatter( + traj_points[0], traj_points[1], color=c, s=15 + ) + self.axes[i // 3, i % 3].set_xlim(0, imsize[0]) + self.axes[i // 3, i % 3].set_ylim(imsize[1], 0) + + def get_image_info(self, sample_data_token, nusc): + """Retrieve image information.""" + sd_record = nusc.get("sample_data", sample_data_token) + cs_record = nusc.get("calibrated_sensor", sd_record["calibrated_sensor_token"]) + sensor_record = nusc.get("sensor", cs_record["sensor_token"]) + pose_record = nusc.get("ego_pose", sd_record["ego_pose_token"]) + + data_path = 
nusc.get_sample_data_path(sample_data_token) + + if sensor_record["modality"] == "camera": + cam_intrinsic = np.array(cs_record["camera_intrinsic"]) + imsize = (sd_record["width"], sd_record["height"]) + else: + cam_intrinsic = None + imsize = None + return data_path, cs_record, pose_record, cam_intrinsic, imsize + + +class AgentPredictionData: + """ + Agent data class, includes bbox, traj, and occflow + """ + + def __init__( + self, + pred_score, + pred_label, + pred_center, + pred_dim, + pred_yaw, + pred_vel, + pred_traj, + pred_traj_score, + pred_track_id=None, + pred_occ_map=None, + is_sdc=False, + past_pred_traj=None, + command=None, + attn_mask=None, + ): + self.pred_score = pred_score + self.pred_label = pred_label + self.pred_center = pred_center + self.pred_dim = pred_dim + # self.pred_yaw = pred_yaw + self.pred_yaw = -pred_yaw - np.pi / 2 + self.pred_vel = pred_vel + self.pred_traj = pred_traj + self.pred_traj_score = pred_traj_score + self.pred_track_id = pred_track_id + self.pred_occ_map = pred_occ_map + if self.pred_traj is not None: + if isinstance(self.pred_traj_score, int): + self.pred_traj_max = self.pred_traj + else: + self.pred_traj_max = self.pred_traj[self.pred_traj_score.argmax()] + else: + self.pred_traj_max = None + self.nusc_box = Box( + center=pred_center, + size=pred_dim, + orientation=Quaternion(axis=[0, 0, 1], radians=self.pred_yaw), + label=pred_label, + score=pred_score, + ) + if is_sdc: + self.pred_center = [0, 0, -1.2 + 1.56 / 2] + self.is_sdc = is_sdc + self.past_pred_traj = past_pred_traj + self.command = command + self.attn_mask = attn_mask + + +class Visualizer: + def __init__( + self, bbox_results, nuscenes=None, dataroot="data/nuscenes", version="v1.0-mini" + ): + if nuscenes is None: + self.nusc = NuScenes(version=version, dataroot=dataroot, verbose=True) + else: + self.nusc = nuscenes + + self.with_planning = True + self.show_command = True + self.show_sdc_car = True + self.show_legend = True + self.with_pred_traj = True + 
self.with_pred_box = True + show_gt_boxes = False + + self.token_set = set() + self.predictions = self._parse_predictions(bbox_results) + self.bev_render = BEVRender(show_gt_boxes=show_gt_boxes) + self.cam_render = CameraRender(show_gt_boxes=show_gt_boxes) + + def _parse_predictions(self, bbox_results): + outputs = bbox_results + prediction_dict = dict() + for k in range(len(outputs)): + token = outputs[k]["token"] + self.token_set.add(token) + + # detection + bboxes: LiDARInstance3DBoxes = outputs[k]["boxes_3d"] + scores = outputs[k]["scores_3d"] + labels = outputs[k]["labels_3d"] + + track_scores = scores + track_labels = labels + track_boxes = bboxes.tensor + + track_centers = bboxes.gravity_center + track_dims = bboxes.dims + track_yaw = bboxes.yaw + + track_ids = outputs[k]["track_ids"] + + # speed + track_velocity = bboxes.tensor[:, -2:] + + # trajectories + trajs = outputs[k]["traj"] + traj_scores = outputs[k]["traj_scores"] + + predicted_agent_list = [] + for i in range(track_scores.shape[0]): + if track_scores[i] < 0.25: + continue + + if i < len(track_ids): + track_id = track_ids[i] + else: + track_id = 0 + + predicted_agent_list.append( + AgentPredictionData( + track_scores[i], + track_labels[i], + track_centers[i], + track_dims[i], + track_yaw[i], + track_velocity[i], + trajs[i], + traj_scores[i], + pred_track_id=track_id, + pred_occ_map=None, + past_pred_traj=None, + ) + ) + + # with planning + if self.with_planning: + # detection + bboxes: LiDARInstance3DBoxes = outputs[k]["sdc_boxes_3d"] + scores = outputs[k]["sdc_scores_3d"] + labels = 0 + + track_scores = scores + track_labels = labels + + track_centers = bboxes.gravity_center + track_dims = bboxes.dims + track_yaw = bboxes.yaw + track_velocity = bboxes.tensor[:, -2:] + + command = outputs[k]["command"] + planning_agent = AgentPredictionData( + track_scores[0], + track_labels, + track_centers[0], + track_dims[0], + track_yaw[0], + track_velocity[0], + outputs[k]["planning_traj"][0], + 1, + 
pred_track_id=-1, + pred_occ_map=None, + past_pred_traj=None, + is_sdc=True, + command=command, + ) + predicted_agent_list.append(planning_agent) + else: + planning_agent = None + prediction_dict[token] = dict( + predicted_agent_list=predicted_agent_list, + predicted_map_seg=None, + predicted_planning=planning_agent, + ) + + return prediction_dict + + def visualize_bev(self, sample_token): + self.bev_render.reset_canvas(dx=1, dy=1) + self.bev_render.set_plot_cfg() + + if self.with_pred_box: + self.bev_render.render_pred_box_data( + self.predictions[sample_token]["predicted_agent_list"] + ) + if self.with_pred_traj: + self.bev_render.render_pred_traj( + self.predictions[sample_token]["predicted_agent_list"] + ) + if self.with_planning: + self.bev_render.render_pred_box_data( + [self.predictions[sample_token]["predicted_planning"]] + ) + self.bev_render.render_planning_data( + self.predictions[sample_token]["predicted_planning"], + show_command=self.show_command, + ) + if self.show_sdc_car: + self.bev_render.render_sdc_car() + if self.show_legend: + self.bev_render.render_legend() + + img = self.bev_render.save_fig() + return img + + def visualize_cam(self, sample_token): + self.cam_render.reset_canvas(dx=2, dy=3, tight_layout=True) + self.cam_render.render_image_data(sample_token, self.nusc) + self.cam_render.render_pred_track_bbox( + self.predictions[sample_token]["predicted_agent_list"], + sample_token, + self.nusc, + ) + self.cam_render.render_pred_traj( + self.predictions[sample_token]["predicted_agent_list"], + sample_token, + self.nusc, + render_sdc=self.with_planning, + ) + + img = self.cam_render.save_fig() + return img diff --git a/autonomous_driving/uniad/requirements.txt b/autonomous_driving/uniad/requirements.txt new file mode 100644 index 000000000..8af343025 --- /dev/null +++ b/autonomous_driving/uniad/requirements.txt @@ -0,0 +1,4 @@ +h5py +casadi==3.6.7 +nuscenes-devkit==1.1.11 +pyquaternion==0.9.9 \ No newline at end of file diff --git 
a/autonomous_driving/uniad/resources/legend.png b/autonomous_driving/uniad/resources/legend.png new file mode 100644 index 000000000..c34e3b37b Binary files /dev/null and b/autonomous_driving/uniad/resources/legend.png differ diff --git a/autonomous_driving/uniad/resources/sdc_car.png b/autonomous_driving/uniad/resources/sdc_car.png new file mode 100644 index 000000000..df19cbc33 Binary files /dev/null and b/autonomous_driving/uniad/resources/sdc_car.png differ diff --git a/autonomous_driving/uniad/track_instance.py b/autonomous_driving/uniad/track_instance.py new file mode 100644 index 000000000..05134306e --- /dev/null +++ b/autonomous_driving/uniad/track_instance.py @@ -0,0 +1,108 @@ +import itertools +from typing import Any, Dict, List, Tuple, Union + +import numpy as np + + +class Instances: + def __init__(self, image_size: Tuple[int, int], **kwargs: Any): + """ + Args: + image_size (height, width): the spatial size of the image. + kwargs: fields to add to this `Instances`. + """ + self._image_size = image_size + self._fields: Dict[str, Any] = {} + for k, v in kwargs.items(): + self.set(k, v) + + @property + def image_size(self) -> Tuple[int, int]: + """ + Returns: + tuple: height, width + """ + return self._image_size + + def __setattr__(self, name: str, val: Any) -> None: + if name.startswith("_"): + super().__setattr__(name, val) + else: + self.set(name, val) + + def __getattr__(self, name: str) -> Any: + return self._fields[name] + + def set(self, name: str, value: Any) -> None: + self._fields[name] = value + + def has(self, name: str) -> bool: + return name in self._fields + + def remove(self, name: str) -> None: + del self._fields[name] + + def get(self, name: str) -> Any: + return self._fields[name] + + def __getitem__(self, item: Union[int, slice, np.ndarray]) -> "Instances": + """ + Args: + item: an index-like object and will be used to index all the fields. + Returns: + If `item` is a string, return the data in the corresponding field. 
+ Otherwise, returns an `Instances` where all fields are indexed by `item`. + """ + if type(item) == int: + if item >= len(self) or item < -len(self): + raise IndexError("Instances index out of range!") + else: + item = slice(item, None, len(self)) + + ret = Instances(self._image_size) + for k, v in self._fields.items(): + # if index by numpy boolean array + if k == "kalman_models" and isinstance(item, np.ndarray): + ret_list = [] + for i, if_true in enumerate(item): + if if_true: + ret_list.append(self.kalman_models[i]) + ret.set(k, ret_list) + else: + ret.set(k, v[item]) + return ret + + def __len__(self) -> int: + for v in self._fields.values(): + # use __len__ because len() has to be int and is not friendly to tracing + return v.__len__() + raise NotImplementedError("Empty Instances does not support __len__!") + + @staticmethod + def cat(instance_lists: List["Instances"]) -> "Instances": + """ + Args: + instance_lists (list[Instances]) + Returns: + Instances + """ + if len(instance_lists) == 1: + return instance_lists[0] + + image_size = instance_lists[0].image_size + ret = Instances(image_size) + for k in instance_lists[0]._fields.keys(): + values = [i.get(k) for i in instance_lists] + v0 = values[0] + if isinstance(v0, np.ndarray): + values = np.concatenate(values, axis=0) + elif isinstance(v0, list): + values = list(itertools.chain(*values)) + elif hasattr(type(v0), "cat"): + values = type(v0).cat(values) + else: + raise ValueError( + "Unsupported type {} for concatenation".format(type(v0)) + ) + ret.set(k, values) + return ret diff --git a/autonomous_driving/uniad/tracker.py b/autonomous_driving/uniad/tracker.py new file mode 100644 index 000000000..92fa59c22 --- /dev/null +++ b/autonomous_driving/uniad/tracker.py @@ -0,0 +1,32 @@ +from track_instance import Instances + + +class RuntimeTrackerBase(object): + def __init__(self, score_thresh=0.5, filter_score_thresh=0.4, miss_tolerance=5): + self.score_thresh = score_thresh + self.filter_score_thresh = 
filter_score_thresh
+        self.miss_tolerance = miss_tolerance
+        self.max_obj_id = 0
+
+    def clear(self):
+        self.max_obj_id = 0
+
+    def update(self, track_instances: Instances):
+        track_instances.disappear_time[track_instances.scores >= self.score_thresh] = 0
+        for i in range(len(track_instances)):
+            if (
+                track_instances.obj_idxes[i] == -1
+                and track_instances.scores[i] >= self.score_thresh
+            ):
+                # new track
+                track_instances.obj_idxes[i] = self.max_obj_id
+                self.max_obj_id += 1
+            elif (
+                track_instances.obj_idxes[i] >= 0
+                and track_instances.scores[i] < self.filter_score_thresh
+            ):
+                # sleep time ++
+                track_instances.disappear_time[i] += 1
+                if track_instances.disappear_time[i] >= self.miss_tolerance:
+                    # mark dead tracklets: set the obj_id to -1
+                    track_instances.obj_idxes[i] = -1
diff --git a/autonomous_driving/uniad/uniad.py b/autonomous_driving/uniad/uniad.py
new file mode 100644
index 000000000..11c004186
--- /dev/null
+++ b/autonomous_driving/uniad/uniad.py
@@ -0,0 +1,1527 @@
+import copy
+import glob
+import os
+import sys
+import warnings
+from logging import getLogger
+
+import ailia
+import cv2
+import h5py
+import numpy as np
+import tqdm
+
+# import original modules
+sys.path.append("../../util")
+# isort: off
+from arg_utils import get_base_parser, get_savepath, update_parser
+from math_utils import sigmoid
+from model_utils import check_and_download_file, check_and_download_models
+
+# isort: on
+
+# import local modules
+from collision_optimization import CollisionNonlinearOptimizer
+from lidar_box3d import LiDARInstance3DBoxes
+from nuscenes_dataset import NuScenesDataset
+from render import Visualizer
+from track_instance import Instances
+from tracker import RuntimeTrackerBase
+
+logger = getLogger(__name__)
+
+
+# ======================
+# Parameters
+# ======================
+
+WEIGHT_BEV_ENC_PATH = "bev_encoder.msfix.onnx"
+MODEL_BEV_ENC_PATH = "bev_encoder.msfix.onnx.prototxt"
+WEIGHT_TRACK_HEAD_PATH = "track_head.onnx"
+MODEL_TRACK_HEAD_PATH = "track_head.onnx.prototxt"
+WEIGHT_MEMORY_BANK_PATH = "memory_bank.onnx"
+MODEL_MEMORY_BANK_PATH = "memory_bank.onnx.prototxt"
+WEIGHT_MEMORY_BANK_UPD_PATH = "memory_bank_update.onnx"
+MODEL_MEMORY_BANK_UPD_PATH = "memory_bank_update.onnx.prototxt"
+WEIGHT_QUERY_INTERACTION_PATH = "query_interact.onnx"
+MODEL_QUERY_INTERACTION_PATH = "query_interact.onnx.prototxt"
+WEIGHT_SEG_HEAD_PATH = "seg_head.onnx"
+MODEL_SEG_HEAD_PATH = "seg_head.onnx.prototxt"
+WEIGHT_MOTION_HEAD_PATH = "motion_head.onnx"
+MODEL_MOTION_HEAD_PATH = "motion_head.onnx.prototxt"
+WEIGHT_OCC_HEAD_PATH = "occ_head.onnx"
+MODEL_OCC_HEAD_PATH = "occ_head.onnx.prototxt"
+WEIGHT_PLANNING_HEAD_PATH = "planning_head.onnx"
+MODEL_PLANNING_HEAD_PATH = "planning_head.onnx.prototxt"
+QUERY_EMBEDDING_FILE = "resources/query_embedding.npy"
+REFERENCE_POINTS_WEIGHT_FILE = "resources/reference_points_weight.npy"
+REFERENCE_POINTS_BIAS_FILE = "resources/reference_points_bias.npy"
+REMOTE_PATH = "https://storage.googleapis.com/ailia-models/uniad/"
+
+DEFAULT_SCENES = ["scene-0102", "scene-0103"]
+DEFAULT_VIS_SCENES = ["scene-0103"]
+SAVE_IMAGE_PATH = "output_video.avi"
+
+
+# ======================
+# Argument Parser Config
+# ======================
+
+parser = get_base_parser("UniAD", None, SAVE_IMAGE_PATH)
+parser.add_argument(
+    "--scenes",
+    type=str,
+    nargs="+",
+    action="append",
+    default=None,
+    help="Test scenes to process. Can be specified as: --scenes scene-0102 scene-0103 OR --scenes scene-0102 --scenes scene-0103. If not specified, all scenes will be processed.",
+)
+parser.add_argument(
+    "--vis_scenes",
+    type=str,
+    nargs="+",
+    action="append",
+    default=None,
+    help="Scenes to visualize. Can be specified as: --vis_scenes scene-0102 scene-0103 OR --vis_scenes scene-0102 --vis_scenes scene-0103. If not specified, all processed scenes will be visualized.",
+)
+parser.add_argument(
+    "--ann_file",
+    type=str,
+    default=None,
+    help="Path to annotation file.
If not specified, data will be prepared from scratch.", +) +parser.add_argument( + "--data_root", + type=str, + default="data/nuscenes/", + help="Path to NuScenes dataset root directory.", +) +parser.add_argument( + "--version", + type=str, + default=None, + help="NuScenes dataset version (e.g., v1.0-trainval, v1.0-mini).", +) +parser.add_argument( + "--results_file", + type=str, + default="all_frames.h5", + help="Path to HDF5 file used to stream per-frame results.", +) +parser.add_argument( + "--visualize", + type=str, + default=None, + help="Path to HDF5 file to visualize (skip inference).", +) +parser.add_argument("--onnx", action="store_true", help="execute onnxruntime version.") +args = update_parser(parser) + + +# ====================== +# Secondary Functions +# ====================== + + +def flatten_args(arg_list): + """Flatten nested lists from nargs='*' with action='append'.""" + if arg_list is None: + return None + flattened = [] + for item in arg_list: + if isinstance(item, list): + flattened.extend(item) + else: + flattened.append(item) + return flattened if flattened else None + + +def save_frame_to_hdf5(h5_group, result, frame_count): + """Save a single frame result to HDF5 group. 
+ + Args: + h5_group: HDF5 file handle + result: Result dictionary containing frame data + frame_count: Frame index number + """ + group = h5_group.create_group(f"frame_{frame_count:06d}") + group.attrs["token"] = result["token"] + group.attrs["scene_token"] = result["scene_token"] + group.attrs["command"] = result["command"] + group.create_dataset("boxes_3d", data=result["boxes_3d"].tensor) + group.create_dataset("scores_3d", data=result["scores_3d"]) + group.create_dataset("labels_3d", data=result["labels_3d"]) + group.create_dataset("track_ids", data=result["track_ids"]) + group.create_dataset("traj", data=result["traj"]) + group.create_dataset("traj_scores", data=result["traj_scores"]) + group.create_dataset("sdc_boxes_3d", data=result["sdc_boxes_3d"].tensor) + group.create_dataset("sdc_scores_3d", data=result["sdc_scores_3d"]) + group.create_dataset("planning_traj", data=result["planning_traj"]) + # Flush to disk immediately + h5_group.flush() + + +def load_frames_from_hdf5(hdf5_file): + """Load all frame results from HDF5 file. 
+ + Args: + hdf5_file: Path to HDF5 file + + Returns: + List of result dictionaries + """ + bbox_results = [] + with h5py.File(hdf5_file, "r") as h5f: + for frame_key in sorted(h5f.keys()): + group = h5f[frame_key] + bbox_results.append( + { + "token": group.attrs["token"], + "scene_token": group.attrs["scene_token"], + "command": group.attrs["command"], + "boxes_3d": LiDARInstance3DBoxes(group["boxes_3d"][:], box_dim=9), + "scores_3d": group["scores_3d"][:], + "labels_3d": group["labels_3d"][:], + "track_ids": group["track_ids"][:], + "traj": group["traj"][:], + "traj_scores": group["traj_scores"][:], + "sdc_boxes_3d": LiDARInstance3DBoxes( + group["sdc_boxes_3d"][:], box_dim=9 + ), + "sdc_scores_3d": group["sdc_scores_3d"][:], + "planning_traj": group["planning_traj"][:], + } + ) + + return bbox_results + + +def inverse_sigmoid(x, eps=1e-5): + """Numerically stable logit.""" + x = np.clip(x, eps, 1 - eps) + return np.log(x) - np.log(1 - x) + + +# Cache for query embedding (load once) +_query_embedding = None + + +def empty_tracks(): + global _query_embedding + + track_instances = Instances((1, 1)) + + if _query_embedding is None: + _query_embedding = np.load(QUERY_EMBEDDING_FILE) + + query = _query_embedding.copy() + num_queries, dim = query.shape + + track_instances.query = query + track_instances.ref_pts = reference_points(query) + + track_instances.obj_idxes = np.full((len(track_instances)), -1, dtype=np.int64) + track_instances.matched_gt_idxes = np.full( + (len(track_instances)), -1, dtype=np.int64 + ) + track_instances.disappear_time = np.zeros((len(track_instances)), dtype=np.int64) + + track_instances.iou = np.zeros((len(track_instances),), dtype=np.float32) + track_instances.scores = np.zeros((len(track_instances),), dtype=np.float32) + track_instances.track_scores = np.zeros((len(track_instances),), dtype=np.float32) + + # init boxes: xy, wl, z, h, sin, cos, vx, vy, vz + num_box_dims = 10 + track_instances.pred_boxes = np.zeros( + (len(track_instances), 
num_box_dims), dtype=np.float32 + ) + + num_classes = 10 # number of classes from the dataset config + track_instances.pred_logits = np.zeros( + (len(track_instances), num_classes), dtype=np.float32 + ) + + mem_bank_len = 4 # typical memory bank length + track_instances.mem_bank = np.zeros( + (len(track_instances), mem_bank_len, dim // 2), dtype=np.float32 + ) + track_instances.mem_padding_mask = np.ones( + (len(track_instances), mem_bank_len), dtype=bool + ) + track_instances.save_period = np.zeros((len(track_instances),), dtype=np.float32) + + return track_instances + + +def velo_update(ref_pts, velocity, l2g_r1, l2g_t1, l2g_r2, l2g_t2, time_delta): + """ + Args: + ref_pts (np.ndarray): (num_query, 3) in inverse sigmoid space. + velocity (np.ndarray): (num_query, 2) m/s in the LiDAR frame. + Outs: + np.ndarray: updated reference points in inverse sigmoid space. + """ + + num_query = ref_pts.shape[0] + velo_pad = np.concatenate( + [velocity, np.zeros((num_query, 1), dtype=velocity.dtype)], axis=-1 + ) + + reference_points = sigmoid(ref_pts) + pc_range = np.array([-51.2, -51.2, -5.0, 51.2, 51.2, 3.0], dtype=np.float32) + reference_points[..., 0:1] = ( + reference_points[..., 0:1] * (pc_range[3] - pc_range[0]) + pc_range[0] + ) + reference_points[..., 1:2] = ( + reference_points[..., 1:2] * (pc_range[4] - pc_range[1]) + pc_range[1] + ) + reference_points[..., 2:3] = ( + reference_points[..., 2:3] * (pc_range[5] - pc_range[2]) + pc_range[2] + ) + reference_points = reference_points + velo_pad * time_delta + + ref_pts = reference_points @ l2g_r1 + l2g_t1 - l2g_t2 + g2l_r = np.linalg.inv(l2g_r2).astype(np.float32) + ref_pts = ref_pts @ g2l_r + + ref_pts[..., 0:1] = (ref_pts[..., 0:1] - pc_range[0]) / (pc_range[3] - pc_range[0]) + ref_pts[..., 1:2] = (ref_pts[..., 1:2] - pc_range[1]) / (pc_range[4] - pc_range[1]) + ref_pts[..., 2:3] = (ref_pts[..., 2:3] - pc_range[2]) / (pc_range[5] - pc_range[2]) + + ref_pts = inverse_sigmoid(ref_pts) + + return ref_pts + + +def 
denormalize_bbox(normalized_bboxes):
+    # rotation
+    rot_sine = normalized_bboxes[..., 6:7]
+
+    rot_cosine = normalized_bboxes[..., 7:8]
+    rot = np.arctan2(rot_sine, rot_cosine)
+
+    # center in the bev
+    cx = normalized_bboxes[..., 0:1]
+    cy = normalized_bboxes[..., 1:2]
+    cz = normalized_bboxes[..., 4:5]
+
+    # size
+    w = normalized_bboxes[..., 2:3]
+    l = normalized_bboxes[..., 3:4]
+    h = normalized_bboxes[..., 5:6]
+
+    w = np.exp(w)
+    l = np.exp(l)
+    h = np.exp(h)
+
+    # velocity
+    vx = normalized_bboxes[:, 8:9]
+    vy = normalized_bboxes[:, 9:10]
+    denormalized_bboxes = np.concatenate([cx, cy, cz, w, l, h, rot, vx, vy], axis=-1)
+
+    return denormalized_bboxes
+
+
+def decode_single(cls_scores, bbox_preds, track_scores, obj_idxes, with_mask=True):
+    """Decode bboxes.
+    Args:
+        cls_scores (np.ndarray): Outputs from the classification head, \
+            shape [num_query, cls_out_channels]. Note \
+            cls_out_channels should include background.
+        bbox_preds (np.ndarray): Outputs from the regression \
+            head with normalized coordinate format (cx, cy, w, l, cz, h, rot_sine, rot_cosine, vx, vy). \
+            Shape [num_query, 9].
+
+    Returns:
+        dict: Decoded boxes.
+ """ + max_num = 300 + max_num = min(cls_scores.shape[0], max_num) + + num_classes = 10 + cls_scores = sigmoid(cls_scores) + indexs = np.argmax(cls_scores, axis=-1) + labels = indexs % num_classes + + bbox_index = np.argpartition(track_scores, -max_num)[-max_num:] + bbox_index = bbox_index[np.argsort(track_scores[bbox_index])[::-1]] + + labels = labels[bbox_index] + bbox_preds = bbox_preds[bbox_index] + track_scores = track_scores[bbox_index] + obj_idxes = obj_idxes[bbox_index] + + scores = track_scores + + final_box_preds = denormalize_bbox(bbox_preds) + final_scores = track_scores + final_preds = labels + + post_center_range = np.array( + [-61.2000, -61.2000, -10.0000, 61.2000, 61.2000, 10.0000] + ) + mask = (final_box_preds[..., :3] >= post_center_range[:3]).all(1) + mask &= (final_box_preds[..., :3] <= post_center_range[3:]).all(1) + if not with_mask: + mask = np.ones_like(mask) > 0 + + boxes3d = final_box_preds[mask] + scores = final_scores[mask] + labels = final_preds[mask] + track_scores = track_scores[mask] + obj_idxes = obj_idxes[mask] + predictions_dict = { + "bboxes": boxes3d, + "scores": scores, + "labels": labels, + "track_scores": track_scores, + "obj_idxes": obj_idxes, + "bbox_index": bbox_index, + "mask": mask, + } + + return predictions_dict + + +def track_instances2results(track_instances, with_mask=True): + all_cls_scores = track_instances.pred_logits + all_bbox_preds = track_instances.pred_boxes + track_scores = track_instances.scores + obj_idxes = track_instances.obj_idxes + bboxes_dict = decode_single( + all_cls_scores, + all_bbox_preds, + track_scores, + obj_idxes, + with_mask, + ) + + bboxes = bboxes_dict["bboxes"] + bboxes = LiDARInstance3DBoxes(bboxes, box_dim=9) + labels = bboxes_dict["labels"] + scores = bboxes_dict["scores"] + bbox_index = bboxes_dict["bbox_index"] + track_scores = bboxes_dict["track_scores"] + obj_idxes = bboxes_dict["obj_idxes"] + result_dict = dict( + boxes_3d=bboxes, + scores_3d=scores, + labels_3d=labels, + 
track_scores=track_scores, + bbox_index=bbox_index, + track_ids=obj_idxes, + mask=bboxes_dict["mask"], + track_bbox_results=[ + [ + bboxes.clone(), + scores.copy(), + labels.copy(), + bbox_index.copy(), + bboxes_dict["mask"].copy(), + ] + ], + ) + return result_dict + + +def det_instances2results(instances, result_dict): + """ + Outs: + active_instances. keys: + - 'pred_logits': + - 'pred_boxes': normalized bboxes + - 'scores' + - 'obj_idxes' + out_dict. keys: + - boxes_3d (np.ndarray): 3D boxes. + - scores (np.ndarray): Prediction scores. + - labels_3d (np.ndarray): Box labels. + - attrs_3d (np.ndarray, optional): Box attributes. + - track_ids + - tracking_score + """ + # filter out sleeping queries + if instances.pred_logits.size == 0: + return [None] + + # decode + with_mask = True + all_cls_scores = instances.pred_logits + all_bbox_preds = instances.pred_boxes + track_scores = instances.scores + obj_idxes = instances.obj_idxes + bboxes_dict = decode_single( + all_cls_scores, + all_bbox_preds, + track_scores, + obj_idxes, + with_mask, + ) + + bboxes = bboxes_dict["bboxes"] + bboxes = LiDARInstance3DBoxes(bboxes, box_dim=9) + labels = bboxes_dict["labels"] + scores = bboxes_dict["scores"] + + track_scores = bboxes_dict["track_scores"] + obj_idxes = bboxes_dict["obj_idxes"] + result_dict.update( + dict( + boxes_3d_det=bboxes, + scores_3d_det=scores, + labels_3d_det=labels, + ) + ) + + return result_dict + + +def get_trajs(preds_dicts, bbox_results): + """ + Generates trajectories from the prediction results and bounding box results.
+ """ + num_samples = len(bbox_results) + num_layers = preds_dicts["all_traj_preds"].shape[0] + ret_list = [] + for i in range(num_samples): + preds = dict() + for j in range(num_layers): + subfix = "_" + str(j) if j < (num_layers - 1) else "" + traj = preds_dicts["all_traj_preds"][j, i] + traj_scores = preds_dicts["all_traj_scores"][j, i] + + preds["traj" + subfix] = traj + preds["traj_scores" + subfix] = traj_scores + ret_list.append(preds) + return ret_list + + +def to_video(folder_path, out_path, fps=4, downsample=1): + imgs_path = glob.glob(os.path.join(folder_path, "*.jpg")) + imgs_path = sorted(imgs_path) + img_array = [] + for img_path in imgs_path: + img = cv2.imread(img_path) + height, width, channel = img.shape + img = cv2.resize( + img, + (width // downsample, height // downsample), + interpolation=cv2.INTER_AREA, + ) + height, width, channel = img.shape + size = (width, height) + img_array.append(img) + out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"DIVX"), fps, size) + for i in range(len(img_array)): + out.write(img_array[i]) + out.release() + + +def visualize_results( + results_file, nusc=None, output_dir="vis_output", fps=4, downsample=2 +): + """Visualize detection and planning results. 
+ + Args: + results_file: Path to HDF5 file containing frame results + nusc: NuScenes dataset instance (optional, will be loaded if None) + output_dir: Directory to save output images + fps: Frames per second for video + downsample: Downsample factor for video resolution + """ + # Determine visualization scenes internally + defaults_on_none = args.ann_file is None and args.version is None + vis_scenes = flatten_args(args.vis_scenes) or ( + DEFAULT_VIS_SCENES if defaults_on_none else None + ) + + # Load frame results from HDF5 for visualization + logger.info("Loading frame results for visualization...") + bbox_results = load_frames_from_hdf5(results_file) + + if len(bbox_results) == 0: + logger.info("No results to visualize.") + sys.exit(1) + + # Load NuScenes dataset if not provided + if nusc is None: + from nuscenes import NuScenes + + logger.info("Loading NuScenes dataset...") + data_root = args.data_root + version = args.version or "v1.0-trainval" + nusc = NuScenes(version=version, dataroot=data_root, verbose=True) + + vis = Visualizer(bbox_results=bbox_results, nuscenes=nusc) + + if vis_scenes: + logger.info(f"Visualizing scenes: {vis_scenes}") + else: + logger.info("Visualizing all processed scenes") + + scene_token_to_name = dict() + for i in range(len(vis.nusc.scene)): + scene_token_to_name[vis.nusc.scene[i]["token"]] = vis.nusc.scene[i]["name"] + + # Clear and create output directory + for f in glob.glob(os.path.join(output_dir, "*")): + if os.path.isfile(f): + os.remove(f) + os.makedirs(output_dir, exist_ok=True) + + for i, sample in enumerate(bbox_results): + sample_token = sample["token"] + scene_token = sample["scene_token"] + + # Get scene name for filtering + scene_name = scene_token_to_name[scene_token] + # Skip if not in vis_scenes list + if vis_scenes and scene_name not in vis_scenes: + continue + + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=RuntimeWarning) + bev_img = vis.visualize_bev(sample_token) + cam_img = 
vis.visualize_cam(sample_token) + out_img = cv2.hconcat([cam_img, bev_img]) + out_img = out_img[:, :, ::-1] # RGB to BGR + + savepath = f"{output_dir}/{str(i).zfill(3)}.jpg" + cv2.imwrite(savepath, out_img) + logger.info(f"saved at : {savepath}") + + savepath = get_savepath(args.savepath, output_dir, ext=".avi") + jpg_paths = glob.glob(os.path.join(output_dir, "*.jpg")) + if not jpg_paths: + logger.info("No frames written; skipping video output.") + return + + to_video(output_dir, savepath, fps=fps, downsample=downsample) + logger.info(f"saved at : {savepath}") + + +# ====================== +# Main functions +# ====================== + + +def predict(models, img, img_metas, prev_bev=None, track_instances=None): + can_bus = img_metas["can_bus"].astype(np.float32) + lidar2img = img_metas["lidar2img"].astype(np.float32) + img_shape = img_metas["img_shape"] + if prev_bev is None: + prev_bev = np.zeros((40000, 1, 256), dtype=np.float32) + + # feedforward + bev_encoder = models["bev_encoder"] + if not args.onnx: + output = bev_encoder.predict([img, can_bus, lidar2img, img_shape, prev_bev]) + else: + output = bev_encoder.run( + None, + { + "img": img, + "can_bus": can_bus, + "lidar2img": lidar2img, + "img_shape": img_shape, + "prev_bev": prev_bev, + }, + ) + bev_embed, bev_pos = output + + query = track_instances.query + ref_pts = track_instances.ref_pts + + # feedforward + track_head = models["track_head"] + if not args.onnx: + output = track_head.predict([bev_embed, query, ref_pts]) + else: + output = track_head.run( + None, + { + "bev_embed": bev_embed, + "query": query, + "ref_pts": ref_pts, + }, + ) + ( + output_classes, + output_coords, + last_ref_pts, + query_feats, + all_past_traj_preds, + ) = output + + out = { + "pred_logits": output_classes, + "pred_boxes": output_coords, + "ref_pts": last_ref_pts, + "bev_embed": bev_embed, + "query_embeddings": query_feats, + "all_past_traj_preds": all_past_traj_preds, + "bev_pos": bev_pos, + } + + return out + + +def 
memory_bank(models, track_instances): + key_padding_mask = track_instances.mem_padding_mask # [n_, memory_bank_len] + valid_idxes = key_padding_mask[:, -1] == 0 + embed = track_instances.output_embedding[valid_idxes] # (n, 256) + + if 0 < len(embed): + net = models["memory_bank"] + if not args.onnx: + output = net.predict( + [ + track_instances.mem_padding_mask, + track_instances.output_embedding, + track_instances.mem_bank, + ] + ) + else: + output = net.run( + None, + { + "mem_padding_mask": track_instances.mem_padding_mask, + "output_embedding": track_instances.output_embedding, + "mem_bank": track_instances.mem_bank, + }, + ) + track_instances.output_embedding = output[0] + + embed = track_instances.output_embedding[:, None] # ( N, 1, 256) + save_period = track_instances.save_period + saved_idxes = (save_period == 0) & (track_instances.scores > 0) + saved_embed = embed[saved_idxes] + + if 0 < len(saved_embed): + net = models["memory_bank_update"] + if not args.onnx: + output = net.predict( + [ + track_instances.output_embedding, + track_instances.scores, + track_instances.mem_padding_mask, + track_instances.save_period, + track_instances.mem_bank, + ] + ) + else: + output = net.run( + None, + { + "output_embedding": track_instances.output_embedding, + "scores": track_instances.scores, + "mem_padding_mask": track_instances.mem_padding_mask, + "save_period": track_instances.save_period, + "mem_bank": track_instances.mem_bank, + }, + ) + mem_padding_mask, mem_bank, save_period = output + track_instances.mem_padding_mask = mem_padding_mask + track_instances.mem_bank = mem_bank + track_instances.save_period = save_period + + return track_instances + + +def query_interact(models, track_instances): + active_track_instances = track_instances[track_instances.obj_idxes >= 0] + + if 0 < len(active_track_instances): + net = models["query_interact"] + if not args.onnx: + output = net.predict( + [active_track_instances.query, active_track_instances.output_embedding] + ) + 
else: + output = net.run( + None, + { + "query": active_track_instances.query, + "output_embedding": active_track_instances.output_embedding, + }, + ) + updated_query = output[0] + active_track_instances.query = updated_query + + merged_track_instances = Instances.cat([empty_tracks(), active_track_instances]) + + return merged_track_instances + + +def seg_head_forward(models, pts_feats): + net = models["seg_head"] + if not args.onnx: + output = net.predict([pts_feats]) + else: + output = net.run( + None, + {"bev_embed": pts_feats}, + ) + (memory, memory_mask, memory_pos, query, query_pos, hw_lvl, hw_lvl2) = output + + out = [ + memory, + memory_mask, + memory_pos, + query, + query_pos, + [hw_lvl, hw_lvl2], + ] + return out + + +def motion_head_forward(models, bev_embed, outs_track: dict, outs_seg: dict): + track_query = outs_track["track_query_embeddings"][None, None, ...] + track_boxes: LiDARInstance3DBoxes = outs_track["track_bbox_results"] + + track_query = np.concatenate( + [track_query, outs_track["sdc_embedding"][None, None, None, :]], axis=2 + ) + sdc_track_boxes = outs_track["sdc_track_bbox_results"] + + track_boxes[0][0].tensor = np.concatenate( + [track_boxes[0][0].tensor, sdc_track_boxes[0][0].tensor], axis=0 + ) + track_boxes[0][1] = np.concatenate( + [track_boxes[0][1], sdc_track_boxes[0][1]], axis=0 + ) + track_boxes[0][2] = np.concatenate( + [track_boxes[0][2], sdc_track_boxes[0][2]], axis=0 + ) + track_boxes[0][3] = np.concatenate( + [track_boxes[0][3], sdc_track_boxes[0][3]], axis=0 + ) + memory, memory_mask, memory_pos, lane_query, lane_query_pos, hw_lvl = outs_seg + + net = models["motion_head"] + if not args.onnx: + output = net.predict( + [ + bev_embed, + track_query, + lane_query, + lane_query_pos, + track_boxes[0][0].tensor, + track_boxes[0][2], + ] + ) + else: + output = net.run( + None, + { + "bev_embed": bev_embed, + "track_query": track_query, + "lane_query": lane_query, + "lane_query_pos": lane_query_pos, + "bboxes": 
track_boxes[0][0].tensor, + "bbox_labels": track_boxes[0][2], + }, + ) + ( + outputs_traj_scores, + outputs_trajs, + valid_traj_masks, + inter_states, + track_query, + track_query_pos, + ) = output + + outs_motion = { + "all_traj_scores": outputs_traj_scores, + "all_traj_preds": outputs_trajs, + "valid_traj_masks": valid_traj_masks, + "traj_query": inter_states, + "track_query": track_query, + "track_query_pos": track_query_pos, + } + + # get_trajs + traj_results = get_trajs(outs_motion, track_boxes) + bboxes, scores, labels, bbox_index, mask = track_boxes[0] + outs_motion["track_scores"] = scores[None, :] + labels[-1] = 0 + + def filter_vehicle_query(outs_motion, labels, vehicle_id_list): + if len(labels) < 1: # No other obj query except sdc query. + return None + + # select vehicle query according to vehicle_id_list + vehicle_mask = np.zeros_like(labels) + for veh_id in vehicle_id_list: + vehicle_mask |= labels == veh_id + outs_motion["traj_query"] = outs_motion["traj_query"][:, :, vehicle_mask > 0] + outs_motion["track_query"] = outs_motion["track_query"][:, vehicle_mask > 0] + outs_motion["track_query_pos"] = outs_motion["track_query_pos"][ + :, vehicle_mask > 0 + ] + outs_motion["track_scores"] = outs_motion["track_scores"][:, vehicle_mask > 0] + return outs_motion + + # Class IDs of the vehicle (movable object) categories in the NuScenes dataset. + # Only dynamic objects are selected as targets for motion prediction. + # 0: car, 1: truck, 2: construction_vehicle, + # 3: bus, 4: trailer, 6: motorcycle, 7: bicycle + # Excluded: 5: barrier, 8: pedestrian, 9: traffic_cone + vehicle_id_list = [ + 0, # car + 1, # truck + 2, # construction_vehicle + 3, # bus + 4, # trailer + 6, # motorcycle + 7, # bicycle + ] + outs_motion = filter_vehicle_query(outs_motion, labels, vehicle_id_list) + + # filter sdc query + outs_motion["sdc_traj_query"] = outs_motion["traj_query"][:, :, -1] + outs_motion["sdc_track_query"] = outs_motion["track_query"][:, -1] + outs_motion["sdc_track_query_pos"] =
outs_motion["track_query_pos"][:, -1] + outs_motion["traj_query"] = outs_motion["traj_query"][:, :, :-1] + outs_motion["track_query"] = outs_motion["track_query"][:, :-1] + outs_motion["track_query_pos"] = outs_motion["track_query_pos"][:, :-1] + outs_motion["track_scores"] = outs_motion["track_scores"][:, :-1] + + return traj_results, outs_motion + + +def occ_head_forward(models, bev_feat, outs_dict, no_query=False): + if no_query: + b, t, h, w = 1, 5, 200, 200 + out_dict = dict() + out_dict["seg_out"] = np.zeros( + (b, t, 1, h, w), dtype=np.int64 + ) # [1, 5, 1, 200, 200] + out_dict["ins_seg_out"] = np.zeros( + (b, t, h, w), dtype=np.int64 + ) # [1, 5, 200, 200] + return out_dict + + traj_query = outs_dict["traj_query"] + track_query = outs_dict["track_query"] + track_query_pos = outs_dict["track_query_pos"] + + net = models["occ_head"] + if not args.onnx: + output = net.predict([bev_feat, traj_query, track_query, track_query_pos]) + else: + output = net.run( + None, + { + "bev_feat": bev_feat, + "traj_query": traj_query, + "track_query": track_query, + "track_query_pos": track_query_pos, + }, + ) + _, pred_ins_logits = output + + out_dict = dict() + out_dict["pred_ins_logits"] = pred_ins_logits + + n_future = 4 + pred_ins_logits = pred_ins_logits[:, :, : 1 + n_future] # [b, q, t, h, w] + pred_ins_sigmoid = sigmoid(pred_ins_logits) # [b, q, t, h, w] + + # with_track_score + track_scores = outs_dict["track_scores"] # [b, q] + track_scores = track_scores[:, :, None, None, None] + pred_ins_sigmoid = pred_ins_sigmoid * track_scores # [b, q, t, h, w] + + out_dict["pred_ins_sigmoid"] = pred_ins_sigmoid + pred_seg_scores = np.max(pred_ins_sigmoid, axis=1) + test_seg_thresh = 0.1 + seg_out = (pred_seg_scores > test_seg_thresh).astype(np.int64) + seg_out = np.expand_dims(seg_out, axis=2) # [b, t, 1, h, w] + out_dict["seg_out"] = seg_out + + def update_instance_ids(instance_seg, old_ids, new_ids): + indices = np.arange(int(old_ids.max()) + 1) + for old_id, new_id in 
zip(old_ids, new_ids): + indices[int(old_id)] = int(new_id) + + return indices[instance_seg].astype(np.int64) + + def make_instance_seg_consecutive(instance_seg): + # Make the indices of instance_seg consecutive + unique_ids = np.unique(instance_seg) # include background + new_ids = np.arange(len(unique_ids)) + instance_seg = update_instance_ids(instance_seg, unique_ids, new_ids) + return instance_seg + + def predict_instance_segmentation_and_trajectories( + foreground_masks, + ins_sigmoid, + vehicles_id=1, + ): + if foreground_masks.ndim == 5 and foreground_masks.shape[2] == 1: + foreground_masks = np.squeeze(foreground_masks, axis=2) # [b, t, h, w] + foreground_masks = ( + foreground_masks == vehicles_id + ) # [b, t, h, w] Only these places have foreground id + + argmax_ins = np.argmax( + ins_sigmoid, axis=1 + ) # long, [b, t, h, w], ins_id starts from 0 + argmax_ins = argmax_ins + 1 # [b, t, h, w], ins_id starts from 1 + instance_seg = (argmax_ins * foreground_masks.astype(np.float32)).astype( + np.int64 + ) # bg is 0, fg starts with 1 + + # Make the indices of instance_seg consecutive + instance_seg = make_instance_seg_consecutive(instance_seg).astype(np.int64) + + return instance_seg + + # ins_pred + pred_consistent_instance_seg = predict_instance_segmentation_and_trajectories( + seg_out, pred_ins_sigmoid + ) # bg is 0, fg starts with 1, consecutive + out_dict["ins_seg_out"] = pred_consistent_instance_seg # [1, 5, 200, 200] + + return out_dict + + +def planning_head_forward(models, bev_embed, outs_motion, outs_occflow, command): + sdc_traj_query = outs_motion["sdc_traj_query"] + sdc_track_query = outs_motion["sdc_track_query"] + bev_pos = outs_motion["bev_pos"] + occ_mask = outs_occflow["seg_out"] + command = np.array([command], dtype=np.int64) + + net = models["planning_head"] + if not args.onnx: + output = net.predict( + [bev_embed, bev_pos, sdc_traj_query, sdc_track_query, command] + ) + else: + output = net.run( + None, + { + "bev_embed": bev_embed, + 
"bev_pos": bev_pos, + "sdc_traj_query": sdc_traj_query, + "sdc_track_query": sdc_track_query, + "command": command, + }, + ) + sdc_traj, sdc_traj_all = output + + def collision_optimization(sdc_traj_all, occ_mask): + """ + Optimize SDC trajectory with occupancy instance mask. + """ + pos_xy_t = [] + valid_occupancy_num = 0 + + if occ_mask.shape[2] == 1: + occ_mask = np.squeeze(occ_mask, axis=2) + occ_horizon = occ_mask.shape[1] + assert occ_horizon == 5 + + bev_h = bev_w = 200 + occ_filter_range = 5.0 + planning_steps = 6 + sigma = 1.0 + alpha_collision = 5.0 + for t in range(planning_steps): + cur_t = min(t + 1, occ_horizon - 1) + pos_xy = np.argwhere(occ_mask[0][cur_t]) + pos_xy = pos_xy[:, [1, 0]] + pos_xy[:, 0] = (pos_xy[:, 0] - bev_h // 2) * 0.5 + 0.25 + pos_xy[:, 1] = (pos_xy[:, 1] - bev_w // 2) * 0.5 + 0.25 + + # filter the occupancy in range + keep_index = ( + np.sum((sdc_traj_all[0, t, :2][None, :] - pos_xy[:, :2]) ** 2, axis=-1) + < occ_filter_range**2 + ) + pos_xy_t.append(pos_xy[keep_index]) + valid_occupancy_num += np.sum(keep_index > 0) + if valid_occupancy_num == 0: + return sdc_traj_all + + col_optimizer = CollisionNonlinearOptimizer( + planning_steps, 0.5, sigma, alpha_collision, pos_xy_t + ) + col_optimizer.set_reference_trajectory(sdc_traj_all[0]) + sol = col_optimizer.solve() + sdc_traj_optim = np.stack( + [sol.value(col_optimizer.position_x), sol.value(col_optimizer.position_y)], + axis=-1, + ) + return sdc_traj_optim[None].astype(sdc_traj_all.dtype) + + sdc_traj_all = collision_optimization(sdc_traj_all, occ_mask) + + result_planning = { + "sdc_traj": sdc_traj_all, + "sdc_traj_all": sdc_traj_all, + } + return result_planning + + +# Cache for reference points weights and bias (loaded once) +_reference_points_weight = None +_reference_points_bias = None + + +def reference_points(query): + global _reference_points_weight, _reference_points_bias + + if _reference_points_weight is None: + _reference_points_weight = 
np.load(REFERENCE_POINTS_WEIGHT_FILE) + _reference_points_bias = np.load(REFERENCE_POINTS_BIAS_FILE) + + num_queries, dim = query.shape + ref_pts = ( + query[..., : dim // 2] @ _reference_points_weight.T + _reference_points_bias + ) + return ref_pts + + +def forward( + models, + track_base: RuntimeTrackerBase, + img, + img_metas, + track_instances, + prev_bev=None, + l2g_r1=None, + l2g_t1=None, + l2g_r2=None, + l2g_t2=None, + time_delta=None, +): + active_inst = track_instances[track_instances.obj_idxes >= 0] + other_inst = track_instances[track_instances.obj_idxes < 0] + + if l2g_r2 is not None and len(active_inst) > 0 and l2g_r1 is not None: + ref_pts = active_inst.ref_pts + velo = active_inst.pred_boxes[:, -2:] + ref_pts = velo_update( + ref_pts, velo, l2g_r1, l2g_t1, l2g_r2, l2g_t2, time_delta=time_delta + ) + ref_pts = np.squeeze(ref_pts, axis=0) + active_inst.ref_pts = reference_points(active_inst.query) + active_inst.ref_pts[..., :2] = ref_pts[..., :2] + + track_instances = Instances.cat([other_inst, active_inst]) + + output = predict( + models, img, img_metas, prev_bev=prev_bev, track_instances=track_instances + ) + bev_embed = output["bev_embed"] + output_classes = output["pred_logits"] + output_coords = output["pred_boxes"] + last_ref_pts = output["ref_pts"] + query_feats = output["query_embeddings"] + all_past_traj_preds = output["all_past_traj_preds"] + bev_pos = output["bev_pos"] + + item = { + "pred_logits": output_classes, + "pred_boxes": output_coords, + "ref_pts": last_ref_pts, + "bev_embed": bev_embed, + "query_embeddings": query_feats, + "all_past_traj_preds": all_past_traj_preds, + "bev_pos": bev_pos, + } + + # Apply sigmoid and get max values along last axis + logits = output_classes[-1, 0, :] + sigmoid_logits = sigmoid(logits) + track_scores = np.max(sigmoid_logits, axis=-1) + + # each track will be assigned a unique global id by the track base.
+ + track_instances.scores = track_scores + # track_instances.track_scores = track_scores # [300] + track_instances.pred_logits = output_classes[-1, 0] # [300, num_cls] + track_instances.pred_boxes = output_coords[-1, 0] # [300, box_dim] + track_instances.output_embedding = query_feats[-1][0] # [300, feat_dim] + track_instances.ref_pts = last_ref_pts[0] + # hard-coded: the 901st query (index 900) is assumed to be the sdc query + track_instances.obj_idxes[900] = -2 + + """ update track base """ + track_base.update(track_instances) + + filter_score_thresh = 0.35 + active_index = (track_instances.obj_idxes >= 0) & ( + track_instances.scores >= filter_score_thresh + ) # filter out sleeping objects + # select_active_track_query + result_dict = track_instances2results(track_instances[active_index], with_mask=True) + result_dict["track_query_embeddings"] = track_instances.output_embedding[ + active_index + ][result_dict["bbox_index"]][result_dict["mask"]] + result_dict["track_query_matched_idxes"] = track_instances.matched_gt_idxes[ + active_index + ][result_dict["bbox_index"]][result_dict["mask"]] + item.update(result_dict) + # select_sdc_track_query + sdc_instance = track_instances[track_instances.obj_idxes == -2] + result_dict = track_instances2results(sdc_instance, with_mask=False) + item.update( + dict( + sdc_boxes_3d=result_dict["boxes_3d"], + sdc_scores_3d=result_dict["scores_3d"], + sdc_track_scores=result_dict["track_scores"], + sdc_track_bbox_results=result_dict["track_bbox_results"], + sdc_embedding=sdc_instance.output_embedding[0], + ) + ) + + """ update with memory_bank """ + track_instances = memory_bank(models, track_instances) + + """ Update track instances using matcher """ + item["track_instances_fordet"] = track_instances + item["track_instances"] = track_instances = query_interact(models, track_instances) + + return item + + +def process_all_frames(models, data_loader, track_base, h5f): + """Process all frames from data_loader and stream results to HDF5.
+ + Args: + models: Dict of initialized model sessions + data_loader: Iterable yielding dataset items + track_base: Tracker base managing IDs and states + h5f: Open HDF5 file handle to write per-frame results + """ + + track_instances = None + timestamp = None + scene_token = None + prev_bev = None + prev_frame_info = { + "prev_bev": None, + "scene_token": None, + "prev_pos": 0, + "prev_angle": 0, + } + + for i, data in tqdm.tqdm( + enumerate(data_loader), total=len(data_loader), desc="Processing frames" + ): + data = data[0] + img = data["img"] + img_metas = data["img_metas"] + l2g_t = data["l2g_t"] + l2g_r_mat = data["l2g_r_mat"] + _timestamp = data["timestamp"] + + img = np.stack(img).transpose(0, 3, 1, 2) # N, C, H, W + img = np.expand_dims(img, axis=0) # B, N, C, H, W + + if img_metas["scene_token"] != prev_frame_info["scene_token"]: + # the first sample of each scene is truncated + prev_frame_info["prev_bev"] = None + # update idx + prev_frame_info["scene_token"] = img_metas["scene_token"] + + # Get the delta of ego position and angle between two timestamps. + tmp_pos = copy.deepcopy(img_metas["can_bus"][:3]) + tmp_angle = copy.deepcopy(img_metas["can_bus"][-1]) + # first frame + if prev_frame_info["scene_token"] is None: + img_metas["can_bus"][:3] = 0 + img_metas["can_bus"][-1] = 0 + # following frames + else: + img_metas["can_bus"][:3] -= prev_frame_info["prev_pos"] + img_metas["can_bus"][-1] -= prev_frame_info["prev_angle"] + prev_frame_info["prev_pos"] = tmp_pos + prev_frame_info["prev_angle"] = tmp_angle + + ### simple_test_track logic ### + + if track_instances is None or img_metas["scene_token"] != scene_token: + scene_token = img_metas["scene_token"] + prev_bev = None + track_instances = empty_tracks() + time_delta, l2g_r1, l2g_t1, l2g_r2, l2g_t2 = None, None, None, None, None + else: + time_delta = _timestamp - timestamp + l2g_r1 = l2g_r2 + l2g_t1 = l2g_t2 + l2g_r2 = l2g_r_mat[None, ...] + l2g_t2 = l2g_t[None, ...]
+ + frame_res = forward( + models, + track_base, + img, + img_metas, + track_instances, + prev_bev, + l2g_r1, + l2g_t1, + l2g_r2, + l2g_t2, + time_delta, + ) + + """ get time_delta and l2g r/t infos """ + """ update frame info for next frame""" + timestamp = _timestamp + l2g_r2 = l2g_r_mat[None, ...] + l2g_t2 = l2g_t[None, ...] + + bev_embed = prev_bev = frame_res["bev_embed"] + track_instances = frame_res["track_instances"] + track_instances_fordet = frame_res["track_instances_fordet"] + + result_dict = dict( + bev_embed=bev_embed, + bev_pos=frame_res["bev_pos"], + track_query_embeddings=frame_res["track_query_embeddings"], + track_bbox_results=frame_res["track_bbox_results"], + boxes_3d=frame_res["boxes_3d"], + scores_3d=frame_res["scores_3d"], + labels_3d=frame_res["labels_3d"], + track_scores=frame_res["track_scores"], + track_ids=frame_res["track_ids"], + sdc_boxes_3d=frame_res["sdc_boxes_3d"], + sdc_scores_3d=frame_res["sdc_scores_3d"], + sdc_track_scores=frame_res["sdc_track_scores"], + sdc_track_bbox_results=frame_res["sdc_track_bbox_results"], + sdc_embedding=frame_res["sdc_embedding"], + ) + + result_track = det_instances2results(track_instances_fordet, result_dict) + + ### End of simple_test_track ### + + # seg_head + result_seg = seg_head_forward(models, bev_embed) + # motion_head + result_motion, outs_motion = motion_head_forward( + models, bev_embed, outs_track=result_track, outs_seg=result_seg + ) + outs_motion["bev_pos"] = result_track["bev_pos"] + + result = dict() + result["token"] = img_metas["sample_idx"] + result["scene_token"] = img_metas["scene_token"] + + occ_no_query = outs_motion["track_query"].shape[1] == 0 + outs_occ = occ_head_forward( + models, + bev_embed, + outs_motion, + no_query=occ_no_query, + ) + result["occ"] = outs_occ + + command = data["command"] + planning_gt = dict(command=command) + result_planning = planning_head_forward( + models, bev_embed, outs_motion, outs_occ, command + ) + result["planning"] = dict( +
planning_gt=planning_gt, + result_planning=result_planning, + ) + result.update(result_track) + result.update( + result_motion[0] + ) # 'traj_0', 'traj_scores_0', 'traj_1', 'traj_scores_1', 'traj', 'traj_scores' + + ### End of forward_test ### + + result["planning_traj"] = result["planning"]["result_planning"]["sdc_traj"] + result["command"] = result["planning"]["planning_gt"]["command"] + + # Stream frame results to HDF5 file + save_frame_to_hdf5(h5f, result, i) + + # Clear result from memory + del result + + +def recognize_from_image(models): + defaults_on_none = args.ann_file is None and args.version is None + + ann_file = args.ann_file + data_root = args.data_root + version = args.version or "v1.0-trainval" + test_scenes = flatten_args(args.scenes) or ( + DEFAULT_SCENES if defaults_on_none else None + ) + + if ann_file: + logger.info(f"Using annotation file: {ann_file}") + else: + logger.info(f"Data root: {data_root}") + logger.info(f"Dataset version: {version}") + logger.info(f"Processing test scenes: {test_scenes or 'ALL'}") + + dataset = NuScenesDataset( + data_root=data_root, + version=version, + ann_file=ann_file, + test_scenes=test_scenes, + ) + + if len(dataset) == 0 and not defaults_on_none: + logger.info( + "No data found in the dataset. Please check if the scene name is correct." 
+ ) + sys.exit(1) + + # Simple DataLoader replacement + class SimpleDataLoader: + def __init__(self, dataset): + self.dataset = dataset + + def __iter__(self): + for i in range(len(self.dataset)): + yield [self.dataset[i]] + + def __len__(self): + return len(self.dataset) + + data_loader = SimpleDataLoader(dataset) + + # Determine results file path + if args.visualize: + # Visualize mode: skip inference, use provided HDF5 file + results_file = args.visualize + else: + # Inference mode: run process_all_frames and save to results_file + results_file = args.results_file + with h5py.File(results_file, "w") as h5f: + logger.info(f"All frames streamed to: {results_file}") + + score_thresh = 0.4 + filter_score_thresh = 0.35 + miss_tolerance = 5 + track_base = RuntimeTrackerBase( + score_thresh=score_thresh, + filter_score_thresh=filter_score_thresh, + miss_tolerance=miss_tolerance, + ) + process_all_frames(models, data_loader, track_base, h5f) + + # Visualize results + visualize_results( + results_file=results_file, nusc=dataset.nusc, output_dir="vis_output" + ) + + logger.info("Script finished successfully.") + + +def main(): + # Skip model initialization if only visualizing + if args.visualize: + visualize_results(args.visualize) + return + + # model files check and download + check_and_download_models(WEIGHT_BEV_ENC_PATH, MODEL_BEV_ENC_PATH, REMOTE_PATH) + check_and_download_models( + WEIGHT_TRACK_HEAD_PATH, MODEL_TRACK_HEAD_PATH, REMOTE_PATH + ) + check_and_download_models( + WEIGHT_MEMORY_BANK_PATH, MODEL_MEMORY_BANK_PATH, REMOTE_PATH + ) + check_and_download_models( + WEIGHT_MEMORY_BANK_UPD_PATH, MODEL_MEMORY_BANK_UPD_PATH, REMOTE_PATH + ) + check_and_download_models( + WEIGHT_QUERY_INTERACTION_PATH, MODEL_QUERY_INTERACTION_PATH, REMOTE_PATH + ) + check_and_download_models(WEIGHT_SEG_HEAD_PATH, MODEL_SEG_HEAD_PATH, REMOTE_PATH) + check_and_download_models( + WEIGHT_MOTION_HEAD_PATH, MODEL_MOTION_HEAD_PATH, REMOTE_PATH + ) + 
check_and_download_models(WEIGHT_OCC_HEAD_PATH, MODEL_OCC_HEAD_PATH, REMOTE_PATH) + check_and_download_models( + WEIGHT_PLANNING_HEAD_PATH, MODEL_PLANNING_HEAD_PATH, REMOTE_PATH + ) + check_and_download_file(QUERY_EMBEDDING_FILE, REMOTE_PATH) + check_and_download_file(REFERENCE_POINTS_WEIGHT_FILE, REMOTE_PATH) + check_and_download_file(REFERENCE_POINTS_BIAS_FILE, REMOTE_PATH) + + env_id = args.env_id + + # initialize + if not args.onnx: + bev_encoder = ailia.Net(MODEL_BEV_ENC_PATH, WEIGHT_BEV_ENC_PATH, env_id=env_id) + track_head = ailia.Net( + MODEL_TRACK_HEAD_PATH, WEIGHT_TRACK_HEAD_PATH, env_id=env_id + ) + memory_bank = ailia.Net( + MODEL_MEMORY_BANK_PATH, WEIGHT_MEMORY_BANK_PATH, env_id=env_id + ) + memory_bank_update = ailia.Net( + MODEL_MEMORY_BANK_UPD_PATH, WEIGHT_MEMORY_BANK_UPD_PATH, env_id=env_id + ) + query_interact = ailia.Net( + MODEL_QUERY_INTERACTION_PATH, WEIGHT_QUERY_INTERACTION_PATH, env_id=env_id + ) + seg_head = ailia.Net(MODEL_SEG_HEAD_PATH, WEIGHT_SEG_HEAD_PATH, env_id=env_id) + motion_head = ailia.Net( + MODEL_MOTION_HEAD_PATH, WEIGHT_MOTION_HEAD_PATH, env_id=env_id + ) + occ_head = ailia.Net(MODEL_OCC_HEAD_PATH, WEIGHT_OCC_HEAD_PATH, env_id=env_id) + planning_head = ailia.Net( + MODEL_PLANNING_HEAD_PATH, WEIGHT_PLANNING_HEAD_PATH, env_id=env_id + ) + else: + import onnxruntime + + providers = ["CUDAExecutionProvider", "CPUExecutionProvider"] + bev_encoder = onnxruntime.InferenceSession( + WEIGHT_BEV_ENC_PATH, providers=providers + ) + track_head = onnxruntime.InferenceSession( + WEIGHT_TRACK_HEAD_PATH, providers=providers + ) + memory_bank = onnxruntime.InferenceSession( + WEIGHT_MEMORY_BANK_PATH, providers=providers + ) + memory_bank_update = onnxruntime.InferenceSession( + WEIGHT_MEMORY_BANK_UPD_PATH, providers=providers + ) + query_interact = onnxruntime.InferenceSession( + WEIGHT_QUERY_INTERACTION_PATH, providers=providers + ) + seg_head = onnxruntime.InferenceSession( + WEIGHT_SEG_HEAD_PATH, providers=providers + ) + motion_head = 
onnxruntime.InferenceSession( + WEIGHT_MOTION_HEAD_PATH, providers=providers + ) + occ_head = onnxruntime.InferenceSession( + WEIGHT_OCC_HEAD_PATH, providers=providers + ) + planning_head = onnxruntime.InferenceSession( + WEIGHT_PLANNING_HEAD_PATH, providers=providers + ) + + models = dict( + bev_encoder=bev_encoder, + track_head=track_head, + memory_bank=memory_bank, + memory_bank_update=memory_bank_update, + query_interact=query_interact, + seg_head=seg_head, + motion_head=motion_head, + occ_head=occ_head, + planning_head=planning_head, + ) + recognize_from_image(models) + + +if __name__ == "__main__": + main() diff --git a/autonomous_driving/uniad/vis_output/.gitignore b/autonomous_driving/uniad/vis_output/.gitignore new file mode 100644 index 000000000..4bc404475 --- /dev/null +++ b/autonomous_driving/uniad/vis_output/.gitignore @@ -0,0 +1 @@ +*.jpg \ No newline at end of file diff --git a/scripts/download_all_models.sh b/scripts/download_all_models.sh index 8cb2c88b6..28189a34e 100755 --- a/scripts/download_all_models.sh +++ b/scripts/download_all_models.sh @@ -47,6 +47,7 @@ cd ../../audio_processing/pytorch_wavenet/; python3 pytorch_wavenet.py ${OPTION} cd ../../audio_processing/audiosep/; python3 audiosep.py ${OPTION} cd ../../audio_processing/cosyvoice2/; python3 cosyvoice2.py ${OPTION} cd ../../audio_processing/sensevoice/; python3 sensevoice.py ${OPTION} +#cd ../../autonomous_driving/uniad/; python3 uniad.py ${OPTION} cd ../../background_ramoval/deep-image-matting; python3 deep-image-matting.py ${OPTION} cd ../../background_ramoval/u2net; python3 u2net.py ${OPTION} cd ../../background_ramoval/u2net; python3 u2net.py -a small ${OPTION}
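Review note: the top-300 box selection in `decode_single` is a standard two-step top-k: `np.argpartition` finds the k largest scores in O(n), then a second `argsort` orders only those k entries. A self-contained sketch of the pattern (the helper name `topk_indices` is illustrative, not part of the patch):

```python
import numpy as np

# Two-step top-k as used in decode_single's bbox_index computation:
# argpartition returns the k largest scores in arbitrary order, and a
# follow-up argsort over just those k entries sorts them descending.
def topk_indices(scores: np.ndarray, k: int) -> np.ndarray:
    k = min(k, scores.shape[0])
    idx = np.argpartition(scores, -k)[-k:]     # unordered top-k indices
    return idx[np.argsort(scores[idx])[::-1]]  # sorted by descending score

scores = np.array([0.1, 0.9, 0.4, 0.7])
print(topk_indices(scores, 2))  # [1 3]
```

The partition step keeps the selection linear in the number of queries; only the 300 retained tracks pay the sort cost.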
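Review note: `denormalize_bbox` assumes the 10-dim normalized layout `(cx, cy, w, l, cz, h, rot_sine, rot_cosine, vx, vy)` with log-encoded sizes and yaw split into sine/cosine. A minimal round-trip check of that encoding, using a local mirror of the patch's decoding logic:

```python
import numpy as np

# Local mirror of the patch's denormalize_bbox, for a round-trip sanity check.
def denormalize_bbox(nb):
    rot = np.arctan2(nb[..., 6:7], nb[..., 7:8])          # yaw from (sin, cos)
    cx, cy, cz = nb[..., 0:1], nb[..., 1:2], nb[..., 4:5]  # BEV center + height
    w = np.exp(nb[..., 2:3])                               # sizes are log-encoded
    l = np.exp(nb[..., 3:4])
    h = np.exp(nb[..., 5:6])
    vx, vy = nb[..., 8:9], nb[..., 9:10]
    return np.concatenate([cx, cy, cz, w, l, h, rot, vx, vy], axis=-1)

# Encode a box with yaw 0.3 rad, size (w=1.8, l=4.5, h=1.6), velocity (1, 0).
encoded = np.array([[10.0, -2.0, np.log(1.8), np.log(4.5), 0.5, np.log(1.6),
                     np.sin(0.3), np.cos(0.3), 1.0, 0.0]])
decoded = denormalize_bbox(encoded)
# decoded rows are (cx, cy, cz, w, l, h, yaw, vx, vy), box_dim=9
```

The 9-dim decoded layout is what `LiDARInstance3DBoxes(bboxes, box_dim=9)` receives downstream.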
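Review note: the instance-id relabeling inside `occ_head_forward` (`make_instance_seg_consecutive` / `update_instance_ids`) is a lookup-table remap that compacts sparse instance ids into consecutive ids starting at 0. A standalone sketch (the name `make_consecutive` is illustrative, not a function in the patch):

```python
import numpy as np

# Compact sparse instance ids to consecutive 0..K-1 via a lookup table;
# background (0) is always present in unique_ids and stays mapped to 0.
def make_consecutive(instance_seg: np.ndarray) -> np.ndarray:
    unique_ids = np.unique(instance_seg)          # sorted, includes background
    lut = np.arange(int(unique_ids.max()) + 1)    # identity lookup table
    lut[unique_ids] = np.arange(len(unique_ids))  # remap kept ids to 0..K-1
    return lut[instance_seg].astype(np.int64)

seg = np.array([[0, 3, 3], [7, 0, 3]])
print(make_consecutive(seg))  # ids {0, 3, 7} -> {0, 1, 2}
```

Indexing the table with the whole segmentation map (`lut[instance_seg]`) vectorizes the remap over every pixel at once.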
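Review note: `collision_optimization` converts occupied 200x200 BEV grid cells to ego-frame meters with `(idx - bev_h // 2) * 0.5 + 0.25`, i.e. a 0.5 m cell size and a +0.25 m half-cell offset to the cell center. The mapping can be checked in isolation:

```python
import numpy as np

# Grid-to-metric conversion assumed by collision_optimization:
# 200x200 BEV grid centered on the ego vehicle, 0.5 m per cell,
# +0.25 m moving from the cell corner to the cell center.
bev_h = bev_w = 200
cells = np.array([[100, 100], [0, 0], [199, 199]], dtype=np.float64)
meters = (cells - bev_h // 2) * 0.5 + 0.25
print(meters)  # center cell ~0.25 m; corners at roughly +/-49.75 m
```

This places the ego vehicle near grid cell (100, 100) and spans roughly a 100 m x 100 m area, consistent with the `[-61.2, 61.2]` post-center range used for box filtering.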