
Feat: Unified Scene Data Flow #56

Open

WCJ-BERT wants to merge 17 commits into NVlabs:main from WCJ-BERT:adapt_to_new_runtime

Conversation

@WCJ-BERT

Summary

This PR migrates the AlpaSim runtime from USDZ artifact-based scene loading to trajdata's UnifiedDataset. It introduces a unified interface for loading scenes from multiple autonomous driving data formats (e.g., USDZ, NuPlan, Waymo) and improves memory efficiency through lazy loading.

Key Changes

Breaking Changes

  • RuntimeContext no longer stores full SceneDataSource objects; it now uses a lightweight scene_id_to_idx mapping.
  • DaemonEngine now loads scenes on demand in each worker.
  • build_pending_jobs_from_request() now takes a callable instead of a dict.
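The callable-based pattern from the last bullet can be sketched as follows. This is a hypothetical illustration only; the real signature of build_pending_jobs_from_request in alpasim_runtime may differ.

```python
from typing import Any, Callable


def build_pending_jobs_from_request(
    scene_ids: list[str],
    get_data_source: Callable[[str], Any],
) -> list[dict]:
    # Instead of receiving a pre-built {scene_id: data_source} dict, the
    # function accepts a callable, so each worker can resolve its scene
    # lazily on first use rather than materializing every scene upfront.
    return [
        {"scene_id": sid, "load": lambda sid=sid: get_data_source(sid)}
        for sid in scene_ids
    ]


jobs = build_pending_jobs_from_request(["scene_a"], lambda sid: f"loaded:{sid}")
```

Nothing is loaded until a job's `load` callable is invoked, which is what makes the per-worker lazy loading possible.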

New Functionality

  • Added a SceneDataSource protocol to abstract scene data access.
  • Added TrajdataDataSource for direct trajdata-based scene loading.
  • Added a prepare_data CLI tool for preprocessing trajdata caches.
  • Enabled lazy scene loading in workers to reduce memory usage.

Code and Documentation Updates

  • Updated existing tests for the new data flow and added integration tests for trajdata functionality.
  • Added data preparation instructions to README and TUTORIAL.md.
  • Documented the trajdata-alpasim dependency in ONBOARDING.
  • Updated wizard config defaults for data_source.

Collaborator

@mwatson-nvidia left a comment


Thanks for the submission!

I haven't tested this internally yet, and we'll want to do that before proceeding with this PR. Before I do, though, I think it's worth addressing/discussing this first round of comments.

Comment thread src/utils/alpasim_utils/trajdata_data_source.py Outdated
Comment thread src/wizard/configs/base_config.yaml Outdated
Comment thread src/wizard/configs/base_config.yaml Outdated
# Optional: Base path for MTGS rendering assets
asset_base_path: null

# YAML config mode parameters (for NuPlan central token processing)
Collaborator


This data structure is pretty flat, which means that as we add more data source types it may grow unwieldy, and it will be hard for readers to tell which fields are needed by which data sources and which fields affect which source types. I wonder if it makes sense to switch to a more hierarchical structure now. Do you have any thoughts on this?

Author

@WCJ-BERT Mar 20, 2026


What about this structure?

runtime:
  data_source:
    cache_location: "/path/to/cache"
    incl_vector_map: true
    rebuild_cache: false
    rebuild_maps: false
    num_workers: 4

    sources:
      usdz:
        enabled: true
        data_dir: "/path/to/usdz"
        desired_dt: 0.1
        extra_params:
          smooth_trajectories: true

      nuplan:
        enabled: false
        data_dir: "/path/to/nuplan"
        extra_params:
          config_dir: "/path/to/yaml"
          asset_base_path: null
          num_timesteps_before: 30
          num_timesteps_after: 80

      waymo:
        enabled: false
        data_dir: "/path/to/waymo"

Also, I introduced a GenericSourceConfig in config.py to support the hierarchical structure.

@dataclass
class GenericSourceConfig:
    """Generic configuration for any trajdata-supported dataset.

    This unified config supports all trajdata datasets (USDZ, NuPlan, Waymo,
    nuScenes, Lyft, Argoverse, etc.) with a flexible extra_params field for
    dataset-specific options.

    Attributes:
        enabled: Whether this data source is enabled
        data_dir: Path to dataset directory
        desired_dt: Desired time delta between trajectory frames in seconds
        extra_params: Dataset-specific parameters (e.g., NuPlan's config_dir,
                      USDZ's asset_base_path, etc.)

    Example extra_params:
        - NuPlan: {"config_dir": "/path", "num_timesteps_before": 30, "num_timesteps_after": 80}
        - USDZ: {"asset_base_path": "/assets"}
        - Waymo: {} (no extra params needed)
    """

    enabled: bool = True
    data_dir: str = MISSING
    desired_dt: float = 0.1  # 10 Hz sampling
    extra_params: Dict[str, Any] = field(default_factory=dict)

Collaborator


This is closer, but it still has the issue that a bunch of information might be present but unused or misleading. What are your thoughts on something closer to:

@dataclass
class DataSourceConfig:
    cache_location: str = MISSING
    desired_dt: float = 0.1
    incl_vector_map: bool = True
    rebuild_cache: bool = False
    rebuild_maps: bool = False
    num_workers: int = 1

    # Only one active at a time (or a list of these)
    usdz: Optional[UsdzSourceConfig] = None
    nuplan: Optional[NuPlanSourceConfig] = None
    waymo: Optional[WaymoSourceConfig] = None

Author

@WCJ-BERT Apr 2, 2026


Thanks for the review! I've carefully considered both approaches and implemented the explicit field structure as requested. However, after implementation, I'd like to discuss which approach better serves our long-term goals.

Two Approaches Comparison

I've explored both structures in detail. Here's what I found:

Approach 1: Explicit Fields

config.py

@dataclass
class USDZSourceConfig:
    data_dir: str = MISSING
    extra_params: Dict[str, Any] = field(default_factory=dict)

@dataclass
class NuPlanSourceConfig:
    data_dir: str = MISSING
    extra_params: Dict[str, Any] = field(default_factory=dict)

@dataclass
class DataSourceConfig:
    cache_location: str = MISSING
    desired_dt: float = 0.1
    incl_vector_map: bool = True
    rebuild_cache: bool = False
    rebuild_maps: bool = False
    num_workers: int = 1

    # Only one active at a time (or a list of these)
    usdz: Optional[USDZSourceConfig] = None
    nuplan: Optional[NuPlanSourceConfig] = None
    waymo: Optional[WaymoSourceConfig] = None

YAML Config:

data_source:
  cache_location: "/path/to/cache"
  desired_dt: 0.1
  incl_vector_map: true
  
  nuplan:
    data_dir: "/path/to/nuplan"
    extra_params:
      config_dir: "/path/to/configs"
      num_timesteps_before: 30
      num_timesteps_after: 80

Approach 2: Generic Sources (Current Implementation)

config.py:

@dataclass
class GenericSourceConfig:
    data_dir: str = MISSING
    extra_params: Dict[str, Any] = field(default_factory=dict)

@dataclass
class DataSourceConfig:
    cache_location: str = MISSING
    desired_dt: float = 0.1
    incl_vector_map: bool = True
    rebuild_cache: bool = False
    rebuild_maps: bool = False
    num_workers: int = 1
    
    # Extensible: supports any trajdata dataset
    sources: Dict[str, GenericSourceConfig] = field(default_factory=dict)

YAML Config:

data_source:
  cache_location: "/path/to/cache"
  desired_dt: 0.1
  incl_vector_map: true
  
  sources:
    nuplan:
      data_dir: "/path/to/nuplan"
      extra_params:
        config_dir: "/path/to/configs"
        num_timesteps_before: 30
        num_timesteps_after: 80
    
    nuscenes:  # Easy to add without code changes
      data_dir: "/path/to/nuscenes"
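To make the comparison concrete, here is a minimal sketch of how Approach 2's sources dict could be flattened into kwargs for trajdata's UnifiedDataset. The keyword names (desired_data, data_dirs, cache_location, etc.) are assumed from the trajdata API and should be checked against the pinned trajdata-alpasim branch; this is not the PR's actual to_trajdata_params() implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class GenericSourceConfig:
    data_dir: str = ""
    extra_params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class DataSourceConfig:
    cache_location: str = ""
    desired_dt: float = 0.1
    incl_vector_map: bool = True
    sources: Dict[str, GenericSourceConfig] = field(default_factory=dict)

    def to_trajdata_params(self) -> Dict[str, Any]:
        # Shape the hierarchical config into a flat kwargs dict.
        params: Dict[str, Any] = {
            "desired_data": sorted(self.sources),
            "data_dirs": {n: s.data_dir for n, s in self.sources.items()},
            "cache_location": self.cache_location,
            "desired_dt": self.desired_dt,
            "incl_vector_map": self.incl_vector_map,
        }
        for src in self.sources.values():
            params.update(src.extra_params)  # dataset-specific options
        return params


cfg = DataSourceConfig(
    cache_location="/path/to/cache",
    sources={
        "nuplan": GenericSourceConfig(
            data_dir="/path/to/nuplan",
            extra_params={"num_timesteps_before": 30},
        )
    },
)
params = cfg.to_trajdata_params()
```

Adding a new dataset under sources then requires no code change, only a new YAML key, which is the extensibility argument below.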

My Recommendation

I believe Approach 2 better aligns with our architecture goals:

  1. trajdata's UnifiedDataset is designed for extensibility - it accepts arbitrary dataset names and parameters through its API. Our config should mirror this flexibility.
  2. Maintenance burden - With explicit fields, every new dataset or trajdata API change requires coordinated updates across config.py, to_trajdata_params(), and scene_loader.py.
  3. Real-world usage - In practice, all dataset-specific parameters go in extra_params anyway, so we don't actually gain type safety from separate Config classes.
  4. Proven pattern - The original design successfully supported multiple datasets and has been working well.

What do you think? I'm happy to keep the explicit field approach if you feel the type distinction is important, but I wanted to share these considerations.

Collaborator


I think I agree with you about approach 2 being a bit better. I'd say go ahead and implement that.

Comment thread src/runtime/alpasim_runtime/unbound_rollout.py Outdated
Comment thread src/wizard/configs/base_config.yaml Outdated
Comment thread docs/TUTORIAL.md Outdated
Comment thread src/runtime/alpasim_runtime/daemon/engine.py Outdated
@@ -0,0 +1,720 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator


Concerns from Claude:

Several concerns:

  1. load_config_from_file unpacks a dataclass into a dict, then everything uses config["key"] / config.get("key", default). This throws away the type safety from DataSourceConfig. The function should just return the DataSourceConfig (or UserSimulatorConfig) and the callers should use it directly. The dict intermediary means defaults are re-specified in multiple places (e.g., num_workers defaults to 8 on lines 653 and 710, but 1 in the dataclass).

  2. Inconsistent defaults across paths. desired_dt defaults to 0.1 via CLI (line 415), 0.5 in preprocess_from_yaml_configs (line 164) and run_yaml_batch_preprocessing (line 555), and 0.1 in DataSourceConfig. num_workers defaults to 1 in the dataclass, 1 in CLI, but 8 in the config.get() calls (lines 653, 710). These will silently produce different behavior depending on which code path you take.

  3. Commented-out code. Lines 135-141 have a commented-out "extract every central_token" block — this is dead code that should either be implemented or removed.

  4. run_yaml_batch_preprocessing is a trivial wrapper. It just calls preprocess_from_yaml_configs and converts bool → int. It doesn't justify being a separate function.

  5. Deferred import in load_config_from_file (line 487). The codebase convention is imports at the top of the file.

  6. --verbose defaults to True with store_true. This means --verbose is always true unless --quiet is passed. The flag is a no-op — you can never not be verbose except via --quiet. Confusing UX.

  7. The --smooth-trajectories flag accepts string choices ["true", "false", "True", "False"] then parses them manually. A BooleanOptionalAction or simply --smooth-trajectories / --no-smooth-trajectories would be cleaner.

  8. The whole file is ~720 lines for what amounts to "call UnifiedDataset(...) with the right args." The real work is in trajdata. Most of the complexity here is plumbing config between two input formats (CLI args vs. YAML) and two modes (basic vs. YAML batch), all through a lossy dict intermediate. Simplifying the config flow (just use the dataclass end-to-end) would cut this significantly.
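Points 6 and 7 above can be addressed with standard argparse machinery. A small sketch (argparse.BooleanOptionalAction requires Python 3.9+; flag names follow the review, not the PR's actual parser):

```python
import argparse

# --verbose defaults to False, so the flag actually toggles something.
# BooleanOptionalAction generates --smooth-trajectories and
# --no-smooth-trajectories automatically, with no manual string parsing.
parser = argparse.ArgumentParser()
parser.add_argument("--verbose", action="store_true", default=False)
parser.add_argument(
    "--smooth-trajectories",
    action=argparse.BooleanOptionalAction,
    default=True,
)

args = parser.parse_args(["--no-smooth-trajectories"])
```

With this, `--smooth-trajectories=true` style string choices disappear entirely, and the default is still smoothing on.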

# artifacts = {data_source.scene_id: data_source}
"""

from __future__ import annotations
Collaborator


Claude comments:

  1. The map property is a 250-line monster (lines 670-999). It has two completely different code paths (scene.map_data vs. dataset._map_api) with duplicated transformation logic, followed by ~100 lines of post-transform verification and data-type fixups. This should be broken into separate methods. The transform logic itself (transform_map_points) is defined as a nested closure, which makes it harder to test.

  2. Excessive defensive hasattr checks everywhere. The code is littered with hasattr(lane, "center"), hasattr(lane.center, "points"), hasattr(agent.extent, "length"), etc. If these are typed trajdata objects, these checks shouldn't be necessary. If the schema is genuinely unstable, that's a bigger problem. This reads like the author wasn't sure what the trajdata API actually guarantees.

  [mwatson]: this might actually be expected -- I recall that trajdata has a lot of optional stuff. At the same time, we might do better to assume we have what we need and catch exceptions when fields don't exist? I'd like to hear your thoughts on this.

  3. The _extract_agent_trajectory state accessor pattern (lines 305-330) is wild. It does state.get_attr("x") if hasattr(state, "get_attr") else state.x, then checks if the result is scalar, wraps it in an array, then immediately unwraps it with float(x[0] if x.ndim > 0 else x). This suggests uncertainty about what get_raw_state returns — it should be pinned down once and handled simply.

  4. n/a

  5. Coordinate transformation is copy-pasted three times. Ego trajectory (line 424), traffic objects (line 609), and map (lines 805-838) all apply positions + translation. This should be a single utility function.

  6. from_agent_batch is dead code. It calls _load_from_batch which immediately raises NotImplementedError. This should be removed — it's not a stub, it's misleading.

  7. The smoothing code in traffic_objects (lines 618-642) has a deferred import of csaps inside a loop body. This violates the codebase convention (imports at top of file), and the import happens once per traffic object if smoothing is enabled.

  8. metadata generates random UUIDs and datetime.now() (lines 1038-1039). This makes it non-deterministic — running the same scene twice produces different metadata. dataset_hash being a random UUID defeats the purpose of a hash.

  9. Implicit ordering dependency. traffic_objects silently forces self.rig to be loaded first (line 605) to get world_to_nre. map does the same (lines 691-693, 766-768). This coupling is hidden — if someone accesses map or traffic_objects before rig, it triggers a chain of lazy loading with side effects. The dependency should be explicit (e.g., require world_to_nre as a constructor argument, or compute it once eagerly).

  10. _scene_id as a mutable dataclass field with a property setter (lines 82-187) is odd for what should be an immutable identifier. The scene_id property has three resolution paths (the field, _scene.name, or raise). This complexity isn't justified.
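For point 8, a deterministic metadata hash is straightforward: derive it from the inputs instead of a random UUID. A sketch under assumed inputs (the field names here are illustrative, not taken from the PR):

```python
import hashlib
import json


def deterministic_dataset_hash(scene_ids: list[str], config: dict) -> str:
    # Hash the sorted scene ids plus the relevant config, so running
    # the same scene twice yields the same metadata hash, and any
    # change to the inputs changes it.
    payload = json.dumps(
        {"scenes": sorted(scene_ids), "config": config}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()


h = deterministic_dataset_hash(["clip-002", "clip-001"], {"desired_dt": 0.1})
```

Unlike a random UUID, this actually behaves like a hash: equal inputs give equal values, which is what makes cache validation and reproducibility checks possible.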




@runtime_checkable
class SceneDataSource(Protocol):
Collaborator


I feel like there should probably be a tie between this and the existing Artifact. For instance, should we indicate that Artifact implements this protocol, or just keep with the current duck typed approach?

Author


Good question about the relationship between Artifact and SceneDataSource!

Design Intent:
Our goal is to unify all data sources (USDZ, NuPlan, Waymo) through trajdata, using the SceneDataSource protocol as the user-facing interface.

Current Architecture:

  • SceneDataSource - Protocol defining the unified interface for Runtime
  • TrajdataDataSource - Implements SceneDataSource, provides unified access to all data formats
  • Artifact - Internal utility for reading USDZ files, used by trajdata during USDZ data conversion

Why Artifact Still Exists:
Artifact is still needed as an internal component. When trajdata processes USDZ data, it uses Artifact under the hood to read the USDZ file format.

In terms of protocols:

  • Artifact and SceneDataSource operate at different abstraction levels
  • Artifact is an implementation detail (USDZ file reader)
  • SceneDataSource is the public interface (unified data access)

What do you think of this design?

Collaborator


I like the idea of a more generic interface/protocol with different backends. However, Artifact already structurally satisfies the SceneDataSource protocol. So it feels like a better approach might be to make that relationship explicit and have Artifact formally implement SceneDataSource. Then, we could look at the existing consumers of Artifact and see if they can be switched over to SceneDataSource so that we could more easily reuse the code for the new sources.
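The "structurally satisfies" point can be illustrated with a toy Protocol. This is a minimal sketch, not the real SceneDataSource (which per the PR also exposes rig, traffic_objects, map, and metadata) or the real Artifact class:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SceneDataSource(Protocol):
    # One representative member; the real protocol has more.
    @property
    def scene_id(self) -> str: ...


class Artifact:
    """Stand-in for the existing USDZ reader."""

    def __init__(self, scene_id: str) -> None:
        self._scene_id = scene_id

    @property
    def scene_id(self) -> str:
        return self._scene_id


# Structural typing: Artifact satisfies the protocol without inheriting
# from it. Writing `class Artifact(SceneDataSource):` would add nothing
# at runtime but makes the relationship explicit for readers and lets
# type checkers flag Artifact if the protocol later grows.
src: SceneDataSource = Artifact("clip-001")
```

Either way the duck typing works; the explicit declaration is purely a documentation and tooling choice, which is what the comment above is weighing.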

Collaborator


Additionally, I tried running this PR locally and got a string of errors running the basic tutorial command (uv run alpasim_wizard deploy=local topology=1gpu driver=vavam wizard.log_dir=$PWD/tutorial). Can you try running it locally and see if it works for you?

Author


Thanks for testing locally. The root cause was a mismatch between base_config.yaml and the DataSourceConfig schema — usdz/nuplan were direct keys under data_source, but the dataclass expected them under sources. This caused OmegaConf to fail at config parse time before anything else ran. I've already fixed these bugs.

One note on trajdata: the project currently pins trajdata-alpasim from the alpasim branch on GitHub. I've made some improvements to that branch (including USDZ support enhancements) and have a separate PR open for it. Running this PR requires that version of trajdata — I'll make sure to link the trajdata PR here so they can be reviewed together. NVlabs/trajdata#62

The latest version works locally for me.
Prepare Data: [screenshot]
Test Runtime: [screenshot]

Would appreciate it if you could give it a spin on your end. (◕‿◕)

WCJ-BERT added a commit to WCJ-BERT/alpasim that referenced this pull request Apr 2, 2026
Address reviewer comments 3, 5, 6, 7, 9 on PR NVlabs#56.

Changes:
- Restructure DataSourceConfig to hierarchical format (Comment 3)
  - Separate common settings from source-specific configuration
  - Add USDZSourceConfig and NuPlanSourceConfig dataclasses
  - Add to_trajdata_params() conversion method
  - Makes it clear which parameters affect which data sources

- Make data_source required via Hydra MISSING (Comment 7)
  - Change from Optional[...] = None to MISSING
  - Leverage Hydra's built-in validation instead of runtime checks
  - Provides better error messages and fails fast

- Use dynamic sceneset_path instead of hardcoded path (Comment 9)
  - Preserves wizard's ability to create temporary sceneset directories
  - Add sceneset_path to scenes config with sensible default

- Remove unused configuration (Comments 5, 6)
  - Remove --usdz-glob command line argument
  - Remove artifact_cache_size config field

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
WCJ-BERT added a commit to WCJ-BERT/alpasim that referenced this pull request Apr 2, 2026
Address reviewer comments 2, 4, 10, 12, 13 on PR NVlabs#56.

Changes:
- Remove unnecessary temporary variables (Comment 4)
  - Use data_source.rig.trajectory.time_range_us directly
  - Reduces cognitive load by eliminating intermediate variables

- Remove trajdata availability checks (Comment 10)
  - trajdata is a required dependency in pyproject.toml
  - Trust dependency management instead of runtime try-except guards
  - Removes TRAJDATA_AVAILABLE checks from prepare_data and runtime_context

- Improve documentation clarity (Comments 2, 12, 13)
  - Remove 'Key Dependencies' section from ONBOARDING.md (Comment 12)
  - Clarify data preparation is optional and automatic in TUTORIAL.md (Comment 13)
  - Remove misleading comment about desired_dt matching control_timestep_us (Comment 2)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
WCJ-BERT added a commit to WCJ-BERT/alpasim that referenced this pull request Apr 2, 2026
Extract scene data loading logic from DaemonEngine into dedicated
SceneLoader class to improve separation of concerns and reduce
cognitive load.

Changes:
- Add SceneLoader class with lazy loading and caching
- Replace _dataset, _scene_id_to_idx, _scene_id_to_data_source
  with single _scene_loader field in DaemonEngine
- Update tests to mock SceneLoader instead of 3 separate fields

Addresses reviewer comment 14 on PR NVlabs#56.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@WCJ-BERT force-pushed the adapt_to_new_runtime branch from 7b393fb to f2f1ced (April 10, 2026 07:43)
Comment thread docs/TUTORIAL.md
uv run python -m alpasim_runtime.prepare_data \
--desired-data=usdz \
--data-dir=./data/nre-artifacts/all-usdzs \
--cache-location=./cache/trajdata_usdz \
Collaborator


Sorry for not catching this earlier. One concern I have is that we can have two artifacts with the same clip id (which is, I think, how the caching is keyed) but with different versions. The way we get around this in the existing approach is to use a UUID for the usdz filename plus a scene database (really a CSV) to disambiguate.

Author


Thanks for raising this — you're right that trajdata uses the internal clip ID as the cache key, not the UUID.

If we want a stronger guarantee, we could incorporate the nre_version_string (e.g. 25.7.8-fc8b0551) into the cache_location path to fully isolate per-version caches. What do you think of this solution?

Author


Otherwise, I have a similar proposal. If we add ${scenes.sceneset_path} to base_config.yaml, would that help distinguish different versions of the same usdz? I think sceneset_path is the MD5 fingerprint of all UUIDs (sorted) selected for the run — so any version change in a scene produces a different UUID, which changes the hash, which routes to a new isolated cache directory.

runtime:
  # nr_workers and endpoints.*.n_concurrent_rollouts are set by topology configs.

  # Unified data source configuration using trajdata
  # Common settings apply to all sources; source-specific config lives under sources:
  data_source:
    # Common configuration (applies to all data sources)
    cache_location: "${defines.trajdata_cache}/${scenes.sceneset_path}"  # Per-sceneset cache, isolates different scene versions
    desired_dt: 0.1        # 10 Hz sampling rate for trajectories
    incl_vector_map: true  # Include vector map data (roads, lanes, etc.)
    rebuild_cache: false   # Set to true to force rebuild cache
    rebuild_maps: false    # Set to true to force rebuild maps
    num_workers: 4         # Parallel workers for cache creation

    sources:
      # USDZ data source configuration (NuRec artifacts)
      usdz:
        enabled: true
        # Use wizard's dynamic sceneset path. The wizard creates a sceneset directory
        # based on selected scenes and sets sceneset_path at runtime.
        # For manual runtime usage without wizard, set sceneset_path in scenes config.
        data_dir: "${scenes.scene_cache}/${scenes.sceneset_path}"
        extra_params:
          asset_base_path: null  # Optional: Base path for MTGS rendering assets
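The fingerprinting idea above can be sketched in a few lines. This assumes sceneset_path really is an MD5 over the sorted scene UUIDs, as described; verify against the wizard source before relying on it.

```python
import hashlib


def sceneset_fingerprint(scene_uuids: list[str]) -> str:
    # MD5 over the sorted scene UUIDs: a version change in any scene
    # (which means a new UUID) changes the digest, routing the run to
    # a fresh, isolated cache directory.
    return hashlib.md5("\n".join(sorted(scene_uuids)).encode()).hexdigest()


# Hypothetical cache layout mirroring ${defines.trajdata_cache}/${scenes.sceneset_path}
cache_location = f"/cache/trajdata/{sceneset_fingerprint(['uuid-b', 'uuid-a'])}"
```

Sorting makes the fingerprint order-independent, so the same set of scenes always maps to the same cache directory regardless of selection order.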

WCJ-BERT and others added 17 commits April 24, 2026 09:16
Introduce SceneDataSource protocol to abstract scene data loading,
enabling support for multiple data sources (Artifact, trajdata, etc).

Add DataSourceConfig to runtime config for unified data source
configuration via trajdata's UnifiedDataset.

Changes:
- Add SceneDataSource protocol with standard interface (rig, traffic_objects, map, metadata)
- Add DataSourceConfig with trajdata UnifiedDataset parameters
- Support for both USDZ files and standard trajdata datasets
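A protocol with the four members listed above might look roughly like this sketch. The member types are placeholders, not AlpaSim's actual signatures, and `InMemorySource` is a toy backend invented for illustration:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class SceneDataSource(Protocol):
    """Sketch of the protocol: every backend exposes the same four views."""

    @property
    def rig(self) -> Any: ...             # ego rig: trajectory + calibration

    @property
    def traffic_objects(self) -> Any: ...  # other agents' tracks

    @property
    def map(self) -> Any: ...              # vector map in local coordinates

    @property
    def metadata(self) -> Any: ...         # scene id, timing, provenance

class InMemorySource:
    """Toy backend satisfying the protocol structurally (no inheritance)."""
    def __init__(self, scene: dict):
        self._scene = scene
    @property
    def rig(self): return self._scene["rig"]
    @property
    def traffic_objects(self): return self._scene["traffic_objects"]
    @property
    def map(self): return self._scene["map"]
    @property
    def metadata(self): return self._scene["metadata"]

src = InMemorySource({"rig": 1, "traffic_objects": [], "map": None, "metadata": {}})
assert isinstance(src, SceneDataSource)  # structural check via runtime_checkable
```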
Implement TrajdataDataSource as a SceneDataSource that loads data
directly from trajdata datasets without requiring USDZ conversion.

Features:
- Lazy loading of rig, traffic_objects, map, and metadata
- Pre-created scene_cache to avoid pickle errors in multiprocessing
- Support for coordinate frame transformations (world to local NRE)
- Trajectory smoothing with cubic splines
- Camera calibration extraction from scene metadata
- Map loading and transformation to local coordinates

Benefits:
- Eliminate USDZ conversion overhead
- Reduce startup time with on-demand loading
- Lower memory usage per worker
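The lazy-loading behavior this commit describes can be illustrated with `functools.cached_property`: each heavy field is computed on first access and then cached. This is a simplified stand-in; the real class reads from trajdata rather than a counter:

```python
from functools import cached_property

class LazyScene:
    """Each heavy field is computed once, on first access, then cached."""

    def __init__(self, scene_id: str):
        self.scene_id = scene_id
        self.load_calls = 0  # instrumentation for this example only

    def _expensive_load(self, what: str) -> str:
        self.load_calls += 1
        return f"{what} for {self.scene_id}"

    @cached_property
    def rig(self) -> str:
        return self._expensive_load("rig")

    @cached_property
    def map(self) -> str:
        return self._expensive_load("map")

scene = LazyScene("scene-001")
assert scene.load_calls == 0  # nothing loaded yet
_ = scene.rig
_ = scene.rig                 # second access hits the cache
assert scene.load_calls == 1
_ = scene.map
assert scene.load_calls == 2
```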
Replace heavy scene_id_to_data_source dict with lightweight
scene_id_to_idx mapping in RuntimeContext. This enables efficient
serialization and reduces memory overhead when passing context
to worker processes.

Changes:
- RuntimeContext.scene_id_to_data_source → scene_id_to_idx
- Build scene_id to trajdata index mapping once at startup
- Workers can reconstruct data sources on-demand using the mapping

Benefits:
- Lightweight RuntimeContext (dict[str, int] vs dict[str, DataSource])
- Fully serializable with pickle for multiprocessing
- Avoids duplicating heavy data objects across workers
- Aligns with trajdata's index-based API

BREAKING CHANGE: RuntimeContext.scene_id_to_data_source field
replaced with scene_id_to_idx. Code accessing the old field must
be updated to use the new scene_id_to_idx mapping and load data
sources on-demand.
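Why a `dict[str, int]` pickles cheaply while a dict of full data sources does not can be shown directly. The sizes below are illustrative, and the byte blobs merely stand in for heavy `SceneDataSource` objects:

```python
import pickle
from dataclasses import dataclass, field

@dataclass
class RuntimeContextSketch:
    """Sketch of the slimmed-down context: scene ids map to trajdata indices."""
    scene_id_to_idx: dict = field(default_factory=dict)

ctx = RuntimeContextSketch(
    scene_id_to_idx={f"scene-{i:04d}": i for i in range(1000)}
)
light = pickle.dumps(ctx)

# Stand-in for the old design: every scene carries a heavy payload.
heavy = pickle.dumps({f"scene-{i:04d}": bytes(10_000) for i in range(1000)})

# The index mapping is far cheaper to ship to worker processes.
assert len(light) < len(heavy)
```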
Adapt DaemonEngine to work with new RuntimeContext and support
on-demand scene data source loading from trajdata.

Changes:
- Store scene_id_to_idx mapping from RuntimeContext
- Create UnifiedDataset at engine startup for scene loading
- Add _get_data_source() method for lazy loading with caching
- Update build_pending_jobs_from_request to use callback pattern
- Pre-create scene_cache to avoid pickle errors

This enables daemon mode to efficiently handle multiple scenes
without loading all data upfront.
Add command-line tool for preprocessing scene data and building
trajdata cache before running simulations.

Features:
- Basic mode: preprocess all scenes in a dataset
- YAML config mode: batch process specific scenes from YAML files
- Central token mode: process scenes around specific tokens (NuPlan)
- Support for smooth_trajectories parameter
- Configurable cache rebuilding and vector map inclusion

Usage:
  # Basic preprocessing
  python -m alpasim_runtime.prepare_data --user-config user.yaml

  # With explicit parameters
  python -m alpasim_runtime.prepare_data \
    --desired-data nuplan_test \
    --data-dir /path/to/data \
    --cache-location /tmp/cache

This preprocessing step improves simulation startup time by
pre-building the trajdata cache.
Update worker processes and simulation entry points to use the
new SceneDataSource abstraction instead of direct Artifact access.

Changes:
- Worker IPC: PendingRolloutJob and AssignedRolloutJob use SceneDataSource
- Worker main: Pass data_source from job to rollout execution
- UnboundRollout: Accept SceneDataSource parameter
- Simulate CLI: Use RuntimeContext.scene_id_to_idx for scene lookup

This completes the migration from Artifact-based to SceneDataSource-based
data loading, enabling support for multiple data source backends.
Follow CONTRIBUTING.md naming conventions for coordinate frames.

Changes in trajdata_data_source.py:
- poses_vec3 → positions_agent_world
- poses_quat → quaternions_agent_world
- first_pose_position → position_ego_first_world
- first_pose_local → position_ego_first_local

These changes improve code readability by making coordinate frames
explicit in variable names, following the position_{what}_{frame}
naming pattern required by CONTRIBUTING.md.
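A small illustration of the position_{what}_{frame} pattern; the coordinate values and helper below are made up for the example. The frame suffix makes conversions self-documenting, since a mismatched assignment reads wrong on sight:

```python
# Ego's first pose, expressed in the world frame (values are illustrative).
position_ego_first_world = (4012.3, -77.9)

# The local frame is anchored at the first pose, so its origin in world
# coordinates is that same point.
origin_world = position_ego_first_world

def world_to_local(position_world, origin_world):
    """Translate a world-frame position into the local frame."""
    return (position_world[0] - origin_world[0],
            position_world[1] - origin_world[1])

position_ego_first_local = world_to_local(position_ego_first_world, origin_world)
assert position_ego_first_local == (0.0, 0.0)  # first pose is the local origin
```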
Update all runtime tests to work with the new trajdata-based data flow:

- test_daemon_engine: Replace scene_id_to_artifact_path with scene_id_to_idx
  and mock scene_id_to_data_source cache in engine tests
- test_daemon_main: Remove usdz_glob parameter from DaemonEngine construction
- test_daemon_request_plumbing: Update build_pending_jobs_from_request tests
  to use get_data_source callable instead of scene_id_to_artifact_path dict
- test_config: Add unit tests for new DataSourceConfig class
- test_trajdata_integration: New integration tests for trajdata data source
  functionality, including caching, error handling, and job creation
- test_runtime_integration_replay: Add TODO note for future update (this
  manual integration test needs significant rework for new data flow)

All tests now reflect the shift from USDZ artifacts to lazy-loaded
SceneDataSource instances via trajdata UnifiedDataset.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update documentation and wizard configuration to support the new
trajdata-based data flow:

Documentation updates:
- README: Add Quick Start section explaining data preparation
- ONBOARDING: Document trajdata dependency (alpasim branch)
- TUTORIAL: Add Data Preparation section with prepare_data examples
- TUTORIAL: Fix deprecated --usdz-glob references in debug examples

Wizard configuration:
- Add defines.trajdata_cache for unified cache location
- Add runtime.data_source with USDZ defaults
- Configure recursive scanning of all-usdzs directory
- Set sensible defaults: 10Hz, vector maps, 4 workers

This ensures users understand the data preparation workflow and
wizard-generated configs work without manual edits.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Address reviewer feedback: new files added in this PR should use
copyright year 2026 only, not 2025-2026.

Changed files:
- src/runtime/alpasim_runtime/prepare_data/__init__.py
- src/runtime/alpasim_runtime/prepare_data/__main__.py
- src/utils/alpasim_utils/scene_data_source.py
- src/utils/alpasim_utils/trajdata_data_source.py
Address reviewer comments 3, 5, 6, 7, 9 on PR NVlabs#56.

Changes:
- Restructure DataSourceConfig to hierarchical format (Comment 3)
  - Separate common settings from source-specific configuration
  - Add USDZSourceConfig and NuPlanSourceConfig dataclasses
  - Add to_trajdata_params() conversion method
  - Makes it clear which parameters affect which data sources

- Make data_source required via Hydra MISSING (Comment 7)
  - Change from Optional[...] = None to MISSING
  - Leverage Hydra's built-in validation instead of runtime checks
  - Provides better error messages and fails fast

- Use dynamic sceneset_path instead of hardcoded path (Comment 9)
  - Preserves wizard's ability to create temporary sceneset directories
  - Add sceneset_path to scenes config with sensible default

- Remove unused configuration (Comments 5, 6)
  - Remove --usdz-glob command line argument
  - Remove artifact_cache_size config field

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
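The hierarchical split this commit describes could look roughly like the sketch below. Field names beyond those mentioned in the commit message, and the shape of `to_trajdata_params()`, are guesses for illustration:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class USDZSourceConfig:
    enabled: bool = True
    data_dir: str = ""
    extra_params: dict = field(default_factory=dict)

@dataclass
class NuPlanSourceConfig:
    enabled: bool = False
    data_dir: str = ""

@dataclass
class DataSourceConfig:
    # Common settings shared by every source.
    cache_location: str = ""
    desired_dt: float = 0.1
    incl_vector_map: bool = True
    # Source-specific settings live one level down, so it is obvious
    # which parameters affect which backend.
    usdz: USDZSourceConfig = field(default_factory=USDZSourceConfig)
    nuplan: NuPlanSourceConfig = field(default_factory=NuPlanSourceConfig)

    def to_trajdata_params(self) -> dict:
        """Flatten the hierarchy into keyword arguments for dataset creation."""
        params: dict = {
            "cache_location": self.cache_location,
            "desired_dt": self.desired_dt,
            "incl_vector_map": self.incl_vector_map,
        }
        data_dirs: dict = {}
        if self.usdz.enabled:
            data_dirs["usdz"] = self.usdz.data_dir
        if self.nuplan.enabled:
            data_dirs["nuplan"] = self.nuplan.data_dir
        params["data_dirs"] = data_dirs
        return params

cfg = DataSourceConfig(cache_location="/tmp/cache",
                       usdz=USDZSourceConfig(data_dir="/data/usdz"))
assert cfg.to_trajdata_params()["data_dirs"] == {"usdz": "/data/usdz"}
```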
Address reviewer comments 2, 4, 10, 12, 13 on PR NVlabs#56.

Changes:
- Remove unnecessary temporary variables (Comment 4)
  - Use data_source.rig.trajectory.time_range_us directly
  - Reduces cognitive load by eliminating intermediate variables

- Remove trajdata availability checks (Comment 10)
  - trajdata is a required dependency in pyproject.toml
  - Trust dependency management instead of runtime try-except guards
  - Removes TRAJDATA_AVAILABLE checks from prepare_data and runtime_context

- Improve documentation clarity (Comments 2, 12, 13)
  - Remove 'Key Dependencies' section from ONBOARDING.md (Comment 12)
  - Clarify data preparation is optional and automatic in TUTORIAL.md (Comment 13)
  - Remove misleading comment about desired_dt matching control_timestep_us (Comment 2)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Address reviewer feedback (Comment 11): UnifiedDataset was being
created twice - once in build_runtime_context() and once in
DaemonEngine.startup(). This was wasteful as dataset creation
can be slow (directory scanning, cache loading).

Changes:
- Add dataset field to RuntimeContext
- DaemonEngine now uses runtime_context.dataset directly
- Remove duplicate UnifiedDataset creation from engine startup

The dataset is only used in the main process (not serialized to
workers), so adding it to RuntimeContext is safe and eliminates
the duplication overhead.
Extract scene data loading logic from DaemonEngine into dedicated
SceneLoader class to improve separation of concerns and reduce
cognitive load.

Changes:
- Add SceneLoader class with lazy loading and caching
- Replace _dataset, _scene_id_to_idx, _scene_id_to_data_source
  with single _scene_loader field in DaemonEngine
- Update tests to mock SceneLoader instead of 3 separate fields

Addresses reviewer comment 14 on PR NVlabs#56.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
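A minimal sketch of the SceneLoader contract described above: one object owning the dataset handle, the id-to-index mapping, and a per-scene cache. The dataset and data-source types are stand-ins, not the actual classes:

```python
class SceneLoader:
    """Owns the dataset, the id->index mapping, and a per-scene cache,
    so the engine holds one field instead of three."""

    def __init__(self, dataset, scene_id_to_idx: dict):
        self._dataset = dataset
        self._scene_id_to_idx = scene_id_to_idx
        self._cache: dict = {}

    def get(self, scene_id: str):
        if scene_id not in self._scene_id_to_idx:
            raise KeyError(f"unknown scene id: {scene_id}")
        if scene_id not in self._cache:
            idx = self._scene_id_to_idx[scene_id]
            # Stands in for constructing a TrajdataDataSource from the dataset.
            self._cache[scene_id] = self._dataset[idx]
        return self._cache[scene_id]

loader = SceneLoader(dataset=["s0-data", "s1-data"],
                     scene_id_to_idx={"s0": 0, "s1": 1})
assert loader.get("s1") == "s1-data"
assert loader.get("s1") is loader.get("s1")  # cached on repeat access
```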
Major improvements:
- Unify preprocess_basic and preprocess_from_yaml_configs into a single function
- Fix to_trajdata_params() to handle multiple datasets via dataset_kwargs
- Separate user-config and CLI paths with clear validation
- Extract process_nuplan_yaml_configs() for better modularity
- Simplify code structure and remove duplication

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
Resolves all 10 code quality issues raised in PR review:

1. Decompose map property (330→30 lines, -90%)
   - Extract helper methods: _transform_map_points, _apply_coordinate_transform_to_map
   - Split loading paths: _load_map_from_scene_data, _load_map_from_dataset_api
   - Separate concerns: transform, verify, fix datatypes

2. Reduce defensive hasattr checks (18→1, -94%)
   - Remove checks for guaranteed attributes based on trajdata source code
   - Keep only dynamic attribute: scene.map_data
   - Fix attribute naming: left_boundary/right_boundary → left_edge/right_edge

3. Simplify state accessor pattern
   - Add _get_state_value() helper to centralize state extraction
   - Remove unnecessary hasattr and array wrapping/unwrapping
   - Use StateArray.get_attr() directly (always available)

5. Remove coordinate transformation duplication
   - Inline simple translations (positions + translation)
   - Keep map-specific logic in _transform_map_points

6. Remove dead code
   - Delete from_agent_batch and _load_from_batch (NotImplementedError stub)

7. Move csaps import to top
   - Remove deferred import from loop body
   - Make trajdata and csaps required dependencies (remove try-except)

8. Fix non-deterministic metadata
   - Use SHA256 hash instead of random UUID for dataset_hash
   - Use fixed training_date instead of datetime.now()

9. Explicit dependency management
   - Add _ensure_rig_loaded() method with clear naming
   - Document property loading order in class docstring
   - Add comments at call sites

10. Simplify scene_id property
    - Remove unused setter (scene_id should be immutable)
    - Remove fallback paths (guaranteed at construction)
    - 12 lines → 3 lines

Additional improvements:
- Fix circular import: create daemon/exceptions.py module
- Fix OmegaConf access: add OmegaConf.to_object() conversions
- Fix asset_base_path logic: map by dataset name to avoid conflicts
- Remove has_rotation checks: world_to_nre only contains translation
- Pass map_api to workers: enable all datasets to load maps lazily
- Improve SceneLoader: build asset_base_path_map at init

Code quality metrics:
- Lines: 1054 → 925 (-12.2%)
- hasattr: 18 → 1 (-94%)
- map property: 330 → 30 lines (-90%)
- All pre-commit checks passing

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
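The determinism fix in point 8 can be sketched as follows; the input fields fed into the hash are illustrative:

```python
import hashlib

def dataset_hash(scene_id: str, dataset_name: str) -> str:
    """Deterministic: the same scene always hashes to the same id,
    unlike uuid.uuid4(), which changed on every run."""
    payload = f"{dataset_name}/{scene_id}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Rebuilding metadata no longer churns identifiers between runs.
assert dataset_hash("scene-001", "nuplan") == dataset_hash("scene-001", "nuplan")
assert dataset_hash("scene-001", "nuplan") != dataset_hash("scene-002", "nuplan")
```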
@WCJ-BERT WCJ-BERT force-pushed the adapt_to_new_runtime branch from f2f1ced to 3dbe2c8 Compare April 24, 2026 12:21