alexnorell commented Nov 14, 2025

Description

Optimizes the Jetson 6.2.0 Docker image by switching from the full l4t-jetpack base (~14 GB) to the minimal l4t-cuda:12.6.11-runtime base (~8 GB), achieving a ~40% size reduction while upgrading the CUDA version and maintaining full functionality.

Key Improvements

Image Optimization:

  • 41.7% smaller: 14.2 GB → 8.28 GB (5.92 GB savings)
  • l4t-jetpack → l4t-cuda: Eliminates unnecessary JetPack SDK components (VPI, multimedia APIs, GStreamer)
  • CUDA 12.6.11: Upgraded from 12.2 (matches JetPack 6.2 official version)
  • 2-stage build: JetPack builder for compilation tools + minimal CUDA runtime for deployment

Software Stack:

  • onnxruntime-gpu 1.20.0 (compiled with CUDA 12.6 + TensorRT support; see the build sketch after this list)
  • PyTorch 2.8.0 from jetson-ai-lab.io
  • NumPy 1.26.4 (Jetson PyTorch compatibility)
  • CMake 3.31.10 (parameterized build arg)
  • GDAL 3.11.5 (compiled from source)
  • cuDNN 9.3 + TensorRT with FP16 acceleration
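
The onnxruntime-gpu build noted above follows the usual Jetson recipe. A hedged sketch of the builder-stage invocation (the Dockerfile's exact flags may differ):

  ./build.sh --config Release --update --build --build_wheel --skip_tests \
      --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \
      --use_tensorrt --tensorrt_home /usr/lib/aarch64-linux-gnu \
      --parallel 12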

Performance:

  • TensorRT execution provider enabled by default
  • FP16 precision for faster inference
  • Engine caching for instant subsequent runs
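
These defaults can be expressed as environment variables in the runtime stage. A hedged sketch, assuming inference's ONNXRUNTIME_EXECUTION_PROVIDERS setting and onnxruntime's ORT_TENSORRT_* options, with the cache path matching the /tmp volume described under deployment considerations (the Dockerfile's actual mechanism may differ):

  ENV ONNXRUNTIME_EXECUTION_PROVIDERS="[TensorrtExecutionProvider,CUDAExecutionProvider,CPUExecutionProvider]"
  ENV ORT_TENSORRT_FP16_ENABLE=1
  ENV ORT_TENSORRT_ENGINE_CACHE_ENABLE=1
  ENV ORT_TENSORRT_CACHE_PATH=/tmp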

Benchmark Results

RF-DETR Base on Jetson AGX Orin with TensorRT:

  • 62.2 FPS @ 16.0ms average latency
  • 0% error rate (1000/1000 successful inferences)
  • ±1.1ms standard deviation (very consistent)
  • Percentiles: P50=16.3ms, P75=16.6ms, P90=18.3ms, P99=18.6ms

Test config: rfdetr-base (29M params), COCO dataset, batch_size=1, 560x560 input, TensorRT FP16

Command:

inference benchmark python-package-speed -m rfdetr-base -d coco -bi 1000
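
The benchmark runs inside the container; assuming the inference CLI is on PATH there (which the Python symlink added in this PR is meant to ensure), it can be invoked against a running container like this (container name illustrative):

  docker exec -it inference-server \
      inference benchmark python-package-speed -m rfdetr-base -d coco -bi 1000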

Technical Details

Why l4t-cuda instead of l4t-jetpack:

  • l4t-jetpack (14.2 GB): Full JetPack SDK including VPI, multimedia codecs, GStreamer, samples, and development tools
  • l4t-cuda (8.28 GB final): Just CUDA runtime + extracted essentials (cuDNN, TensorRT libs) from JetPack
  • Result: Faster downloads, less storage, cleaner dependency management, newer CUDA

Multi-stage build:

  1. Builder uses l4t-jetpack:r36.4.0 for compilation (CUDA dev tools, nvcc)
  2. Runtime uses l4t-cuda:12.6.11-runtime with only necessary libs copied from builder
  3. Extracts cuDNN 9.3 and TensorRT from JetPack for PyTorch compatibility
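
A condensed sketch of that structure (stage names, library paths, and wheel location are illustrative, not the literal Dockerfile):

  # Stage 1: full JetPack toolchain for compiling wheels (onnxruntime-gpu, GDAL, ...)
  FROM nvcr.io/nvidia/l4t-jetpack:r36.4.0 AS builder
  # ... install CMake, build wheels into /wheels ...

  # Stage 2: minimal CUDA runtime for deployment
  FROM nvcr.io/nvidia/l4t-cuda:12.6.11-runtime
  # copy only the runtime libraries and prebuilt wheels out of the builder
  COPY --from=builder /usr/lib/aarch64-linux-gnu/libcudnn*.so*   /usr/lib/aarch64-linux-gnu/
  COPY --from=builder /usr/lib/aarch64-linux-gnu/libnvinfer*.so* /usr/lib/aarch64-linux-gnu/
  COPY --from=builder /wheels /wheels
  RUN uv pip install --system /wheels/*.whl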

Dependency Management:
Created 5 Jetson-specific requirements files to avoid numpy/torch version conflicts:

  • _requirements.jetson.txt - Core deps without numpy
  • requirements.jetson.6.2.0.txt - Platform deps with numpy<2.0.0
  • requirements.transformers.jetson.txt - Transformers without torch
  • requirements.sam.jetson.txt - SAM without torch
  • requirements.sdk.http.jetson.txt - SDK without numpy
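
A hedged sketch of the corresponding install step (flags and paths illustrative; the real Dockerfile may split this across layers):

  RUN uv pip install --system \
        -r requirements/_requirements.jetson.txt \
        -r requirements/requirements.jetson.6.2.0.txt \
        -r requirements/requirements.transformers.jetson.txt \
        -r requirements/requirements.sam.jetson.txt \
        -r requirements/requirements.sdk.http.jetson.txt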

Why numpy<2.0.0: The Jetson PyTorch 2.8.0 wheels are compiled against the numpy 1.x C-API (numpy 2.0 broke ABI compatibility roughly 17 months ago, and the Jetson wheels haven't been updated yet).
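
Because a transitive dependency can still pull numpy 2.x back in, the pin is re-applied explicitly after everything else is installed, e.g. (illustrative):

  RUN uv pip install --system "numpy>=1.26.4,<2.0.0"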

Type of change

  • Performance improvement (reduces image size, faster inference)
  • This change modifies the Jetson 6.2.0 Dockerfile

How has this change been tested?

Build: Successfully built on Jetson AGX Orin (~40 min full build)
Runtime: Container runs successfully, all imports working, GPU acceleration active
Benchmark: RF-DETR 62.2 FPS with TensorRT verified on Jetson AGX Orin

Deployment considerations

  • First run: 15+ min for TensorRT engine compilation (cached thereafter)
  • Use --volume ~/.inference/cache:/tmp:rw to persist TensorRT cache
  • MAXN mode recommended for best performance
  • numpy<2.0.0 required for Jetson PyTorch 2.8.0 compatibility
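
For example, a typical run command incorporating the cache volume (the port and remaining flags are the usual inference-server defaults; adjust for your device):

  sudo docker run -d --name inference-server \
      --runtime nvidia -p 9001:9001 \
      --volume ~/.inference/cache:/tmp:rw \
      roboflow/roboflow-inference-server-jetson-6.2.0:latest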

Docs

N/A

alexnorell and others added 2 commits November 17, 2025 13:53
…uction)

Replace full l4t-jetpack base image with lighter l4t-cuda:12.6.11-runtime
for Jetson 6.2.0 inference server deployment. This optimization reduces
image size from 14.2 GB to 8.28 GB (41.7% reduction) while maintaining
full functionality and improving CUDA version to 12.6.11.

Key improvements:
- New Dockerfile using l4t-cuda:12.6.11-runtime as base
- Multi-stage build: JetPack builder + minimal CUDA runtime
- Compiled onnxruntime-gpu with CUDA 12.6 and TensorRT support
- GDAL 3.11.5 compiled from source with Ninja build system
- PyTorch 2.8.0 with CUDA 12.6 support from jetson-ai-lab.io
- TensorRT FP16 acceleration enabled by default
- Python symlink for inference CLI compatibility

Performance:
- RF-DETR Base benchmark: 27.2 FPS @ 36.8ms avg latency
- TensorRT acceleration with FP16 precision
- Zero errors over 1000 inference cycles
- Low latency variance (±1.1ms std dev)

Technical details:
- Extracts cuDNN 9.3 and TensorRT libs from JetPack for compatibility
- Uses uv for fast Python package installation
- CMake 3.30.5 for building extensions
- 12-core parallel builds for onnxruntime compilation

Files changed:
- docker/dockerfiles/Dockerfile.onnx.jetson.6.2.0 (completely rewritten)
- requirements/*.txt (updated dependencies for Jetson 6.2.0)

Generated with Claude Code
Co-Authored-By: Claude <[email protected]>
alexnorell force-pushed the jetson-620-cuda-base-pr branch from 3c7a245 to b06c55d on November 17, 2025 21:53
- Set CMAKE_VERSION, TORCH_VERSION, and TORCHVISION_VERSION as build args
- Use latest CMake 4.1.2
- Simplify all comments throughout Dockerfile
- Create requirements.jetson.6.2.0.txt with Jetson-specific dependencies
- Keep numpy<2.0.0, torch>=2.8.0, torchvision>=0.23.0, flash-attn==2.8.2
- Don't modify shared requirements files to avoid breaking other builds
- Update Dockerfile to use requirements.jetson.6.2.0.txt instead of requirements.jetson.txt
…ments

- Remove requirements.transformers.txt and requirements.sam.txt from uv install
- These files specify torch<2.7.0 which conflicts with Jetson's torch>=2.8.0
- Torch 2.8.0 is already installed from jetson-ai-lab.io before this step
- Fixes build error: 'your requirements are unsatisfiable'
- Create requirements.transformers.jetson.txt without torch/torchvision
- Create requirements.sam.jetson.txt without torch/torchvision/flash-attn
- Update Dockerfile to use Jetson-specific requirements files
- Prevents dependency conflicts with pre-installed Jetson PyTorch 2.8.0
- Create _requirements.jetson.txt without numpy specification
- Update Dockerfile to use _requirements.jetson.txt
- Prevents conflict between numpy<2.0.0 (Jetson) and numpy>=2.0.0 (main)
- Create requirements.sdk.http.jetson.txt without numpy
- Update Dockerfile to use sdk.http.jetson.txt
- CMake 4.1.2 is incompatible with onnxruntime v1.20.0 dependencies
- Revert to CMake 3.30.5 which is known to work
- Use latest CMake 3.x version (3.31.10)
- CMake 4.x incompatible with onnxruntime v1.20.0
- Some dependency is pulling in numpy 2.x despite exclusions
- Explicitly install numpy<2.0.0 after all other packages
- Ensures onnxruntime compiled with numpy 1.x can run
- Install numpy>=2.0.0,<2.3.0 before PyTorch and onnxruntime build
- Remove numpy<2.0.0 constraint from Jetson requirements
- onnxruntime will now be compiled against numpy 2.x headers
- Allows using modern numpy 2.x in production
- Jetson PyTorch 2.8.0 wheels from jetson-ai-lab.io compiled with numpy 1.x
- Cannot use numpy 2.x until Jetson provides updated PyTorch wheels
- Force numpy<2.0.0 after all dependencies to ensure compatibility
This prototype uses l4t-cuda:12.6.11-runtime for 31.5% size reduction while
maintaining full functionality.

Key features:
- 2-stage build: JetPack builder + CUDA runtime
- GDAL 3.11.5, onnxruntime 1.20.0 compiled from source
- cuDNN, TensorRT, CUDA libs copied from JetPack
- TensorRT execution providers configured for ONNX models
- All inference packages built as wheels

Result: 9.73 GB vs 14.2 GB (4.47 GB savings)
alexnorell commented Nov 18, 2025

Reviewed the final image composition to identify optimization opportunities. I'm thinking this is as close as we're going to get without compiling everything from source.

Largest components (all required):

  • 3.31 GB: Python packages (/usr/local/lib/python3.10/dist-packages) - needed for all inference models
  • 3.01 GB: cuDNN libraries - required for PyTorch and TensorRT
  • 1.81 GB: CUDA libraries - required for GPU acceleration
  • 986 MB: TensorRT libraries - required for fast inference
  • 199 MB: Runtime dependencies (apt packages) - minimal set needed

Already optimized:

  • Using minimal l4t-cuda:12.6.11-runtime base (not full JetPack SDK)
  • No development packages in runtime stage
  • Apt cache cleaned (rm -rf /var/lib/apt/lists/*)
  • uv cache cleaned (rm -rf ~/.cache/uv)
  • Multi-stage build (builder artifacts not copied to runtime)
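
The cleanup follows the standard same-layer pattern so deleted caches never persist in an image layer, roughly (package list elided):

  RUN apt-get update \
      && apt-get install -y --no-install-recommends <runtime packages> \
      && rm -rf /var/lib/apt/lists/*

The uv cache is removed the same way, in the same RUN layer as the package install.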

- Default: 12 (for Jetson with 12 cores)
- GHA/Depot: 3 (to avoid OOM on CI runners)
- Allows flexible parallelism based on build environment
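
At image build time that override would look roughly like this (the build-arg name is illustrative; the Dockerfile path and tag are the ones used in this PR):

  docker build --build-arg BUILD_JOBS=3 \
      -f docker/dockerfiles/Dockerfile.onnx.jetson.6.2.0 \
      -t roboflow/roboflow-inference-server-jetson-6.2.0:latest .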

alexnorell commented:

✅ Depot Build Successful

The Jetson 6.2.0 Docker image built successfully on Depot infrastructure!

Build Run: https://github.com/roboflow/inference/actions/runs/19456686627

Image Tags Produced:

  • roboflow/roboflow-inference-server-jetson-6.2.0:latest
  • roboflow/roboflow-inference-server-jetson-6.2.0:0.61.0

Size: 8.28 GB (41.7% smaller than l4t-jetpack base)

The optimized image is validated and ready for deployment on Jetson 6.2.0 devices.

- Merge requirements.jetson.6.2.0.txt into _requirements.jetson.txt
- Eliminates redundant file since torch/torchvision already installed separately
- Now 4 Jetson requirements files instead of 5
alexnorell mentioned this pull request Nov 18, 2025

alexnorell commented:

Closing in favor of #1730

alexnorell closed this Nov 20, 2025