@alexnorell commented on Nov 18, 2025

Description

Compiles PyTorch, torchvision, and onnxruntime from source to enable numpy 2.x support, while achieving better performance and a smaller image size than the wheel-based approach in #1718.

Benchmark Results

Performance on Jetson AGX Orin (TensorRT FP16):

  • RF-DETR base: 66 FPS @ 15.2ms (+6.3% vs 62.2 FPS baseline)
  • RF-DETR small: 73.6 FPS @ 13.6ms
  • RF-DETR nano: 95 FPS @ 10.7ms
  • 0% error rate across all models

Image Size: 6.75GB (vs 8.28GB = 18.5% smaller)

Key Improvements

  • numpy 2.x support - No longer constrained by outdated Jetson wheels
  • 6.3% faster inference - Jetson-optimized PyTorch compilation
  • 18.5% smaller image - Symlink preservation and conservative cleanup
  • flash-attn 2.8.3 - Latest version with PyTorch built-in support

What's Compiled From Source

  • PyTorch 2.8.0 - Jetson Orin arch (TORCH_CUDA_ARCH_LIST="8.7"), disabled unnecessary features (USE_NCCL=0, USE_QNNPACK=0, USE_XNNPACK=0, USE_FBGEMM=0, USE_KINETO=0), ARM+CUDA linker optimization (USE_PRIORITIZED_TEXT_FOR_LD=1)
  • torchvision 0.23.0 - CUDA support
  • onnxruntime 1.20.0 - TensorRT EP with 4GB workspace, optimization level 5
  • flash-attn 2.8.3 - Latest version (removed legacy rotary, xentropy, fused_softmax, ft_attention)
  • GDAL 3.11.5 - Same build as #1718 (Optimize Jetson 6.2.0 Docker image with l4t-cuda base, 41.7% size reduction)

Size Optimizations

Total savings: ~2.4GB

  • cuDNN/TensorRT symlink preservation (not duplicates): ~2GB (see the sketch after this list)
  • Remove test directories: ~60MB (scipy/*/tests, pandas/tests, onnx/test)
  • Remove development tools: ~119MB (torch/bin, torch/include)
  • Remove Jupyter/debugpy packages: ~50MB
  • Remove examples/benchmarks/docs: ~50MB
  • Conservative cleanup preserving public APIs (numpy.testing, torch.testing, .pyi files)
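
For illustration, the symlink-preservation idea looks roughly like the sketch below (the paths and loop are illustrative, not the exact Dockerfile contents). A naive COPY of the cuDNN/TensorRT directories dereferences the libX.so -> libX.so.9 -> libX.so.9.3.0 symlink chain and stores each library up to three times; copying only the fully versioned files and recreating the links avoids that:

# Copy only the real .so.X.Y.Z shared objects, then rebuild the symlink chain.
for lib in /usr/lib/aarch64-linux-gnu/libcudnn*.so.*.*.*; do
    cp "$lib" /opt/runtime-libs/
done
cd /opt/runtime-libs
for lib in libcudnn*.so.*.*.*; do
    base="${lib%%.so.*}"                           # e.g. libcudnn_adv
    major="${lib##*.so.}"; major="${major%%.*}"    # e.g. 9
    ln -sf "$lib" "${base}.so.${major}"
    ln -sf "${base}.so.${major}" "${base}.so"
done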

Build Configuration

Infrastructure:

  • Depot public runner (2x cost savings vs GitHub Actions)
  • 6-hour timeout for PyTorch/onnxruntime compilation
  • MAX_JOBS=12 to prevent OOM (32-core/64GB machine)

PyTorch build flags:

USE_CUDA=1 USE_CUDNN=1
TORCH_CUDA_ARCH_LIST="8.7"
USE_MKLDNN=0 USE_OPENMP=0
USE_DISTRIBUTED=0 USE_GLOO=0 USE_MPI=0 USE_TENSORPIPE=0 USE_NCCL=0
USE_QNNPACK=0 USE_PYTORCH_QNNPACK=0 USE_XNNPACK=0 USE_NNPACK=0
USE_FBGEMM=0 USE_KINETO=0 USE_CUPTI_SO=0
USE_FLASH_ATTENTION=1 USE_MEM_EFF_ATTENTION=1
USE_PRIORITIZED_TEXT_FOR_LD=1
CMAKE_BUILD_TYPE=Release BUILD_SHARED_LIBS=ON
MAX_JOBS=12
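
For context, a minimal sketch of how these flags drive the build (the checkout and wheel commands are assumptions, not the PR's Dockerfile):

git clone --recursive --branch v2.8.0 https://github.com/pytorch/pytorch.git
cd pytorch
export USE_CUDA=1 USE_CUDNN=1 TORCH_CUDA_ARCH_LIST="8.7"
export USE_DISTRIBUTED=0 USE_NCCL=0 USE_QNNPACK=0 USE_XNNPACK=0 USE_FBGEMM=0
export USE_KINETO=0 USE_PRIORITIZED_TEXT_FOR_LD=1 MAX_JOBS=12
python setup.py bdist_wheel   # wheel lands in dist/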

onnxruntime build flags:

--use_tensorrt --tensorrt_home /usr/lib/aarch64-linux-gnu
--parallel 12
--cmake_extra_defines onnxruntime_USE_FLASH_ATTENTION=OFF
--cmake_extra_defines onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION=OFF
--cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="87"
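
Those flags correspond to an onnxruntime build.sh invocation along these lines (a sketch; the --cuda_home/--cudnn_home paths assume a stock JetPack layout):

./build.sh --config Release --update --build --build_wheel --skip_tests \
    --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \
    --use_tensorrt --tensorrt_home /usr/lib/aarch64-linux-gnu \
    --parallel 12 \
    --cmake_extra_defines onnxruntime_USE_FLASH_ATTENTION=OFF \
    --cmake_extra_defines onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION=OFF \
    --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="87"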

TensorRT optimization (baked into image):

ORT_TENSORRT_FP16_ENABLE=1
ORT_TENSORRT_ENGINE_CACHE_ENABLE=1
ORT_TENSORRT_MAX_WORKSPACE_SIZE=4294967296  # 4GB for maximum performance
ORT_TENSORRT_BUILDER_OPTIMIZATION_LEVEL=5   # Maximum optimization
ONNXRUNTIME_EXECUTION_PROVIDERS=[TensorrtExecutionProvider]
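
Because these are plain environment variables, they can be overridden per container without rebuilding, e.g. to trade engine-build quality for memory on smaller Jetson modules (values illustrative):

docker run --runtime nvidia \
    -e ORT_TENSORRT_MAX_WORKSPACE_SIZE=2147483648 \
    -e ORT_TENSORRT_BUILDER_OPTIMIZATION_LEVEL=3 \
    <image>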

Type of Change

  • Performance improvement (6.3% faster)
  • Size optimization (18.5% smaller)
  • Feature enablement (numpy 2.x)

How Has This Been Tested?

Build: Successfully built on Depot ARM64 builder (~1.5-2 hrs with caching)

Runtime: Container runs successfully, all imports working, GPU acceleration active
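
A smoke test along these lines verifies the runtime claims (a sketch, not the PR's test script):

python -c "import numpy; print(numpy.__version__)"            # expect 2.x
python -c "import torch; print(torch.cuda.is_available())"    # expect True
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# expect TensorrtExecutionProvider in the list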

Benchmark: RF-DETR family (nano through medium) validated on Jetson AGX Orin with TensorRT

Deployment Considerations

  • First run: 10-15 min for TensorRT engine compilation (cached thereafter)
  • Use --volume ~/.inference/cache:/tmp:rw to persist the TensorRT engine cache (see the example after this list)
  • Build time: ~1.5-2 hours on Depot with layer caching
  • Requires numpy>=2.0.0
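
Putting those together, a typical launch might look like this (the image tag is illustrative, not pinned by this PR):

docker run -d --runtime nvidia --network host \
    --volume ~/.inference/cache:/tmp:rw \
    roboflow/roboflow-inference-server-jetson-6.2.0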

@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch 3 times, most recently from dd73a73 to 0bcc14f on November 18, 2025 22:32
@alexnorell changed the base branch from jetson-620-cuda-base-pr to main on November 19, 2025 22:50
@alexnorell commented:

Image Size Breakdown: 6.74GB

@sberan Here's the complete breakdown of what makes up the 6.74GB final image:

Base Layer (2.09GB)

  • nvcr.io/nvidia/l4t-cuda:12.6.11-runtime base image

Added Layers (4.65GB total)

1. cuDNN Libraries: ~1GB (with symlinks properly preserved)

  • libcudnn_adv.so.9.3.0: 276M
  • libcudnn_engines_precompiled.so.9.3.0: 487M
  • libcudnn_heuristic.so.9.3.0: 52M
  • Other cuDNN components: ~185M
  • Note: Only the actual .so.X.Y.Z files are copied; the symlinks are recreated afterward (saves ~2GB)

2. Python Packages: 2.67GB (after cleanup)

  • torch: ~600M (source-compiled, optimized for Jetson)
  • onnxruntime-gpu: ~350M (TensorRT EP)
  • bitsandbytes: 325M
  • jaxlib: 278M (required by mediapipe)
  • flash-attn: ~200M (v2.8.3)
  • scipy: ~90M
  • transformers: 55M
  • mediapipe: 55M
  • pandas: 47M
  • OpenCV (3 variants): ~160M
  • Other packages: ~500M

3. TensorRT Libraries: 986MB

  • libnvinfer*.so: Main TensorRT runtime
  • libnvonnxparser*.so: ONNX parser
  • libnvparsers*.so: Additional parsers

4. Runtime APT Packages: 247MB

  • libvips42, libopenblas0, libproj22, libavcodec58, libavformat58, etc.

5. GDAL: ~100MB

  • Binaries (gdal*, ogr*, gnm*): ~5MB
  • Libraries (libgdal*): 95MB
  • Data files: ~3MB

6. Other: ~200MB

  • Application code: ~8MB
  • cupti/nvToolsExt: ~22MB
  • Python dist-info metadata: ~20MB
  • Other system libs: ~150MB

Size Optimizations Applied

What we removed (~500MB saved):

  • Test directories: scipy/pandas/onnx tests (~60MB)
  • torch/bin, torch/include dev tools (~119MB)
  • Jupyter/IPython/debugpy packages (~50MB)
  • examples/benchmarks/docs across packages (~50MB)
  • skimage/data test images (~7.5MB)
  • __pycache__ directories
  • GDAL/cuDNN headers (~1MB)

What we preserved (required for functionality):

  • numpy.testing, torch.testing (public APIs depend on them)
  • .pyi stub files (lazy_loader/type checkers need them)

What we optimized:

  • cuDNN/TensorRT symlinks instead of duplicates: ~2GB saved

Comparison

The source-compilation approach produces a leaner, faster image than the prebuilt-wheel baseline, despite building everything from scratch.

@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch from c9590c9 to 63626a5 on November 20, 2025 23:48
@alexnorell changed the title from "[WIP] Compile PyTorch/torchvision from source for numpy 2.x support" to "Jetpack 6.2 Support" on Nov 20, 2025
@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch from 63626a5 to 502826c on November 21, 2025 00:00
…upport

- PyTorch 2.8.0 with Jetson Orin optimizations (arch 8.7, ARM+CUDA linker optimization)
- Disabled unnecessary features (NCCL, QNNPACK, XNNPACK, FBGEMM, Kineto, etc.)
- torchvision 0.23.0 with CUDA support
- onnxruntime 1.20.0 with TensorRT EP
- flash-attn 2.8.3 (latest version)

Performance: 65.7 FPS (vs 62.2 FPS baseline = 5.6% faster)
Image size: 6.74GB (vs 8.28GB baseline = 18.6% smaller)

Size optimizations:
- cuDNN/TensorRT symlink preservation: ~2GB saved
- Remove test directories, dev tools, examples: ~500MB saved
- Conservative cleanup preserving public APIs (numpy.testing, torch.testing)

TensorRT optimization:
- FP16 precision enabled
- Engine caching enabled with 2GB workspace
- Builder optimization level 3
- Aux streams optimized for memory efficiency
@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch from 502826c to f26bf0a on November 21, 2025 00:42
@alexnorell merged commit 59e39ab into main on Nov 21, 2025
41 checks passed
@alexnorell deleted the jetson-620-compile-pytorch-from-source branch on November 21, 2025 14:45