@alexnorell commented on Nov 18, 2025

Description

Compiles PyTorch, torchvision, and onnxruntime from source to enable numpy 2.x support, while achieving better performance and a smaller image size than the wheel-based approach in #1718.

Benchmark Results

Performance on Jetson AGX Orin (TensorRT FP16):

  • RF-DETR base: 66 FPS @ 15.2ms (+6.3% vs 62.2 FPS baseline)
  • RF-DETR small: 73.6 FPS @ 13.6ms
  • RF-DETR nano: 95 FPS @ 10.7ms
  • 0% error rate across all models

Image Size: 6.75GB (vs 8.28GB = 18.5% smaller)

Key Improvements

  • numpy 2.x support - No longer constrained by outdated Jetson wheels
  • 6.3% faster inference - Jetson-optimized PyTorch compilation
  • 18.5% smaller image - Symlink preservation and conservative cleanup
  • flash-attn 2.8.3 - Latest version with PyTorch built-in support

What's Compiled From Source

  • PyTorch 2.8.0 - Jetson Orin arch (TORCH_CUDA_ARCH_LIST="8.7"), disabled unnecessary features (USE_NCCL=0, USE_QNNPACK=0, USE_XNNPACK=0, USE_FBGEMM=0, USE_KINETO=0), ARM+CUDA linker optimization (USE_PRIORITIZED_TEXT_FOR_LD=1)
  • torchvision 0.23.0 - CUDA support
  • onnxruntime 1.20.0 - TensorRT EP with 4GB workspace, optimization level 5
  • flash-attn 2.8.3 - Latest version (removed legacy rotary, xentropy, fused_softmax, ft_attention)
  • GDAL 3.11.5 - Same build as #1718 (Optimize Jetson 6.2.0 Docker image with l4t-cuda base, 41.7% size reduction)

Size Optimizations

Total savings: ~2.4GB

  • cuDNN/TensorRT symlink preservation (not duplicates): ~2GB (see the sketch after this list)
  • Remove test directories: ~60MB (scipy/*/tests, pandas/tests, onnx/test)
  • Remove development tools: ~119MB (torch/bin, torch/include)
  • Remove Jupyter/debugpy packages: ~50MB
  • Remove examples/benchmarks/docs: ~50MB
  • Conservative cleanup preserving public APIs (numpy.testing, torch.testing, .pyi files)
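
For illustration, the symlink-preservation idea looks roughly like the sketch below (the paths and loop are illustrative, not the exact Dockerfile contents). A naive COPY of the cuDNN/TensorRT directories dereferences the libX.so -> libX.so.9 -> libX.so.9.3.0 symlink chain and stores each library up to three times; copying only the fully versioned files and recreating the links avoids that:

# Copy only the real .so.X.Y.Z shared objects, then rebuild the symlink chain.
for lib in /usr/lib/aarch64-linux-gnu/libcudnn*.so.*.*.*; do
    cp "$lib" /opt/runtime-libs/
done
cd /opt/runtime-libs
for lib in libcudnn*.so.*.*.*; do
    base="${lib%%.so.*}"                           # e.g. libcudnn_adv
    major="${lib##*.so.}"; major="${major%%.*}"    # e.g. 9
    ln -sf "$lib" "${base}.so.${major}"
    ln -sf "${base}.so.${major}" "${base}.so"
done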

Build Configuration

Infrastructure:

  • Depot public runner (2x cost savings vs GitHub Actions)
  • 6-hour timeout for PyTorch/onnxruntime compilation
  • MAX_JOBS=12 to prevent OOM (32-core/64GB machine)

PyTorch build flags:

USE_CUDA=1 USE_CUDNN=1
TORCH_CUDA_ARCH_LIST="8.7"
USE_MKLDNN=0 USE_OPENMP=0
USE_DISTRIBUTED=0 USE_GLOO=0 USE_MPI=0 USE_TENSORPIPE=0 USE_NCCL=0
USE_QNNPACK=0 USE_PYTORCH_QNNPACK=0 USE_XNNPACK=0 USE_NNPACK=0
USE_FBGEMM=0 USE_KINETO=0 USE_CUPTI_SO=0
USE_FLASH_ATTENTION=1 USE_MEM_EFF_ATTENTION=1
USE_PRIORITIZED_TEXT_FOR_LD=1
CMAKE_BUILD_TYPE=Release BUILD_SHARED_LIBS=ON
MAX_JOBS=12
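
For context, a minimal sketch of how these flags drive the build (the checkout and wheel commands are assumptions, not the PR's Dockerfile):

git clone --recursive --branch v2.8.0 https://github.com/pytorch/pytorch.git
cd pytorch
export USE_CUDA=1 USE_CUDNN=1 TORCH_CUDA_ARCH_LIST="8.7"
export USE_DISTRIBUTED=0 USE_NCCL=0 USE_QNNPACK=0 USE_XNNPACK=0 USE_FBGEMM=0
export USE_KINETO=0 USE_PRIORITIZED_TEXT_FOR_LD=1 MAX_JOBS=12
python setup.py bdist_wheel   # wheel lands in dist/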

onnxruntime build flags:

--use_tensorrt --tensorrt_home /usr/lib/aarch64-linux-gnu
--parallel 12
--cmake_extra_defines onnxruntime_USE_FLASH_ATTENTION=OFF
--cmake_extra_defines onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION=OFF
--cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="87"
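
Those flags correspond to an onnxruntime build.sh invocation along these lines (a sketch; the --cuda_home/--cudnn_home paths assume a stock JetPack layout):

./build.sh --config Release --update --build --build_wheel --skip_tests \
    --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \
    --use_tensorrt --tensorrt_home /usr/lib/aarch64-linux-gnu \
    --parallel 12 \
    --cmake_extra_defines onnxruntime_USE_FLASH_ATTENTION=OFF \
    --cmake_extra_defines onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION=OFF \
    --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="87"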

TensorRT optimization (baked into image):

ORT_TENSORRT_FP16_ENABLE=1
ORT_TENSORRT_ENGINE_CACHE_ENABLE=1
ORT_TENSORRT_MAX_WORKSPACE_SIZE=4294967296  # 4GB for maximum performance
ORT_TENSORRT_BUILDER_OPTIMIZATION_LEVEL=5   # Maximum optimization
ONNXRUNTIME_EXECUTION_PROVIDERS=[TensorrtExecutionProvider]
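
Because these are plain environment variables, they can be overridden per container without rebuilding, e.g. to trade engine-build quality for memory on smaller Jetson modules (values illustrative):

docker run --runtime nvidia \
    -e ORT_TENSORRT_MAX_WORKSPACE_SIZE=2147483648 \
    -e ORT_TENSORRT_BUILDER_OPTIMIZATION_LEVEL=3 \
    <image>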

Type of Change

  • Performance improvement (6.3% faster)
  • Size optimization (18.5% smaller)
  • Feature enablement (numpy 2.x)

How Has This Been Tested?

Build: Successfully built on Depot ARM64 builder (~1.5-2 hrs with caching)

Runtime: Container runs successfully, all imports working, GPU acceleration active
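
A smoke test along these lines verifies the runtime claims (a sketch, not the PR's test script):

python -c "import numpy; print(numpy.__version__)"            # expect 2.x
python -c "import torch; print(torch.cuda.is_available())"    # expect True
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# expect TensorrtExecutionProvider in the list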

Benchmark: RF-DETR family (nano through medium) validated on Jetson AGX Orin with TensorRT

Deployment Considerations

  • First run: 10-15 min for TensorRT engine compilation (cached thereafter)
  • Use --volume ~/.inference/cache:/tmp:rw to persist the TensorRT engine cache (see the example after this list)
  • Build time: ~1.5-2 hours on Depot with layer caching
  • Requires numpy>=2.0.0
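
Putting those together, a typical launch might look like this (the image tag is illustrative, not pinned by this PR):

docker run -d --runtime nvidia --network host \
    --volume ~/.inference/cache:/tmp:rw \
    roboflow/roboflow-inference-server-jetson-6.2.0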

@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch 3 times, most recently from dd73a73 to 0bcc14f on November 18, 2025 22:32
@alexnorell changed the base branch from jetson-620-cuda-base-pr to main on November 19, 2025 22:50
@alexnorell commented:

Image Size Breakdown: 6.74GB

@sberan Here's the complete breakdown of what makes up the 6.74GB final image:

Base Layer (2.09GB)

  • nvcr.io/nvidia/l4t-cuda:12.6.11-runtime base image

Added Layers (4.65GB total)

1. cuDNN Libraries: ~1GB (with symlinks properly preserved)

  • libcudnn_adv.so.9.3.0: 276M
  • libcudnn_engines_precompiled.so.9.3.0: 487M
  • libcudnn_heuristic.so.9.3.0: 52M
  • Other cuDNN components: ~185M
  • Note: Only the actual .so.X.Y.Z files are copied; the symlinks are recreated afterward (saves ~2GB)

2. Python Packages: 2.67GB (after cleanup)

  • torch: ~600M (source-compiled, optimized for Jetson)
  • onnxruntime-gpu: ~350M (TensorRT EP)
  • bitsandbytes: 325M
  • jaxlib: 278M (required by mediapipe)
  • flash-attn: ~200M (v2.8.3)
  • scipy: ~90M
  • transformers: 55M
  • mediapipe: 55M
  • pandas: 47M
  • OpenCV (3 variants): ~160M
  • Other packages: ~500M

3. TensorRT Libraries: 986MB

  • libnvinfer*.so: Main TensorRT runtime
  • libnvonnxparser*.so: ONNX parser
  • libnvparsers*.so: Additional parsers

4. Runtime APT Packages: 247MB

  • libvips42, libopenblas0, libproj22, libavcodec58, libavformat58, etc.

5. GDAL: ~100MB

  • Binaries (gdal*, ogr*, gnm*): ~5MB
  • Libraries (libgdal*): 95MB
  • Data files: ~3MB

6. Other: ~200MB

  • Application code: ~8MB
  • cupti/nvToolsExt: ~22MB
  • Python dist-info metadata: ~20MB
  • Other system libs: ~150MB

Size Optimizations Applied

What we removed (~500MB saved):

  • Test directories: scipy/pandas/onnx tests (~60MB)
  • torch/bin, torch/include dev tools (~119MB)
  • Jupyter/IPython/debugpy packages (~50MB)
  • examples/benchmarks/docs across packages (~50MB)
  • skimage/data test images (~7.5MB)
  • __pycache__ directories
  • GDAL/cuDNN headers (~1MB)

What we preserved (required for functionality):

  • numpy.testing, torch.testing (public APIs depend on them)
  • .pyi stub files (lazy_loader/type checkers need them)

What we optimized:

  • cuDNN/TensorRT symlinks instead of duplicates: ~2GB saved

Comparison

The source-compilation approach produces a leaner, faster image than the prebuilt-wheel baseline, despite building everything from scratch.

@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch from c9590c9 to 63626a5 on November 20, 2025 23:48
@alexnorell changed the title from "[WIP] Compile PyTorch/torchvision from source for numpy 2.x support" to "Jetpack 6.2 Support" on Nov 20, 2025
@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch from 63626a5 to 502826c on November 21, 2025 00:00
…upport

- PyTorch 2.8.0 with Jetson Orin optimizations (arch 8.7, ARM+CUDA linker optimization)
- Disabled unnecessary features (NCCL, QNNPACK, XNNPACK, FBGEMM, Kineto, etc.)
- torchvision 0.23.0 with CUDA support
- onnxruntime 1.20.0 with TensorRT EP
- flash-attn 2.8.3 (latest version)

Performance: 65.7 FPS (vs 62.2 FPS baseline = 5.6% faster)
Image size: 6.74GB (vs 8.28GB baseline = 18.6% smaller)

Size optimizations:
- cuDNN/TensorRT symlink preservation: ~2GB saved
- Remove test directories, dev tools, examples: ~500MB saved
- Conservative cleanup preserving public APIs (numpy.testing, torch.testing)

TensorRT optimization:
- FP16 precision enabled
- Engine caching enabled with 2GB workspace
- Builder optimization level 3
- Aux streams optimized for memory efficiency
@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch from 502826c to f26bf0a on November 21, 2025 00:42
@alexnorell merged commit 59e39ab into main on Nov 21, 2025
41 checks passed
@alexnorell deleted the jetson-620-compile-pytorch-from-source branch on November 21, 2025 14:45