Jetpack 6.2 Support #1730
Conversation
Force-pushed from dd73a73 to 0bcc14f
Image Size Breakdown: 6.74GB

@sberan Here's the complete breakdown of what makes up the 6.74GB final image:

Base Layer (2.09GB)

Added Layers (4.65GB total)
1. cuDNN Libraries: ~1GB (with symlinks properly preserved)
2. Python Packages: 2.67GB (after cleanup)
3. TensorRT Libraries: 986MB
4. Runtime APT Packages: 247MB
5. GDAL: ~100MB
6. Other: ~200MB
Size Optimizations Applied

What we removed (~500MB saved):
What we preserved (required for functionality):
What we optimized:
Comparison
The source compilation approach produces a leaner, faster image despite compiling everything from scratch!
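For anyone who wants to reproduce a breakdown like the one above, per-layer sizes can be read straight off the built image. A minimal sketch using `docker history` (the image tag is a placeholder, not the tag produced by this PR):

```bash
# Inspect per-layer sizes of the final image (tag is hypothetical).
docker history --no-trunc \
  --format "{{.Size}}\t{{.CreatedBy}}" \
  inference-server:jetson-6.2 | head -n 20
```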
Force-pushed from c9590c9 to 63626a5
Force-pushed from 63626a5 to 502826c
…upport

- PyTorch 2.8.0 with Jetson Orin optimizations (arch 8.7, ARM+CUDA linker optimization)
- Disabled unnecessary features (NCCL, QNNPACK, XNNPACK, FBGEMM, Kineto, etc.)
- torchvision 0.23.0 with CUDA support
- onnxruntime 1.20.0 with TensorRT EP
- flash-attn 2.8.3 (latest version)

Performance: 65.7 FPS (vs 62.2 FPS baseline = 5.6% faster)
Image size: 6.74GB (vs 8.28GB baseline = 18.6% smaller)

Size optimizations:
- cuDNN/TensorRT symlink preservation: ~2GB saved
- Remove test directories, dev tools, examples: ~500MB saved
- Conservative cleanup preserving public APIs (numpy.testing, torch.testing)

TensorRT optimization:
- FP16 precision enabled
- Engine caching enabled with 2GB workspace
- Builder optimization level 3
- Aux streams optimized for memory efficiency
Force-pushed from 502826c to f26bf0a
Description
Compiles PyTorch, torchvision, and onnxruntime from source to enable numpy 2.x support while achieving better performance and smaller image size than the wheel-based approach in #1718.
Benchmark Results
Performance on Jetson AGX Orin (TensorRT FP16): 65.7 FPS vs 62.2 FPS baseline (5.6% faster)
Image Size: 6.75GB (vs 8.28GB = 18.5% smaller)
Key Improvements
What's Compiled From Source
- PyTorch 2.8.0: built for Jetson Orin (TORCH_CUDA_ARCH_LIST="8.7"), disabled unnecessary features (USE_NCCL=0, USE_QNNPACK=0, USE_XNNPACK=0, USE_FBGEMM=0, USE_KINETO=0), ARM+CUDA linker optimization (USE_PRIORITIZED_TEXT_FOR_LD=1)
- torchvision 0.23.0: built with CUDA support
- onnxruntime 1.20.0: built with the TensorRT execution provider
- flash-attn 2.8.3: built with optional components (rotary, xentropy, fused_softmax, ft_attention); see the build sketch below
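The flash-attn optional components ship as separate source packages inside the upstream flash-attention repository, so they can be installed individually. A hedged sketch of one way to build them; the subdirectory layout assumes the upstream Dao-AILab/flash-attention repo, and these are not necessarily the exact commands used in this PR:

```bash
# Build flash-attn 2.8.3 plus its optional CUDA extensions from source.
# Subdirectory names follow the upstream repo's csrc/ layout (assumption).
FLASH_REF=v2.8.3
pip install "git+https://github.com/Dao-AILab/flash-attention.git@${FLASH_REF}"
for ext in rotary xentropy fused_softmax ft_attention; do
  pip install "git+https://github.com/Dao-AILab/flash-attention.git@${FLASH_REF}#subdirectory=csrc/${ext}"
done
```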
Size Optimizations
Total savings: ~2.4GB
- Removed test directories (scipy/*/tests, pandas/tests, onnx/test)
- Removed dev tools and headers (torch/bin, torch/include)
- Preserved public APIs (numpy.testing, torch.testing, .pyi files); see the cleanup sketch below
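A hedged sketch of what such a cleanup pass can look like. The site-packages prefix is an assumption for a JetPack base image, and the deletions simply mirror the list above:

```bash
# Hypothetical cleanup pass over the installed Python packages.
# SITE is an assumed install prefix; adjust for the actual base image.
SITE=/usr/local/lib/python3.10/dist-packages
rm -rf "${SITE}/pandas/tests" "${SITE}/onnx/test"          # test suites
find "${SITE}/scipy" -type d -name tests -prune \
  -exec rm -rf {} +                                        # scipy/*/tests
rm -rf "${SITE}/torch/bin" "${SITE}/torch/include"         # dev tools/headers
# Deliberately keep numpy.testing, torch.testing, and *.pyi stubs:
# they are public APIs that downstream packages import at runtime.
```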
Build Configuration

Infrastructure:
- MAX_JOBS=12 to prevent OOM (32-core/64GB machine)

PyTorch build flags:
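The flags below are the ones named in this PR; framing them as exported env vars ahead of a wheel build is illustrative, and other settings may have been used as well:

```bash
# PyTorch 2.8.0 source build configuration (flags as listed in this PR).
export TORCH_CUDA_ARCH_LIST="8.7"        # Jetson Orin (Ampere, SM 8.7)
export USE_NCCL=0                        # multi-GPU collectives not needed
export USE_QNNPACK=0 USE_XNNPACK=0       # mobile CPU backends not needed
export USE_FBGEMM=0                      # x86-only quantization backend
export USE_KINETO=0                      # profiler not needed at runtime
export USE_PRIORITIZED_TEXT_FOR_LD=1     # ARM+CUDA linker optimization
export MAX_JOBS=12                       # cap parallelism to avoid OOM
python setup.py bdist_wheel
```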
onnxruntime build flags:
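The exact onnxruntime flags didn't survive extraction here; a typical from-source invocation with the CUDA and TensorRT EPs enabled looks roughly like this (the CUDA/TensorRT paths are assumptions for a JetPack 6.2 rootfs):

```bash
# Sketch of an onnxruntime 1.20.0 build with CUDA + TensorRT EPs enabled.
./build.sh --config Release --build_wheel --parallel \
  --use_cuda --cuda_home /usr/local/cuda \
  --cudnn_home /usr/lib/aarch64-linux-gnu \
  --use_tensorrt --tensorrt_home /usr/lib/aarch64-linux-gnu \
  --skip_tests
```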
TensorRT optimization (baked into image):
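The settings listed in this PR (FP16, engine caching with a 2GB workspace, builder optimization level 3, aux streams tuned for memory) map onto the onnxruntime TensorRT EP options. A sketch using the EP's environment variables; the cache path is an assumption:

```bash
# TensorRT execution provider settings baked into the image (sketch).
export ORT_TENSORRT_FP16_ENABLE=1                    # FP16 precision
export ORT_TENSORRT_ENGINE_CACHE_ENABLE=1            # cache built engines
export ORT_TENSORRT_CACHE_PATH=/tmp/trt-cache        # assumed location
export ORT_TENSORRT_MAX_WORKSPACE_SIZE=2147483648    # 2GB workspace
export ORT_TENSORRT_BUILDER_OPTIMIZATION_LEVEL=3     # builder opt level 3
export ORT_TENSORRT_AUXILIARY_STREAMS=0              # favor memory over streams
```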
Type of Change
How Has This Been Tested?
Build: Successfully built on Depot ARM64 builder (~1.5-2 hrs with caching)
Runtime: Container runs successfully, all imports working, GPU acceleration active
Benchmark: RF-DETR family (nano through medium) validated on Jetson AGX Orin with TensorRT
Deployment Considerations
- Mount --volume ~/.inference/cache:/tmp:rw to persist the TensorRT engine cache across container restarts (see the run example below)
- Depends on numpy>=2.0.0
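A sketch of a run command with the cache mount in place; the image tag and port mapping are placeholders, only the volume flag comes from this PR:

```bash
# Run the container with a persistent TensorRT engine cache.
# Image tag and port are hypothetical; the volume mount is the part
# this PR calls out.
docker run --rm --runtime nvidia \
  --volume ~/.inference/cache:/tmp:rw \
  -p 9001:9001 \
  inference-server:jetson-6.2
```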