Benchmark comparison for Jetson Nano, TX2, Xavier NX and AGX

Author: Fyodor Serzhenko

NVIDIA has released a series of Jetson hardware modules for embedded applications. NVIDIA® Jetson is the world's leading embedded platform for image processing and DL/AI tasks. Its high-performance, low-power computing for deep learning and computer vision makes it the ideal platform for mobile compute-intensive projects.

We've developed an Image & Video Processing SDK for NVIDIA Jetson hardware. Here we present performance benchmarks for the available Jetson modules. For the benchmark, we use the image processing pipeline of a basic camera application as a good, practical example.

Hardware features for Jetson Nano, TX2, Xavier NX and AGX Xavier

Here we present a brief comparison of Jetson hardware features to show the progress and variety of NVIDIA's mobile solutions. These modules are aimed at different markets and tasks.

Table 1. Hardware comparison for Jetson modules

| Hardware feature \ Jetson module | Jetson Nano | Jetson TX2/TX2i | Jetson Xavier NX | Jetson AGX Xavier |
|---|---|---|---|---|
| CPU (ARM) | 4-core ARM A57 @ 1.43 GHz | 4-core ARM Cortex-A57 @ 2 GHz, 2-core Denver2 @ 2 GHz | 6-core ARM Carmel v8.2 | 8-core ARM Carmel v8.2 @ 2.26 GHz |
| GPU | 128-core Maxwell @ 921 MHz | 256-core Pascal @ 1.3 GHz | 384-core Volta | 512-core Volta @ 1.37 GHz |
| Memory | 4 GB LPDDR4, 25.6 GB/s | 8 GB 128-bit LPDDR4, 58.3 GB/s | 8 GB 128-bit LPDDR4, 51.2 GB/s | 16 GB 256-bit LPDDR4, 137 GB/s |
| Storage | MicroSD | 32 GB eMMC 5.1 | 16 GB eMMC 5.1 | 32 GB eMMC 5.1 |
| Tensor cores | -- | -- | 48 | 64 |
| Video encoding (H.265) | 1x 4K30, 2x 1080p60 | 1x 4K60, 3x 4K30, 4x 1080p60 | 2x 4K30, 6x 1080p60 | 4x 4K60, 16x 1080p60, 32x 1080p30 |
| Video decoding (H.265) | 1x 4K60, 4x 1080p60 | 2x 4K60, 7x 1080p60, 14x 1080p30 | 2x 4K60, 12x 1080p60, 16x 1080p30 | 2x 8K30, 6x 4K60, 26x 1080p60, 72x 1080p30 |
| USB | (4x) USB 3.0 + Micro-USB 2.0 | (1x) USB 3.0 + (1x) USB 2.0 | (3x) USB 3.1 + (4x) USB 2.0 | (3x) USB 3.1 + (4x) USB 2.0 |
| PCI-Express lanes | 4 lanes PCIe Gen 2 | 5 lanes PCIe Gen 2 | 1 x1 (PCIe Gen 3) + 1 x4 (PCIe Gen 4) | 16 lanes PCIe Gen 4 |
| Power | 5W / 10W | 7.5W / 15W | 10W / 15W | 10W / 15W / 30W |

In camera applications, we can usually hide Host-to-Device transfers by implementing GPU Zero Copy or by overlapping GPU copy/compute. Device-to-Host transfers can be hidden via copy/compute overlap.
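To see why hiding transfers matters, here is a minimal timing model in Python. It is a back-of-the-envelope sketch, not CUDA code and not part of Fastvideo SDK: it assumes constant per-frame stage times and that upload, compute and download can each run concurrently on their own hardware unit. The function names and the example numbers are hypothetical placeholders, not measurements.

```python
def serial_time_ms(n_frames, h2d_ms, kernel_ms, d2h_ms):
    """No overlap: each frame is uploaded, processed and downloaded in turn."""
    return n_frames * (h2d_ms + kernel_ms + d2h_ms)


def overlapped_time_ms(n_frames, h2d_ms, kernel_ms, d2h_ms):
    """Copy/compute overlap modelled as a three-stage pipeline.

    The first frame pays for all three stages; after that, one frame
    completes per period of the slowest stage.
    """
    if n_frames == 0:
        return 0.0
    fill = h2d_ms + kernel_ms + d2h_ms
    period = max(h2d_ms, kernel_ms, d2h_ms)
    return fill + (n_frames - 1) * period


if __name__ == "__main__":
    # Placeholder stage times in milliseconds (assumed, not measured).
    h2d, kernel, d2h = 2.0, 10.0, 1.0
    frames = 100
    print("serial     :", serial_time_ms(frames, h2d, kernel, d2h), "ms")
    print("overlapped :", overlapped_time_ms(frames, h2d, kernel, d2h), "ms")
    # Zero Copy is modelled here simply as an upload that costs nothing.
    print("zero copy  :", overlapped_time_ms(frames, 0.0, kernel, d2h), "ms")
```

In this model, once transfers are shorter than the GPU processing time they stop contributing to the steady-state frame period, which is why the kernel times alone are a useful figure for comparing modules.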

Hardware and software for benchmarking

  • CPU/GPU: NVIDIA Jetson Nano, TX2, Xavier NX and AGX Xavier
  • OS: L4T (Ubuntu 18.04)
  • CUDA Toolkit: 10.2 for Jetson Nano, TX2, Xavier NX and AGX Xavier
  • Fastvideo SDK: 0.16.4

NVIDIA Jetson Comparison: Nano vs TX2 vs Xavier NX vs AGX Xavier

For these NVIDIA Jetson modules, we've benchmarked the following standard image processing tasks that are typical of camera applications: white balance, demosaic (debayer), color correction, resize, JPEG encoding, etc. This is not the full set of Fastvideo SDK features, just an example to show what kind of performance you can get from each Jetson. You can also choose a particular debayer algorithm and output compression (JPEG or JPEG2000) for your pipeline.

Table 2. GPU kernel times for 2K image processing (1920×1080, 16 bits per channel, milliseconds)

| Algorithm and parameters / Jetson model | Jetson Nano | Jetson TX2/TX2i | Jetson Xavier NX | Jetson AGX Xavier |
|---|---|---|---|---|
| White Balance | 0.6 | 0.24 | 0.19 | 0.08 |
| L7 Debayer (window 7×7) | 1.95 | 0.87 | 0.61 | 0.40 |
| DFPD Debayer (window 11×11) | 4.7 | 2.06 | 1.08 | 0.95 |
| MG Debayer (window 23×23) | 12.7 | 5.9 | 2.73 | 2.2 |
| Color Correction with 3×4 matrix | 1.7 | 0.81 | 0.55 | 0.25 |
| Resize from 2K to 960×540 | 10.0 | 4.3 | 2.21 | 1.5 |
| Resize from 2K to 1919×1079 | 19.8 | 8.2 | 4.34 | 2.4 |
| Gamma (1920×1080) | 1.4 | 0.84 | 0.42 | 0.2 |
| JPEG compression (1920×1080, 90%, 4:2:0) | 4.3 | 1.7 | 1.09 | 0.62 |
| JPEG compression (1920×1080, 90%, 4:4:4) | 6.8 | 2.6 | 1.5 | 0.75 |
| Total for simple camera pipeline (ms) | 9.95 | 4.8 | 2.85 | 1.53 |

Total kernel times are calculated from the highlighted rows of the table, which correspond to a simple camera pipeline. They show the maximum GPU performance for that set of image processing modules. These are estimates of GPU processing time only: there are pauses between kernels, and CPU-based processing also affects the total processing time, so the real total will be higher than the total kernel time.
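As a worked example, the short Python sketch below sums kernel times from Table 2 and converts the listed totals into an upper bound on frame rate. The row highlighting is not visible in the table as shown here, so the choice of rows is an assumption: white balance, L7 debayer, color correction, gamma and JPEG 4:2:0. That selection reproduces the listed Jetson Nano total exactly and the Xavier NX and AGX Xavier totals to within 0.02 ms, while the TX2 column sums to 4.46 ms rather than the listed 4.8 ms. The function names are illustrative only.

```python
# Kernel times in ms from Table 2, in the order of MODULES.
MODULES = ("Nano", "TX2", "Xavier NX", "AGX Xavier")

# Assumed pipeline rows (the highlighted rows are inferred, see text above).
KERNEL_MS = {
    "white_balance":    (0.6,  0.24, 0.19, 0.08),
    "debayer_l7":       (1.95, 0.87, 0.61, 0.40),
    "color_correction": (1.7,  0.81, 0.55, 0.25),
    "gamma":            (1.4,  0.84, 0.42, 0.2),
    "jpeg_420":         (4.3,  1.7,  1.09, 0.62),
}

# "Total for simple camera pipeline (ms)" row as listed in Table 2.
LISTED_TOTAL_MS = (9.95, 4.8, 2.85, 1.53)


def pipeline_totals(stages=KERNEL_MS):
    """Sum the per-stage kernel times column by column (one total per module)."""
    return tuple(round(sum(column), 2) for column in zip(*stages.values()))


def max_fps(total_ms):
    """Upper bound on frames per second if only kernel time counted."""
    return 1000.0 / total_ms


if __name__ == "__main__":
    for name, summed, listed in zip(MODULES, pipeline_totals(), LISTED_TOTAL_MS):
        print(f"{name:10s} summed {summed:5.2f} ms, listed {listed:5.2f} ms, "
              f"at most {max_fps(listed):6.1f} fps")
```

For Jetson Nano, the listed 9.95 ms corresponds to an upper bound of about 100 frames per second at 1920×1080. Real applications will be slower for the reasons given above.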

Each Jetson module was run in its maximum-performance mode:

  • MAX-N mode for Jetson AGX Xavier
  • 15W for Jetson Xavier NX and Jetson TX2
  • 10W for Jetson Nano

Here we've compared just the basic set of image processing modules from Fastvideo SDK to let Jetson developers evaluate the expected performance before building their imaging applications. Image processing from RAW to RGB or from RAW to JPEG is a standard task, and developers can now use the table above to get detailed information about the expected performance of their chosen pipeline. We haven't tested the Jetson H.264 and H.265 encoders and decoders in that pipeline. Since the H.264 and H.265 encoders are implemented in hardware, encoding can run in parallel with CUDA code, so we should be able to get even better performance.

We've done the same kernel time measurements for NVIDIA GeForce and Quadro GPUs. Those results are available as a separate benchmark document.

Software for Jetson performance comparison

We've released the software for a GPU-based camera application on GitHub, where both binaries and source code for our gpu camera sample project are available for download. It runs on Windows 7/10, Linux Ubuntu 18.04 and L4T. In addition to a full GPU image processing pipeline for still images from SSD and for live camera output, it offers options for streaming and for glass-to-glass (G2G) measurements to evaluate the real latency of camera systems on Jetson. The software currently works with machine vision cameras from XIMEA, Basler, FLIR, JAI, Matrix Vision, Imperex, Daheng Imaging, etc.

To check the performance of Fastvideo SDK on a laptop, desktop or server GPU without any programming, you can download the Fast CinemaDNG Processor software, which has a GUI for Windows or Linux. That software has a Performance Benchmarks window where you can see the timing for each stage of image processing. This is a more sophisticated method of performance testing, because the image processing pipeline in that software can be quite advanced, and you can test any module you need. You can also run tests on images with different resolutions to see how much the performance depends on image size, content and other parameters.
