CUDA JPEG codec


We have created a fast JPEG codec based on NVIDIA CUDA technology. The CUDA JPEG codec developed by Fastvideo combines strict standards compliance with encoding and decoding speeds far above those of existing commercial solutions. It is a full, performance-oriented implementation of Baseline JPEG. Its very fast JPEG compression and decompression on the GPU come from a fully parallel implementation of the Baseline JPEG algorithm. Our JPEG codec is much faster than the best commercial multithreaded JPEG codecs for multicore CPUs.

Fast JPEG image compression features of the CUDA JPEG codec

  • Implementation is 100% compliant with JPEG Baseline Standard
  • Baseline JPEG compression and decompression for grayscale (8-bit) and color (24-bit) images with arbitrary width and height
  • Optional 12-bit JPEG compression for grayscale and color
  • Extremely fast lossy image encoding and decoding with variable compression ratio
  • Subsampling modes: 4:4:4, 4:2:2, 4:2:0
  • Minimum input image size 1×1 for grayscale and color images with any subsampling
  • Maximum input image size is 12,000 × 12,000 or more (optional)
  • JPEG image quality in the range from 1 to 100
  • Read/edit/write any EXIF section
  • Optional parameters: quantization tables
  • Data input: 8/24-bit or 12/36-bit images from RAM/HDD/RAID/SSD/GPU
  • Data output: final compressed/uncompressed 8/24-bit or 12/36-bit image in RAM/HDD/RAID/SSD/GPU
  • Standard input formats: PGM, YUV, PPM, BMP
  • Continuous data mode (input one image after another)
  • Standard set of computations for parallel implementation of Baseline JPEG compression and decompression
    • JPEG Encoding on GPU: Input data parsing, Color Transform, 2D DCT, Quantization, Zig-zag, AC/DC, DPCM, RLE, Huffman coding, Byte stuffing, JFIF formatting
    • JPEG Decoding on GPU: JFIF parsing, Restart marker search, Huffman decoding, Inverse RLE, Inverse DPCM, AC/DC, Inverse Zig-zag, Inverse Quantization, Inverse DCT, Inverse Color Transform, Output formatting
  • Optimized for the latest NVIDIA GPUs
  • Compatibility with FFmpeg to read/write MJPEG streams (FFmpeg is under LGPLv2.1)
  • Optional integration with OpenGL
  • Optional support for input from HD-SDI cards (Bluefish, Deltacast, Imperx)
  • Compatible with Windows 7/8/10 and Linux (32/64-bit)
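
Three of the encoding stages listed above (Zig-zag, DPCM, RLE) can be illustrated with a minimal Python sketch. This is a serial reference written for readability, not Fastvideo's GPU implementation; the function names are hypothetical, and the sketch stops before Huffman coding.

```python
def zigzag_order(n=8):
    """Return the flat indices of an n x n block in JPEG zig-zag scan order."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    # Walk the anti-diagonals: rows ascend on odd diagonals, descend on even ones.
    cells.sort(key=lambda rc: (rc[0] + rc[1],
                               rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))
    return [r * n + c for r, c in cells]


def dpcm_dc(dc_values):
    """Replace each block's DC coefficient by its difference from the previous one."""
    prev, out = 0, []
    for dc in dc_values:
        out.append(dc - prev)
        prev = dc
    return out


def rle_ac(zz_block):
    """Run-length encode the 63 AC coefficients of one zig-zag-ordered block.

    Returns (zero_run, value) pairs. (15, 0) stands for a run of 16 zeros (ZRL)
    and (0, 0) marks end-of-block (EOB).
    """
    ac = zz_block[1:]
    last = max((i for i, v in enumerate(ac) if v != 0), default=-1)
    out, run = [], 0
    for v in ac[:last + 1]:
        if v == 0:
            run += 1
            if run == 16:          # 16 zeros in a row: emit ZRL and restart the run
                out.append((15, 0))
                run = 0
        else:
            out.append((run, v))
            run = 0
    if last < len(ac) - 1:         # trailing zeros are summarized by EOB
        out.append((0, 0))
    return out


# Example: a quantized block (already in zig-zag order) with three nonzero values.
block = [12, 5, 0, 0, -3] + [0] * 59
pairs = rle_ac(block)              # [(0, 5), (2, -3), (0, 0)]
```

On a GPU each 8×8 block can be processed independently up to this point; only the DC differences link neighbouring blocks.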

We have succeeded in parallelizing all stages of the JPEG algorithm, including entropy encoding and decoding. It was widely believed that the Huffman algorithm could only be implemented serially. In our solution Huffman coding is fully parallel and is no longer a bottleneck. We no longer offload anything from the GPU to the CPU to make the JPEG codec faster: the CUDA JPEG codec is extremely fast and runs entirely on the GPU.
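
One commonly described way to remove the serial dependency in Huffman encoding is to compute every symbol's output bit offset with a prefix sum over the code lengths, after which each symbol can write its bits independently. The Python sketch below illustrates that general idea only; the source does not say which method Fastvideo uses, and the code table and function names here are hypothetical toy values, not JPEG tables.

```python
from itertools import accumulate

# Hypothetical toy prefix code: symbol -> (code bits, code length).
CODES = {"a": (0b0, 1), "b": (0b10, 2), "c": (0b110, 3), "d": (0b111, 3)}


def encode_serial(symbols):
    """Reference encoder: append each code to the end of the bit string."""
    return "".join(format(CODES[s][0], f"0{CODES[s][1]}b") for s in symbols)


def encode_with_scan(symbols):
    """Encoder in which every symbol knows its bit offset in advance."""
    lengths = [CODES[s][1] for s in symbols]
    # Exclusive prefix sum: offsets[i] is the bit position where symbol i starts.
    offsets = [0] + list(accumulate(lengths))[:-1]
    bits = ["0"] * sum(lengths)
    # Each iteration touches only its own bit range, so the iterations could run
    # as independent GPU threads. Iterating in reverse shows order is irrelevant.
    for i in reversed(range(len(symbols))):
        code, n = CODES[symbols[i]]
        for k in range(n):
            bits[offsets[i] + k] = str((code >> (n - 1 - k)) & 1)
    return "".join(bits)


message = "abcdab"
assert encode_with_scan(message) == encode_serial(message)
```

The prefix sum itself is a standard parallel primitive on GPUs. Decoding is harder to parallelize because code boundaries are not known in advance, which is where the restart markers mentioned in the decoding stage list come in: they give independent entry points into the bitstream.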

Benchmarks for JPEG encoding on NVIDIA GeForce GTX 1080 (Windows 7 and CUDA 7.5, 64-bit)

We now need just 0.78 ms for Baseline JPEG encoding of a 24-bit color image at 4K resolution (3840 × 2160) with JPEG quality 90% and 4:2:0 subsampling, which corresponds to an image compression ratio of about 10:1. If we include DeviceIO latency (copying image data from host to GPU memory and back), the total compression time is 2.95 ms. We chose these JPEG encoding parameters because they correspond to so-called "visually lossless" compression.

These are performance benchmarks for 24-bit 2K and 4K images (computations on the GPU only, without DeviceIO latency; single-image mode, no batching, no streaming):

  • Full HD (2K, 1920 × 1080) ~ 15 GByte/s
  • 4K (3840 × 2160) ~ 30 GByte/s
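
The timings and throughput figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes that throughput is measured against the uncompressed input (3 bytes per pixel) and that 1 GByte means 10^9 bytes; neither assumption is stated in the source, and the function names are illustrative.

```python
def frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed 24-bit frame in bytes."""
    return width * height * bytes_per_pixel


def throughput_gb_per_s(num_bytes, time_ms):
    """Throughput in GByte/s (10^9 bytes) for a given processing time."""
    return num_bytes / (time_ms * 1e-3) / 1e9


def time_ms_for(num_bytes, gb_per_s):
    """Processing time in milliseconds implied by a given throughput."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


uhd = frame_bytes(3840, 2160)   # 24,883,200 bytes
fhd = frame_bytes(1920, 1080)   #  6,220,800 bytes

print(f"4K, GPU only:       {throughput_gb_per_s(uhd, 0.78):.1f} GByte/s")  # 31.9
print(f"4K, with DeviceIO:  {throughput_gb_per_s(uhd, 2.95):.1f} GByte/s")  # 8.4
print(f"Full HD at 15 GB/s: {time_ms_for(fhd, 15):.2f} ms per frame")       # 0.41
```

Under these assumptions the 0.78 ms figure is consistent with the quoted ~30 GByte/s for 4K, while the transfer-inclusive 2.95 ms corresponds to roughly 8.4 GByte/s, so host-to-device copies dominate the end-to-end time.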

Comparison with the fastest IP Cores for JPEG image compression

The idea of online high-speed JPEG compression is not new, and there are many FPGA implementations of JPEG for that task. Here are several of the fastest IP cores for FPGAs:

  • Cast Inc. – JPEG-E Baseline JPEG Compression Core with processing rates up to 750 MSamples/s.
  • Alma-Tech (SVE-JPEG-E, SpeedView Enabled JPEG Encoder Megafunction) – IP Core for FPGA Altera/Xilinx with throughput up to 500 MSamples/s.
  • Visengi JPEG Encoder – JPEG / MJPEG Hardware Compressor IP Core with throughput up to 405 MSamples/s on Virtex-5 FPGA.

We have obtained much better results on the GPU, though we understand that a GPU is not the solution for every task. We consider the GPU an excellent choice for many high-performance applications, particularly where there are no strict limits on power consumption and physical dimensions.
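
To put the FPGA figures (MSamples/s) and the GPU figures (GByte/s) on one scale, the sketch below converts the 4K GPU throughput quoted in the benchmarks into MSamples/s. It assumes that one sample is one 8-bit color component (one byte) and that 1 GByte is 10^9 bytes; both are assumptions, not statements from the source. The numbers are vendor-quoted peak rates measured under different conditions, so the ratios are only a rough indication.

```python
# Peak rates quoted in the list above, in MSamples/s.
FPGA_MSAMPLES_PER_S = {
    "Cast JPEG-E": 750,
    "Alma-Tech SVE-JPEG-E": 500,
    "Visengi JPEG Encoder": 405,
}

# 4K GPU figure from the benchmarks (GPU computations only, no DeviceIO).
GPU_GBYTE_PER_S = 30


def gpu_msamples_per_s(gbyte_per_s, bytes_per_sample=1):
    """Convert GByte/s to MSamples/s, assuming 1 byte per sample by default."""
    return gbyte_per_s * 1e9 / bytes_per_sample / 1e6


gpu_rate = gpu_msamples_per_s(GPU_GBYTE_PER_S)
ratios = {name: gpu_rate / rate for name, rate in FPGA_MSAMPLES_PER_S.items()}

for name, ratio in ratios.items():
    print(f"GPU vs {name}: about {ratio:.0f}x")
```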

Options for CUDA JPEG image compressor

We also offer a fast SDK for GPU image processing. Here are some benchmarks for combined debayering and JPEG encoding on NVIDIA GeForce GTX 1080 (timings don't include DeviceIO latency):

  • Debayer DFPD + JPEG compression (quality 90%, subsampling 4:2:0) for a Full HD image takes 0.59 ms
  • Debayer DFPD + JPEG compression (quality 90%, subsampling 4:2:0) for a 4K image takes 1.37 ms

We have also included our JPEG compression software in our fast image processing SDK for high-speed, high-resolution cameras. The SDK covers bad pixel removal, dark frame subtraction, shading correction, white balance, demosaicing, denoising, color correction, tone mapping, image filtering, LUT, gamma, color management, histogram, resize, crop, rotate, sharpening, OpenGL or GLFW output, integration with FFmpeg, Bayer compression, J2K encoder, etc.

Licensing for CUDA JPEG Codec

We license CUDA JPEG and other components of the GPU Image & Video Processing SDK to software developers, camera manufacturers and resellers, internet providers, software integrators, etc. Our SDK is used in a wide range of imaging applications. A demo SDK, documentation, licensing information and a quotation are available on request. We also offer custom software design to an agreed specification. If you need a significant speedup for your image processing application, don't hesitate to contact us.

More info about CUDA JPEG image compression

Roadmap 2017 for further improvements of CUDA JPEG Codec
