CUDA Denoiser filter for camera software

Denoising is widely used in camera applications, especially in low-light conditions. We have developed several CUDA-accelerated denoising kernels that run on existing NVIDIA hardware on Windows, Linux, and ARM platforms. On a GeForce RTX 4090 they process an 8.9 MPix, 16-bit RGB image in 0.13 to 3.05 ms depending on the algorithm, which is fast enough for both image and video denoising.

CUDA Denoiser Library Features

  • Input format: 8/10/12/14/16-bit per channel input data array from CPU or GPU memory
  • Output format: 24/48-bit output data array in CPU or GPU memory
  • Denoising with 16/32-bit accuracy
  • High speed denoiser without AI
  • Denoiser algorithms
    • Wavelet denoiser (RAW and RGB): CDF 5/3 and CDF 9/7 with hard, soft, or garrote thresholding
    • Bilateral denoiser
    • NLM (Non-Local Means) denoiser
  • Compatibility with Fast VCR software for machine vision cameras
  • Timing and performance measurements
  • OS: Windows 10/11, Ubuntu Linux, and L4T (Jetson)
  • Compatibility with NVIDIA GPUs (Jetson, GeForce, Quadro, Tesla) with compute capability 5.0 or higher, CUDA 12.6
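
The three thresholding rules named in the feature list differ only in how they shrink a wavelet coefficient. The following is a minimal CPU sketch in Python of the textbook definitions, not the CUDA implementation from the library; the function names and the sample coefficients are hypothetical and chosen for illustration only.

```python
import math


def hard_threshold(w: float, t: float) -> float:
    """Keep the coefficient unchanged if its magnitude exceeds t, else zero it."""
    return w if abs(w) > t else 0.0


def soft_threshold(w: float, t: float) -> float:
    """Shrink the magnitude by t; coefficients within [-t, t] become zero."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def garrote_threshold(w: float, t: float) -> float:
    """Non-negative garrote: w - t^2 / w above the threshold, zero otherwise."""
    return w - (t * t) / w if abs(w) > t else 0.0


# Hypothetical detail coefficients and a threshold of 80 (illustrative values).
coeffs = [-200.0, 10.0, 100.0, 160.0]
t = 80.0
print([hard_threshold(c, t) for c in coeffs])     # [-200.0, 0.0, 100.0, 160.0]
print([soft_threshold(c, t) for c in coeffs])     # [-120.0, 0.0, 20.0, 80.0]
print([garrote_threshold(c, t) for c in coeffs])  # [-168.0, 0.0, 36.0, 120.0]
```

Hard thresholding leaves large coefficients untouched, soft thresholding biases all of them toward zero by t, and the garrote rule sits between the two: it behaves like soft thresholding near t and approaches hard thresholding for large coefficients.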

Benchmarks for CUDA Denoiser

Image resolution: 4112×2176 (8.9 MPix), 16-bit per channel, RGB

Test description: all data in GPU memory, timing includes CUDA computations only

2D Wavelet transform: CDF 9/7
Number of DWT resolution levels: up to 7
DWT thresholds for YCbCr: 80;150;150
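
To illustrate what one DWT resolution level does, here is a minimal Python sketch of a single 1D level of the reversible integer CDF 5/3 lifting transform. It is an assumption-laden illustration, not the library's code: the benchmark uses CDF 9/7, which follows the same lifting pattern with more steps and floating-point coefficients, and a 2D transform applies the 1D step to rows and columns. The function names are hypothetical.

```python
def cdf53_forward(x):
    """One level of the integer CDF 5/3 lifting transform (even-length input)."""
    n = len(x)
    assert n >= 2 and n % 2 == 0, "this sketch handles even-length signals only"
    half = n // 2
    even, odd = x[0::2], x[1::2]
    # Predict: detail = odd sample minus the floor-average of its even neighbours.
    # The missing right neighbour at the border is mirrored (symmetric extension).
    d = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
         for i in range(half)]
    # Update: approximation = even sample plus a rounded quarter of nearby details.
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    return s, d


def cdf53_inverse(s, d):
    """Undo cdf53_forward exactly by reversing the lifting steps."""
    half = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    odd = [d[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
           for i in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x


signal = [10, 12, 14, 16, 50, 18, 20, 22]  # hypothetical samples with one spike
approx, detail = cdf53_forward(signal)
assert cdf53_inverse(approx, detail) == signal  # lifting is perfectly invertible
```

A wavelet denoiser thresholds the detail coefficients between the forward and inverse steps, and repeating the forward step on the approximation band yields the next resolution level.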

NLM denoiser parameters: blur window 3×3 or larger, search window 3×3 or larger, strength 1-3000
The algorithm can work with internal 4:4:4 or 4:2:0 subsampling
NLM can also use independent denoising parameters for the Y and Cb/Cr channels in both 4:2:0 and 4:4:4 subsampling modes
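
The roles of the NLM parameters can be sketched with a plain Python reference of the classic Non-Local Means algorithm on a single channel. This is an illustrative CPU sketch, not the CUDA kernel: we assume here that the "blur window" corresponds to the patch being compared, that the "search window" is the neighbourhood scanned for similar patches, and that "strength" plays the role of the filtering parameter h. The exact mapping in the library may differ, and all names below are hypothetical.

```python
import math


def nlm_denoise(img, patch_radius=1, search_radius=2, h=10.0):
    """Non-local means on a 2D list of floats with clamped (replicated) borders.

    patch_radius=1 gives a 3x3 patch, search_radius=2 gives a 5x5 search window.
    """
    rows, cols = len(img), len(img[0])

    def px(r, c):
        # Clamp coordinates so patches near the border reuse edge pixels.
        return img[min(max(r, 0), rows - 1)][min(max(c, 0), cols - 1)]

    def patch_dist(r0, c0, r1, c1):
        # Mean squared difference between the patches centred at the two pixels.
        total, count = 0.0, 0
        for dr in range(-patch_radius, patch_radius + 1):
            for dc in range(-patch_radius, patch_radius + 1):
                diff = px(r0 + dr, c0 + dc) - px(r1 + dr, c1 + dc)
                total += diff * diff
                count += 1
        return total / count

    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            weight_sum, value_sum = 0.0, 0.0
            for sr in range(r - search_radius, r + search_radius + 1):
                for sc in range(c - search_radius, c + search_radius + 1):
                    # Similar patches get weights near 1, dissimilar ones near 0.
                    w = math.exp(-patch_dist(r, c, sr, sc) / (h * h))
                    weight_sum += w
                    value_sum += w * px(sr, sc)
            out[r][c] = value_sum / weight_sum
    return out


# Hypothetical 7x7 flat patch with one noisy pixel in the centre.
image = [[100.0] * 7 for _ in range(7)]
image[3][3] = 130.0
result = nlm_denoise(image, patch_radius=1, search_radius=2, h=50.0)
print(round(result[3][3], 2))  # pulled from 130 toward the surrounding 100
```

A larger h averages more aggressively, while a larger search window finds more candidate patches at a cost that grows with the window area. Independent Y and Cb/Cr parameters would correspond to calling such a routine per channel with different settings.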

NLM denoiser parameters for testing: blur window 3×3, search window 5×5, strength 500
Bilateral denoiser parameters for testing: diameter 3, sigmaColor 5, sigmaSpace 500
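
The bilateral parameters listed above can be read against the standard bilateral filter definition, sketched below in plain Python for one channel. This is a CPU illustration under stated assumptions, not the library's CUDA kernel: we assume diameter is the side of the square neighbourhood, sigma_color controls how quickly the weight falls off with intensity difference, and sigma_space controls the falloff with pixel distance. The function name and test data are hypothetical.

```python
import math


def bilateral_denoise(img, diameter=3, sigma_color=5.0, sigma_space=500.0):
    """Bilateral filter on a 2D list of floats with clamped borders."""
    rows, cols = len(img), len(img[0])
    radius = diameter // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            center = img[r][c]
            weight_sum, value_sum = 0.0, 0.0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr = min(max(r + dr, 0), rows - 1)
                    cc = min(max(c + dc, 0), cols - 1)
                    value = img[rr][cc]
                    # Spatial term: penalises distance from the centre pixel.
                    spatial = (dr * dr + dc * dc) / (2.0 * sigma_space ** 2)
                    # Range term: penalises intensity difference, preserving edges.
                    tonal = (value - center) ** 2 / (2.0 * sigma_color ** 2)
                    w = math.exp(-(spatial + tonal))
                    weight_sum += w
                    value_sum += w * value
            out[r][c] = value_sum / weight_sum
    return out


# Hypothetical step edge: small noise is smoothed, the 0 -> 100 edge survives.
edge = [[0.0, 0.0, 2.0, 100.0, 100.0, 100.0] for _ in range(4)]
filtered = bilateral_denoise(edge, diameter=3, sigma_color=5.0, sigma_space=500.0)
print([round(v, 2) for v in filtered[1]])
```

With a 3×3 neighbourhood and a very large sigma_space, the spatial term is almost constant, so the filter behaves like a small box average that skips pixels whose intensity differs strongly from the centre.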

Software: Windows 10/11, CUDA 12.6
Hardware: NVIDIA GeForce RTX 4090

  • RAW DWT denoiser - 1.8 ms (4.9 GPix/s)
  • DWT denoiser (YCbCr, 4:4:4) - 3.05 ms (2.9 GPix/s)
  • NLM denoiser (RGB) - 0.19 ms (40 GPix/s)
  • NLM denoiser (YCbCr, 4:2:0) - 0.20 ms (40 GPix/s)
  • NLM denoiser (YCbCr, 4:4:4) - 0.37 ms (21 GPix/s)
  • Bilateral denoiser (RGB) - 0.13 ms (61 GPix/s)
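
The throughput figures relate to the timings through the pixel count of the test image. As a small worked example in Python (the helper name is hypothetical), the RAW DWT entry can be reproduced as follows; recomputed values for other entries may differ from the listed figures depending on how the timings were rounded or measured.

```python
def throughput_gpix_per_s(width: int, height: int, time_ms: float) -> float:
    """Pixels processed per second, in GPix/s, for one frame taking time_ms."""
    pixels = width * height
    return pixels / (time_ms * 1e-3) / 1e9


# 4112 x 2176 test image, RAW DWT denoiser at 1.8 ms per frame.
print(f"{throughput_gpix_per_s(4112, 2176, 1.8):.2f} GPix/s")  # 4.97 GPix/s
```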

These timings are comparable to the processing time of our best MG debayer algorithm, which takes about 0.6 ms (13 GPix/s) for the same image on this GPU. Previously, our denoisers were much slower than this demosaicing algorithm.

We have developed this software as part of our GPU Image & Video Processing SDK. Now our customers can use these CUDA-accelerated denoisers in their applications as part of their image processing pipeline.

Testing

To test our CUDA denoiser filters, please download the Fast VCR software, which works not only with machine vision cameras in real time but also with RAW images stored on SSD. This lets you evaluate image quality and performance on real data.

CUDA denoiser roadmap

  • Acceleration of Bilateral denoiser - done
  • YUV denoiser filter on CUDA for FFmpeg - in progress
  • Temporal denoiser filter on CUDA - in progress
