Fast FFmpeg J2K decoder on NVIDIA GPU

Author: Fyodor Serzhenko

FFmpeg is a great piece of software that offers a huge number of options for image and video processing and for handling multimedia files and streams. It supports many formats, codecs and filters for various tasks, which is why it is so widespread. Many applications are based on FFmpeg, and their flexibility and performance are really impressive. FFmpeg itself is a command-line application that is also capable of video transcoding and video post-production. The name FFmpeg comes from the MPEG video standards group, combined with "FF", which stands for "fast forward".

To get started with FFmpeg, the user needs to download the software from ffmpeg.org or zeranoe.com. To build a custom solution, the user has to download the source code of the latest version from Git and build FFmpeg with all the necessary options.

How can FFmpeg decode J2K?

To start, let us answer a very simple question: which JPEG2000 codec does FFmpeg use by default? Surprisingly, it is not the OpenJPEG codec. FFmpeg has its own J2K codec. The FFmpeg documentation says the following: "The native jpeg2000 encoder is lossy by default, the -q:v option can be used to set the encoding quality. Lossless encoding can be selected with -pred 1".

The native codec is not a good choice, so we can install the OpenJPEG library (libopenjpeg) and use it as the FFmpeg codec for J2K encoding and decoding on CPU. OpenJPEG is a quite reliable and sophisticated solution with a wide set of features from the JPEG2000 standard. It is a very interesting product, but it works on CPU only. Since the J2K algorithm has very high computational complexity, OpenJPEG is still very slow, even after its recent boost from optimization and multithreading. Here you can see JPEG2000 benchmarks on CPU and GPU for J2K encoding and decoding with the OpenJPEG, Jasper, Kakadu, J2k-Codec, CUJ2K and Fastvideo codecs, which show the performance for images with 2K and 4K resolutions (for both lossy and lossless algorithms).

How does FFmpeg work internally?

FFmpeg is built around the idea of consecutive software modules that are applied to your data. Since most FFmpeg codecs and filters work on CPU, both the input and the output of each processing module reside in CPU memory, although FFmpeg can now also work with the GPU-based NVENC encoder and NVDEC decoder on NVIDIA GPUs. These NVIDIA codecs support H.264, H.265 and more.

To create a conventional FFmpeg codec for fast J2K encoding or decoding on GPU, we took into account the architecture of FFmpeg applications and FFmpeg codecs. We have implemented an FFmpeg J2K decoder that works on GPU with a batch of images in multithreaded mode to achieve maximum performance. Externally it looks like a conventional decoder with internal multithreading. This J2K decoder can now be used in FFmpeg and included in an FFmpeg processing workflow as part of any complicated task.

This FFmpeg decoder is fully based on the Fastvideo J2K decoder, which is implemented on NVIDIA GPU. It can be used in many FFmpeg applications in a standard way. To follow that standard approach, the user just needs to build FFmpeg with the J2K library.

How to build FFmpeg with Fastvideo J2K decoder

1. Download the FFmpeg source for Ubuntu. Version 4.2.4 was used for testing on Ubuntu 18.04.
https://ffmpeg.org/download.html#get-sources

You can retrieve the source code through Git by using the following command:
git clone https://git.ffmpeg.org/ffmpeg.git ffmpeg

To get the Fastvideo SDK, please send your request via the form at the bottom of this page.

This is the link to download the test video file: SNOWMAN-DCP3D.rar

2. Install the NVENC headers with install_nvenc.sh. The NVIDIA driver version is 440.33.01 and the NVENC version is 9.1.23.
This is the FFmpeg version of the headers required to interface with NVIDIA's codec APIs.
It corresponds to NVIDIA Video Codec SDK version 9.1.23.

3. Install the yasm package for the FFmpeg build.

4. Copy the fastvideo_sdk folder (including the inc and lib folders) to the root of the FFmpeg source folder.
Copy make_sl.sh from the root of the archive to fastvideo_sdk/lib and execute it to create all symbolic links for *.so.

5. Update the following files:

- libavcodec/allcodecs.c
extern AVCodec ff_jpeg2000_cuda_decoder;

- libavcodec/avcodec.h
After AV_CODEC_ID_JPEG2000,
Insert AV_CODEC_ID_JPEG2000_CUDA,

- libavcodec/codec_desc.c
After
{
.id = AV_CODEC_ID_JPEG2000,
.type = AVMEDIA_TYPE_VIDEO,
.name = "jpeg2000",
.long_name = NULL_IF_CONFIG_SMALL("JPEG 2000"),
.props = AV_CODEC_PROP_INTRA_ONLY | AV_CODEC_PROP_LOSSY | AV_CODEC_PROP_LOSSLESS,
.mime_types= MT("image/jp2"),
.profiles = NULL_IF_CONFIG_SMALL(ff_jpeg2000_profiles),
},

Insert
{
.id = AV_CODEC_ID_JPEG2000_CUDA,
.type = AVMEDIA_TYPE_VIDEO,
.name = "jp2k_cuda",
.long_name = NULL_IF_CONFIG_SMALL("JPEG 2000 (Fastvideo)"),
.props = AV_CODEC_PROP_INTRA_ONLY | AV_CODEC_PROP_LOSSY | AV_CODEC_PROP_LOSSLESS,
.mime_types= MT("image/jp2"),
.profiles = NULL_IF_CONFIG_SMALL(ff_jpeg2000_cuda_profiles),
},

- libavcodec/profiles.h
extern const AVProfile ff_jpeg2000_cuda_profiles[];

- libavcodec/profiles.c
After
const AVProfile ff_jpeg2000_profiles[] = {
{ FF_PROFILE_JPEG2000_CSTREAM_RESTRICTION_0, "JPEG 2000 codestream restriction 0"},
{ FF_PROFILE_JPEG2000_CSTREAM_RESTRICTION_1, "JPEG 2000 codestream restriction 1"},
{ FF_PROFILE_JPEG2000_CSTREAM_NO_RESTRICTION, "JPEG 2000 no codestream restrictions"},
{ FF_PROFILE_JPEG2000_DCINEMA_2K, "JPEG 2000 digital cinema 2K"},
{ FF_PROFILE_JPEG2000_DCINEMA_4K, "JPEG 2000 digital cinema 4K"},
{ FF_PROFILE_UNKNOWN },
};

Insert
const AVProfile ff_jpeg2000_cuda_profiles[] = {
{ FF_PROFILE_JPEG2000_CSTREAM_RESTRICTION_0, "JPEG 2000 codestream restriction 0"},
{ FF_PROFILE_JPEG2000_CSTREAM_RESTRICTION_1, "JPEG 2000 codestream restriction 1"},
{ FF_PROFILE_JPEG2000_CSTREAM_NO_RESTRICTION, "JPEG 2000 no codestream restrictions"},
{ FF_PROFILE_JPEG2000_DCINEMA_2K, "JPEG 2000 digital cinema 2K"},
{ FF_PROFILE_JPEG2000_DCINEMA_4K, "JPEG 2000 digital cinema 4K"},
{ FF_PROFILE_UNKNOWN},
};

- libavcodec/Makefile
After
OBJS-$(CONFIG_JPEG2000_DECODER) += jpeg2000dec.o jpeg2000.o jpeg2000dsp.o \
Insert
jpeg2000dec_cuda.o \

Update the following files to install the Fastvideo resizer for 10/16-bit video:

- libavfilter/allfilters.c
extern AVFilter ff_vf_scale_fastvideo;

- libavfilter/Makefile
OBJS-$(CONFIG_SCALE_FASTVIDEO_FILTER) += vf_scale_fastvideo.o

6. Copy the src/include folder to fastvideo_sdk/inc. Copy the src/libavcodec folder to libavcodec to install the J2K decoder from Fastvideo. Copy the src/libavfilter folder to libavfilter to install the resize filter.

7. Configure FFmpeg with the minimum set of options listed below. This list can be extended by the end user. The CUDA path is the default one for version 10.1.
./configure --cc="gcc -m64" --enable-ffplay --enable-ffmpeg --disable-doc --enable-shared --disable-static --enable-cuda --enable-cuvid --enable-nvenc --enable-nonfree --enable-libnpp --prefix=./bin/ --arch=amd64 --extra-cflags="-MD -I/usr/local/cuda-10.1/include/ -I./fastvideo_sdk/inc/" --extra-ldflags="-L/usr/local/cuda-10.1/lib64/ -L./fastvideo_sdk/lib/" --extra-libs="-lcudart -lfastvideo_sdk -lfastvideo_j2kFfmpegWrapper -lfastvideo_decoder_j2k"

Or you can copy default.configure.sh from the root of the archive to the root of the FFmpeg folder and run it.

8. make

9. make install

10. Update LD_LIBRARY_PATH for the FFmpeg and Fastvideo libraries.

Or copy export.library.path.sh from the root of the archive to the FFmpeg folder and run it. The script prints an export LD_LIBRARY_PATH command with the correct path.

11. Copy the video folder to the bin folder of FFmpeg and run run.snowman.sh to test.
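If the build fails, it may help to confirm that all the source edits from step 5 are in place. The following is a minimal sketch of such a check. The function name check_fv_patch is hypothetical and is not part of the Fastvideo archive. It only searches each file for the identifier that step 5 adds, so it does not verify where the lines were inserted.

```shell
# Hypothetical helper (not shipped with FFmpeg or the Fastvideo SDK).
# Checks that every file edited in step 5 contains the identifier added there.
# Usage: check_fv_patch /path/to/ffmpeg-source
check_fv_patch() {
    root=$1
    status=0
    # Each record below is "file|fixed string expected in that file".
    while IFS='|' read -r file needle; do
        if grep -qF -- "$needle" "$root/$file" 2>/dev/null; then
            echo "ok       $file"
        else
            echo "MISSING  $file: $needle"
            status=1
        fi
    done <<'EOF'
libavcodec/allcodecs.c|ff_jpeg2000_cuda_decoder
libavcodec/avcodec.h|AV_CODEC_ID_JPEG2000_CUDA
libavcodec/codec_desc.c|AV_CODEC_ID_JPEG2000_CUDA
libavcodec/profiles.h|ff_jpeg2000_cuda_profiles
libavcodec/profiles.c|ff_jpeg2000_cuda_profiles
libavcodec/Makefile|jpeg2000dec_cuda.o
libavfilter/allfilters.c|ff_vf_scale_fastvideo
libavfilter/Makefile|vf_scale_fastvideo.o
EOF
    # Non-zero exit status if at least one edit is missing.
    return "$status"
}
```

Running check_fv_patch . from the root of the FFmpeg source folder prints one line per file and returns a non-zero status if any edit is missing.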

Fastvideo J2K decoder parameters: threads and fv_batch_size

The Fastvideo J2K decoder on GPU for FFmpeg has two additional parameters that influence performance: -threads and -fv_batch_size.

The "threads" parameter is a standard FFmpeg parameter. It defines the number of concurrent CPU threads used for processing. This option is available for the J2K decoder with frame-level multithreading.

The "fv_batch_size" parameter defines the number of frames processed by one decoder in parallel. FFmpeg supports batch mode only for a single decoder, not for multiple decoders. This is not enough for the Fastvideo J2K decoder to reach the best performance.

To work around this limitation, the Fastvideo JPEG2000 decoder for FFmpeg uses an internal client-server architecture. A client is an FFmpeg worker thread that takes the bytestream from FFmpeg and sends it to a J2K decoder. The number of real J2K decoders is the number of FFmpeg worker threads divided by the batch size. Therefore, the number of worker threads has to be divisible by the batch size (fv_batch_size).

To get the best performance, the batch size (fv_batch_size) has to be at least 4 and the number of worker threads has to be at least 8. This results in two real J2K decoders. Increasing fv_batch_size and the number of J2K decoders makes fuller use of GPU memory.
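The relation between the two parameters can be written down as a small check. This is only a sketch: the function name fv_decoder_count is hypothetical and is not part of FFmpeg or the Fastvideo SDK. It applies the rule described above, that the number of real decoders is threads divided by fv_batch_size, and that the division must be exact.

```shell
# Hypothetical helper: prints the number of real J2K decoders for a given
# -threads / -fv_batch_size pair, or fails if the pair is not valid.
fv_decoder_count() {
    threads=$1
    batch=$2
    if [ "$threads" -le 0 ] || [ "$batch" -le 0 ]; then
        echo "threads and fv_batch_size must be positive" >&2
        return 1
    fi
    # The number of worker threads has to be divisible by the batch size.
    if [ $((threads % batch)) -ne 0 ]; then
        echo "threads ($threads) must be divisible by fv_batch_size ($batch)" >&2
        return 1
    fi
    echo $((threads / batch))
}

fv_decoder_count 8 4   # prints 2
fv_decoder_count 4 2   # prints 2
```

A call such as fv_decoder_count 6 4 returns a non-zero status, because 6 worker threads cannot be split evenly into batches of 4.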

Fast J2K transcoding with FFmpeg and NVENC from MXF to MP4

These are example command lines that decode the snowman.mxf file from the current folder and create an MP4 video file with H.264 or H.265 encoding in the same pipeline, using different sets of parameters:


#source: XYZ 12 bit -> dest: h265 444 10 bits

  ./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 1 -fv_batch_size 1 -i snowman.mxf -c:v hevc_nvenc -b:v 5M out.hevc.444.10bits.mp4
#source: XYZ 12 bit -> dest: h265 444 10 bits

  ./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -i snowman.mxf -c:v hevc_nvenc -b:v 5M out.hevc.444.10bits.mp4
#source: XYZ 12 bit -> dest: h265 420 10 bits

  ./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -fv_convert_to_420 1 -i snowman.mxf -c:v hevc_nvenc -b:v 5M out.hevc.420.10bits.mp4
#source: XYZ 12 bit -> dest: h265 444 8 bits

  ./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -fv_convert_to_8bit 1 -i snowman.mxf -c:v hevc_nvenc -b:v 5M out.hevc.444.8bits.mp4
#source: XYZ 12 bit -> dest: h265 420 8 bits

  ./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -fv_convert_to_8bit 1 -fv_convert_to_420 1 -i snowman.mxf -c:v hevc_nvenc -b:v 5M out.hevc.420.8bits.mp4
#source: XYZ 12 bit -> dest: h264 444 8 bits

  ./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -fv_convert_to_8bit 1 -i snowman.mxf -c:v h264_nvenc -b:v 5M out.h264.444.8bits.mp4
#source: XYZ 12 bit -> dest: h264 420 8 bits

  ./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -threads 4 -fv_batch_size 2 -fv_convert_to_8bit 1 -fv_convert_to_420 1 -i snowman.mxf -c:v h264_nvenc -b:v 5M out.h264.420.8bits.mp4

Basically, we read and parse frames from the snowman.mxf file, decode them on the GPU with id = 0 (fv_batch_size = 2, four CPU threads), encode that stream at 5 Mbit/s and save it to an *.mp4 file in the current folder.

The values of fv_batch_size and the number of threads depend on the amount of free GPU memory. If the chosen parameters are too big, the user will get a warning to reduce fv_batch_size or to use a better GPU with more memory.
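Because the best values depend on the GPU, one practical approach is to try several combinations and compare the timings. The sketch below prints (it does not run) one transcoding command per combination and always keeps the thread count a multiple of fv_batch_size. The function name fv_sweep_commands, the chosen values and the output file names are illustrative assumptions. The ffmpeg options themselves are the ones used in the examples above.

```shell
# Hypothetical helper: prints one ffmpeg command line per threads/batch pair.
# Nothing is executed; review the lines and run the ones you want to time.
fv_sweep_commands() {
    input=$1
    for batch in 1 2 4; do
        for decoders in 1 2; do
            # threads = batch size * number of real decoders, so it is
            # always divisible by fv_batch_size.
            threads=$((batch * decoders))
            echo "./ffmpeg -y -c:v jp2k_cuda -threads $threads -fv_batch_size $batch -i $input -c:v hevc_nvenc -b:v 5M out.t$threads.b$batch.mp4"
        done
    done
}

fv_sweep_commands snowman.mxf
```

If a combination needs more GPU memory than is available, the decoder is expected to report the warning described above, and that pair can be dropped from the sweep.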

Simple benchmarks for FFmpeg J2K transcoding to H.264 on GPU

The task of J2K transcoding to H.264 is quite common, though it is not possible to get realtime performance with the OpenJPEG codec from FFmpeg. The Fastvideo JPEG2000 decoder together with NVIDIA NVENC can solve the full task of J2K transcoding on GPU, and it is much faster than realtime. The resulting performance depends on many factors, so here we just indicate a standard case:

According to our preliminary benchmarks on NVIDIA GeForce RTX 2080 Ti and on Quadro RTX 6000, such a solution can transcode MXF (J2K, 10-bit, 4:2:2, 200 Mbit/s, TR-01 compliant) or TS files/streams to MP4 (H.264, 15 Mbit/s) at around 320-350 fps. All processing is done on GPU (apart from audio processing, which is done on CPU), both J2K decoding (Fastvideo J2K codec) and H.264 encoding (NVENC).
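To put these numbers in perspective, the measured throughput can be converted into a rough count of realtime streams. This is a back-of-the-envelope sketch. The function name realtime_streams is hypothetical, and the calculation assumes that throughput divides evenly among streams with no per-stream overhead, which the benchmark above did not measure.

```shell
# Hypothetical helper: how many streams of a given frame rate fit into a
# measured aggregate throughput (integer division, rounded down).
realtime_streams() {
    measured_fps=$1
    stream_fps=$2
    echo $((measured_fps / stream_fps))
}

realtime_streams 320 25   # prints 12
realtime_streams 350 30   # prints 11
```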

Fast J2K decoding with FFmpeg from MXF to RGB or YUV frames

The Fastvideo J2K decoder supports multiple output formats: NV12, P010, YUV444, YUV444P10, RGB24 and RGB48. Formats NV12, P010, YUV444 and YUV444P10 are native for NVENC. By default, a decoded frame is placed into a device buffer that cannot be consumed by most FFmpeg filters and codecs. The parameter -fv_export_to_host 1 forces the J2K decoder to place a frame into a host buffer. The device buffer is used for NVENC to avoid additional device-to-host and host-to-device copies. The host buffer is used for integration with other FFmpeg codecs and filters. Formats NV12, P010, YUV444 and YUV444P10 support both buffer types. Formats RGB24 and RGB48 support only the host buffer type.

NV12 is a native NVENC format. In contrast to classic YUV420, it contains an interleaved UV plane. P010 is the 16-bit-per-element version of the NV12 format.
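The difference between these formats becomes clearer when we look at the size of one frame in memory. The sketch below computes it for each output format. The function name frame_bytes is hypothetical, and the numbers assume tightly packed frames without row padding, with 10-bit samples stored in 16-bit words. Actual buffers from the decoder may be padded.

```shell
# Hypothetical helper: bytes per frame for a given output format,
# assuming tightly packed data (no row pitch / alignment padding).
frame_bytes() {
    fmt=$1
    w=$2
    h=$3
    case "$fmt" in
        # 8-bit luma plane plus interleaved UV plane at quarter resolution
        NV12)            echo $((w * h * 3 / 2)) ;;
        # same layout as NV12, but 16 bits per element
        P010)            echo $((w * h * 3)) ;;
        # three 8-bit samples per pixel
        YUV444|RGB24)    echo $((w * h * 3)) ;;
        # three 16-bit samples per pixel
        YUV444P10|RGB48) echo $((w * h * 6)) ;;
        *) echo "unknown format: $fmt" >&2; return 1 ;;
    esac
}

frame_bytes NV12 2048 1080        # prints 3317760
frame_bytes P010 2048 1080        # prints 6635520
frame_bytes YUV444P10 2048 1080   # prints 13271040
```

For a 2K frame, NV12 therefore needs a quarter of the memory of YUV444P10, which is relevant when choosing fv_batch_size on a GPU with limited memory.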

These are examples of how we could decode the snowman.mxf file from the current folder and create a series of RGB or YUV images:


#source: XYZ 12 bit -> dest: RGB 8 bits

  ./ffmpeg -y -report -c:v jp2k_cuda -fv_convert_to_8bit 1 -fv_convert_to_rgb 1 -fv_export_to_host 1 -threads 1 -fv_batch_size 1 -ss 00:00:09 -i snowman.mxf out%d.8.t1.b1.ppm
#source: XYZ 12 bit -> dest: YUV 444 10 bits

  ./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -fv_export_to_host 1 -threads 1 -fv_batch_size 1 -ss 00:00:09 -i snowman.mxf -c rawvideo -f segment -segment_time 0.01 out%d.yuv444.10.yuv
#source: XYZ 12 bit -> dest: YUV 444 8 bits

  ./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -fv_export_to_host 1 -fv_convert_to_8bit 1 -threads 1 -fv_batch_size 1 -ss 00:00:09 -i snowman.mxf -c rawvideo -f segment -segment_time 0.01 out%d.yuv444.8.yuv
#source: XYZ 12 bit -> dest: YUV P010 

  ./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -fv_export_to_host 1 -fv_convert_to_420 1 -threads 1 -fv_batch_size 1 -ss 00:00:09 -i snowman.mxf -c rawvideo -f segment -segment_time 0.01 out%d.P010.yuv
#source: XYZ 12 bit -> dest: YUV NV12

  ./ffmpeg -y -report -loglevel debug -c:v jp2k_cuda -fv_export_to_host 1 -fv_convert_to_420 1 -fv_convert_to_8bit 1 -threads 1 -fv_batch_size 1 -ss 00:00:09 -i snowman.mxf -c rawvideo -f segment -segment_time 0.01 out%d.NV12.yuv

We may need such a solution if the final video encoding is done on CPU. If we compare CPU-based H.264 or H.265 encoding performance with J2K decoding on GPU, we can see that J2K decoding is much faster, so we could decode multiple streams on GPU and then encode them on CPU. Usually we need one CPU thread per stream for encoding, so a multicore CPU is a must here. This is actually a live transcoding task where we can combine GPU and CPU to build a high-performance solution.

If we don't have an output format that you need, please let us know and we will add it. We do both J2K decoding and format conversions on GPU to improve the total performance. This is very important when processing multiple streams.

Fast J2K decoding with FFmpeg for MXF Player

If you have ever tried to play MXF or TS files (150-200 Mbit/s) with J2K content in VLC player, you probably know the result. Unfortunately, any high-bitrate MXF or TS video with J2K frames is too demanding for CPU-based VLC software, and you could hardly achieve playback at 1 fps, which is not acceptable.

A DCP package contains J2K frames, and it is quite difficult to offer a smooth preview of that content on CPU via VLC. Now you can decode J2K frames on GPU and show the results via ffplay or any other player that is connected to FFmpeg output.

Apart from J2K decoding on GPU, we also have an FFmpeg-based J2K encoder on GPU, which can be used to create DCP with very high performance. It is much faster than OpenJPEG.