Benchmarks for J2K decoders on CPU and GPU
Below we provide benchmarks for the Fastvideo JPEG2000 Decoder on CPU/GPU in comparison with other freely available open-source J2K decoding solutions on CPU.
Approaches for JPEG2000 performance measurements
There are two standard approaches to measuring the performance of GPU-based JPEG2000 codecs. They correspond to the two most common use cases for J2K decoders.
1. Single image mode means processing one image at a time; it could be called the "latency-oriented" or "low latency" approach. In this case we measure the time interval (latency) between the moment the original image is available in RAM and the moment the processed image is available in RAM. The software cannot expect any additional images to be processed at the same time and therefore cannot take advantage of multiple image decoding. Overlapping the processing of the current image with other activities is undesirable because it would increase total latency.
2. Batch mode means processing a batch of images; it could be called the "throughput-oriented" or "maximum performance" approach. In this case frame rate becomes more important than latency. Frame rate is calculated by dividing the number of processed images by the total processing time. Some JPEG2000 codecs are optimized for this use case: exploiting task parallelism gives a better frame rate (throughput) at the expense of a longer processing time for each individual image. This is possible because there are actually three devices (CPU, GPU and the bus interface between them) that can work simultaneously in this mode, whereas in single image mode they are used sequentially for the different stages of the JPEG2000 algorithm. Moreover, the GPU can process several images simultaneously to increase frame rate even further when a single image is too small for the decoder to keep a large number of GPU cores busy (especially at the Tier-1 stage). The amount of free GPU memory imposes an important limit on how many images can be processed simultaneously. Batch mode is a must for streaming applications whose pipeline contains a JPEG2000 decoder. For more complicated workflows it could be better to use single image mode, though fps will be reduced.
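The trade-off between the two modes can be illustrated with a simplified timing model. The sketch below assumes that every image passes through three stages (CPU, bus transfer, GPU) with fixed durations and that the stages overlap perfectly in batch mode. The function names and the stage times are hypothetical values chosen for illustration; they are not measurements of any decoder.

```python
def single_image_latency_ms(stage_times_ms):
    """Single image mode: stages run one after another, so latency is their sum."""
    return sum(stage_times_ms)


def batch_frame_rate(stage_times_ms, num_images):
    """Batch mode under an idealized pipeline model (an assumption).

    The first image needs the full sum of stage times; after that, one image
    completes every time the slowest stage finishes.
    """
    bottleneck_ms = max(stage_times_ms)
    total_ms = sum(stage_times_ms) + (num_images - 1) * bottleneck_ms
    # Frame rate = number of processed images / total processing time.
    return 1000.0 * num_images / total_ms


# Hypothetical stage times in milliseconds: CPU work, bus transfer, GPU work.
stages = [10.0, 2.0, 20.0]

latency = single_image_latency_ms(stages)   # 32.0 ms
single_fps = 1000.0 / latency
batch_fps = batch_frame_rate(stages, num_images=100)

print(f"single image mode: {latency:.1f} ms latency, {single_fps:.2f} fps")
print(f"batch mode (idealized): {batch_fps:.2f} fps")
```

In this model the batch frame rate approaches the reciprocal of the slowest stage as the batch grows, while the latency of each individual image is never shorter than in single image mode.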
In brief, a JPEG2000 decoder in batch mode can exploit specific forms of task parallelism, based on the following:
CPU-based JPEG2000 solutions have no explicit implementation of batch mode, because all processing stages run on the CPU and the available CPU cores can be fully loaded simply by running multiple decoders in separate processes. The multithreaded mode of CPU-based JPEG2000 decoders decreases the latency of single image processing, so we treat it as single image mode.
At the moment we don't consider here the following possible modes for JPEG 2000 benchmarking on GPU:
Results for all these modes will be published as soon as their implementations are ready.
We don't hide anything about our benchmarking procedures or the results we achieved. Everyone can reproduce our benchmarks, because we publish not only timing and performance figures but also full information about the hardware, JPEG2000 parameters, test images and testing modes.
JPEG 2000 decoding benchmarks
We've carried out time and performance measurements of JPEG2000 decoding for 24-bit images at 2K and 4K resolutions. None of the results include host I/O latency (loading the image from HDD/SSD to RAM and saving it back), and we've also excluded host-to-device transfer time. We made this assumption to reproduce the way the J2K decoder is used in our conventional image processing pipeline, where decompressed data reside in GPU memory. Results for GPU-based JPEG2000 decoding software also include Tier-2 time on the CPU, because in our implementation this stage is performed on the CPU. The tables below show averaged results for the best series of 100 measurements.
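The measurement procedure can be sketched as a small timing harness. This is a minimal sketch under stated assumptions: "best series" is interpreted here as the series with the lowest mean time, the number of series (three) is arbitrary, and `placeholder_decode` is a hypothetical stand-in workload rather than a real decoder call.

```python
import statistics
import time


def best_series_mean(run_once, series_count=3, runs_per_series=100):
    """Return the mean time in seconds of the fastest series.

    Assumption: the "best" series is the one with the lowest mean time.
    """
    series_means = []
    for _ in range(series_count):
        samples = []
        for _ in range(runs_per_series):
            start = time.perf_counter()
            run_once()
            samples.append(time.perf_counter() - start)
        series_means.append(statistics.mean(samples))
    return min(series_means)


def placeholder_decode():
    # Hypothetical stand-in for one decode call; replace with the decoder under test.
    sum(i * i for i in range(1000))


mean_s = best_series_mean(placeholder_decode)
print(f"best series: {mean_s * 1000:.3f} ms per call on average")
```

Because only the decode call sits between the two timer reads, file loading and saving stay outside the measured interval, in line with the exclusions described above.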
Hardware and software
JPEG2000 Decoders for comparison
J2K decoding at single image mode for 2K image with lossy compression: 2k_wild_lossy.jp2 (1920×1080, 4:4:4, 24-bit)
J2K decoding at single image mode for 4K image with lossy compression: 4k_wild_lossy.jp2 (3840×2160, 4:4:4, 24-bit)
MB/s – MegaBytes per second
J2K decoding at single image mode for 2K image with lossless compression: 2k_wild_lossless.jp2 (1920×1080, 4:4:4, 24-bit)
J2K decoding at single image mode for 4K image with lossless compression: 4k_wild_lossless.jp2 (3840×2160, 4:4:4, 24-bit)
MB/s – MegaBytes per second
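For reference, a throughput figure in MB/s can be derived from image size and decoding time as in the sketch below. Two things here are assumptions, since the text does not specify them: the rate is computed over the uncompressed 24-bit image size, and 1 MB is taken as 10^6 bytes. The 25 ms decode time is a made-up example value, not a benchmark result.

```python
def raw_image_bytes(width, height, bits_per_pixel=24):
    """Size of the uncompressed image in bytes."""
    return width * height * bits_per_pixel // 8


def throughput_mb_per_s(width, height, decode_time_s, bits_per_pixel=24):
    """Throughput over uncompressed data, with 1 MB = 10**6 bytes (assumption)."""
    return raw_image_bytes(width, height, bits_per_pixel) / decode_time_s / 1e6


size_2k = raw_image_bytes(1920, 1080)   # 6,220,800 bytes
size_4k = raw_image_bytes(3840, 2160)   # 24,883,200 bytes

# Hypothetical decode time of 25 ms for a 2K image.
rate_2k = throughput_mb_per_s(1920, 1080, decode_time_s=0.025)
print(f"2K: {size_2k} bytes, {rate_2k:.1f} MB/s at 25 ms per image")
```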
Superior performance of JPEG 2000 decoding at batch mode
For batch mode we've carried out performance measurements of JPEG 2000 decoding with exactly the same parameters as in single image mode. The table below shows averaged results for the best series of measurements (each lasting 10 seconds). None of the results include host I/O latency (loading the image from HDD/SSD to RAM and saving it back).
To get maximum performance in batch mode, we don't need images as large as in single image mode. For example, a 4K image contains 4 times as many pixels as a 2K image, so in batch mode we can expect the decoding time per 2K image to be about one quarter of that for a 4K image. In theory, if single image mode decodes 2K at 36 fps and 4K at 14 fps, then processing four 2K images simultaneously as a batch could reach a frame rate of 14*4=56 fps for 2K. An even higher speedup could be expected with a batch size above 4, since in single image mode a 4K image does not fully occupy the GPU. If we also take into account simultaneous processing on both CPU and GPU, which is possible in batch mode, J2K decoding could be accelerated further.
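The estimate above can be written out as a short calculation. The 36 fps and 14 fps figures are the illustrative values from the text, not measured results, and the assumption that decoding time scales linearly with pixel count is a simplification.

```python
# Pixel counts for the two resolutions used in the benchmarks.
pixels_2k = 1920 * 1080
pixels_4k = 3840 * 2160
pixel_ratio = pixels_4k // pixels_2k          # 4

# Illustrative single image mode frame rates from the text.
fps_2k_single = 36
fps_4k_single = 14

# If four 2K images decode in the time of one 4K image, the 2K frame rate
# in batch mode is the 4K frame rate multiplied by the pixel ratio.
fps_2k_batch_estimate = fps_4k_single * pixel_ratio   # 56
speedup = fps_2k_batch_estimate / fps_2k_single

print(f"estimated 2K batch frame rate: {fps_2k_batch_estimate} fps")
print(f"speedup over single image mode: {speedup:.2f}x")
```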
JPEG2000 decoding benchmarks at batch mode
We have published all information about the time measurements, together with sample images, JPEG2000 parameters and hardware specifications, so that everyone can reproduce our results and check the performance of other J2K decoders under the same test conditions.