Jpeg2Jpeg Acceleration with CUDA MPS on Linux

Author: Fyodor Serzhenko

The task of fast JPEG-to-JPEG Resize is essential for high load web services. Users create most of their images with smart phones and cameras in JPEG format which is the most popular nowadays. To offer high quality services and to cut expences on storages, providers strive to implement JPEG resize on-the-fly to store just one image instead of several dozens in different resolutions.

Solutions for fast JPEG resize, which are also called Jpeg2Jpeg, have been implemented on CPU, GPU, FPGA and on mobile platforms as well. The highest performance for that task was demonstrated on GPU and FPGA, which used to be considered on par for NVIDIA Tesla T4 and Xilinx VCU1525 or Alveo U280 hardware.

Bottlenecks for high performance Jpeg2Jpeg solutions on GPU

Implementation of fast JPEG Resize (Jpeg2Jpeg transform) is quite complicated task and it's not easy to boost highly optimized solution. Nevertheless, we can point out some issues which still could be improved:

better GPU utilization
batch mode implementation
performance optimization of JPEG decoder (this is the slowest part of the pipeline)

In general, GPU can offer super high performance only in the case if there is sufficient amount of data for parallel processing. If we don't have enough data, then GPU occupancy is low and we are far from maximum performance. We have exactly the same issue with the task of JPEG Resize on GPU: usually we have non-sufficient amount of data and GPU occupancy is not high. One way to solve that matter is to implement batch mode.

Batch mode implies that the same algorithm could be applied at the same time to many items which we need to process. This is not exactly the case with JPEG Resize, because it should be possible to implement batch processing for JPEG decoding, though Resize is difficult to include at the same batch as soon as for each image we need to generate and to utilize individual sets of interpolation coefficients. And scaling ratio is usually not the same for all processed images in the batch. That's why batch mode could be limited by JPEG decoding only. If we have a look at existing FPGA-based solutions for Jpeg2Jpeg software, all of them are utilizing batch JPEG decoding and then individual resize and encoding to get better performance.

Optimization of JPEG decoder could be done by taking into account the latest NVIDIA architecture to boost the performance of entropy decoder, which is the most time-consuming part of JPEG algorithm. Apart from entropy decoder optimization, it makes sense to accelerate all other parts or JPEG algorithm.

Finally, we've found the way how to accelerate the current version of JPEG Resize on GPU from Fastvideo Image Processing SDK and this is the answer: NVIDIA CUDA MPS. Below we condiser in detail what's CUDA MPS and how we could utilize it for that task.

CUDA Multi-Process Service

The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (CUDA API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs, to utilize Hyper-Q capabilities on the latest NVIDIA (Kepler-based) GPUs. Hyper-Q allows CUDA kernels to be processed concurrently on the same GPU. This can benefit performance when the GPU compute capacity is underutilized by a single application process.

MPS is a binary-compatible client-server runtime implementation of the CUDA API, which consists of several components:

Control Daemon Process: the control daemon is responsible for starting and stopping the server, as well as coordinating connections between clients and servers.
Client Runtime: the MPS client runtime is built into the CUDA Driver library and may be used transparently by any CUDA application.
Server Process: the server is the clients' shared connection to the GPU and provides concurrency between clients.

To balance workloads between CPU and GPU tasks, MPI processes are often allocated individual CPU cores in a multi-core CPU machine to provide CPU-core parallelization of potential Amdahl bottlenecks. As a result, the amount of work each individual MPI process is assigned may underutilize the GPU when the MPI process is accelerated using CUDA kernels. While each MPI process may end up running faster, the GPU is being used inefficiently. The Multi-Process Service takes advantage of the inter-MPI rank parallelism, increasing the overall GPU utilization.

NVIDIA Volta architecture has introduced new MPS capabilities. Compared to MPS on pre-Volta GPUs, Volta MPS provides a few key improvements:

Volta MPS clients submit work directly to the GPU without passing through the MPS server.
Each Volta MPS client owns its own GPU address space instead of sharing GPU address space with all other MPS clients.
Volta MPS supports limited execution resource provisioning for Quality of Service (QoS).

cuda mps architecture for pascal and volta gpu

Fig.1. Pascal and Volta MPS architectures (picture from NVIDIA MPS Documentation)

CUDA MPS Benefits

GPU utilization: a single process may not utilize all the compute and memory-bandwidth capacity available on the GPU. MPS allows kernel and memcopy operations from different processes to overlap on the GPU, achieving higher utilization and shorter running times.
Reduced on-GPU context storage: without MPS each CUDA processes using a GPU allocates separate storage and scheduling resources on the GPU. In contrast, the MPS server allocates one copy of GPU storage and scheduling resources shared by all its clients. Volta MPS supports increased isolation between MPS clients, so the resource reduction is to a much lesser degree.
Reduced GPU context switching: without MPS, when processes share the GPU their scheduling resources must be swapped on and off the GPU. The MPS server shares one set of scheduling resources between all of its clients, eliminating the overhead of swapping when the GPU is scheduling between those clients.

CUDA MPS Limitations

MPS is only supported on the Linux operating system. The MPS server will fail to start when launched on an operating system other than Linux.
MPS is not supported on NVIDIA Jetson platforms. The MPS server will fail to start when launched on Jetson platforms.
MPS requires a GPU with compute capability version 3.5 or higher. The MPS server will fail to start if one of the GPUs visible after applying CUDA_VISIBLE_DEVICES is not of compute capability 3.5 or higher.
The Unified Virtual Addressing (UVA) feature of CUDA must be available, which is the default for any 64-bit CUDA program running on a GPU with compute capability version 2.0 or higher. If UVA is unavailable, the MPS server will fail to start.
The amount of page-locked host memory that can be allocated by MPS clients is limited by the size of the tmpfs filesystem (/dev/shm).
Exclusive-mode restrictions are applied to the MPS server, not to MPS clients.
Only one user on a system may have an active MPS server.
The MPS control daemon will queue MPS server activation requests from separate users, leading to serialized exclusive access of the GPU between users regardless of GPU exclusivity settings.
All MPS client behavior will be attributed to the MPS server process by system monitoring and accounting tools (e.g. nvidia-smi, NVML API).

GPU Compute Modes

Three Compute Modes are supported via settings accessible in nvidia-smi.

PROHIBITED – the GPU is not available for compute applications.
EXCLUSIVE_PROCESS – the GPU is assigned to only one process at a time, and individual process threads may submit work to the GPU concurrently.
DEFAULT – multiple processes can use the GPU simultaneously. Individual threads of each process may submit work to the GPU simultaneously.

Using MPS effectively causes EXCLUSIVE_PROCESS mode to behave like DEFAULT mode for all MPS clients. MPS will always allow multiple clients to use the GPU via the MPS server.

When using MPS, it is recommended to use EXCLUSIVE_PROCESS mode to ensure that only a single MPS server is using the GPU, which provides additional insurance that the MPS server is the single point of arbitration between all CUDA processes for that GPU.

Client-Server Architecture

This diagram shows a likely schedule of CUDA kernels when running an MPI application consisting of multiple OS processes without MPS. Note that while the CUDA kernels from within each MPI process may be scheduled concurrently, each MPI process is assigned a serially scheduled time-slice on the whole GPU.

Fig.2. Multi-Process Sharing GPU without MPS (picture from NVIDIA MPS Documentation)

Fig.3. Multi-Process Sharing GPU with MPS (picture from NVIDIA MPS Documentation)

When using pre-Volta MPS, the server manages the hardware resources associated with a single CUDA context. The CUDA contexts belonging to MPS clients funnel their work through the MPS server. This allows the client CUDA contexts to bypass the hardware limitations associated with time sliced scheduling, and permit their CUDA kernels execute simultaneously.

Volta provides new hardware capabilities to reduce the types of hardware resources the MPS server must managed. A client CUDA context manages most of the hardware resources on Volta, and submits work to the hardware directly. The Volta MPS server mediates the remaining shared resources required to ensure simultaneous scheduling of work submitted by individual clients, and stays out of the critical execution path.

The communication between the MPS client and the MPS server is entirely encapsulated within the CUDA driver behind the CUDA API. As a result, MPS is transparent to the MPI program. MPS clients CUDA contexts retain their upcall handler thread and any asynchronous executor threads. The MPS server creates an additional upcall handler thread and creates a worker thread for each client.

Server

The MPS control daemon is responsible for the startup and shutdown of MPS servers. The control daemon allows at most one MPS server to be active at a time. When an MPS client connects to the control daemon, the daemon launches an MPS server if there is no server active. The MPS server is launched with the same user id as that of the MPS client.

If there is an MPS server already active and the user id of the server and client match, then the control daemon allows the client to proceed to connect to the server. If there is an MPS server already active, but the server and client were launched with different user id’s, the control daemon requests the existing server to shutdown once all its clients have disconnected. Once the existing server has shutdown, the control daemon launches a new server with the same user id as that of the new user's client process.

The MPS control daemon does not shutdown the active server if there are no pending client requests. This means that the active MPS server process will persist even if all active clients exit. The active server is shutdown when either a new MPS client, launched with a different user id than the active MPS server, connects to the control daemon or when the work launched by the clients has caused an exception.

The control daemon executable also supports an interactive mode where a user with sufficient permissions can issue commands, for example to see the current list of servers and clients or startup and shutdown servers manually.

Client Attach/Detach

When CUDA is first initialized in a program, the CUDA driver attempts to connect to the MPS control daemon. If the connection attempt fails, the program continues to run as it normally would without MPS. If however, the connection attempt succeeds, the MPS control daemon proceeds to ensure that an MPS server, launched with same user id as that of the connecting client, is active before returning to the client. The MPS client then proceeds to connect to the server.

All communication between the MPS client, the MPS control daemon, and the MPS server is done using named pipes and UNIX domain sockets. The MPS server launches a worker thread to receive commands from the client. Upon client process exit, the server destroys any resources not explicitly freed by the client process and terminates the worker thread.

Important Application Considerations

The NVIDIA VIDEO Codec SDK is not supported under MPS on pre-Volta MPS clients.
Only 64-bit applications are supported. The MPS server will fail to start if the CUDA application is not 64-bit. The MPS client will fail CUDA initialization.
If an application uses the CUDA driver API, then it must use headers from CUDA 4.0 or later (i.e. it must not have been built by setting CUDA_FORCE_API_VERSION to an earlier version). Context creation in the client will fail if the context version is older than 4.0.
Dynamic parallelism is not supported. CUDA module load will fail if the module uses dynamic parallelism features.
MPS server only supports clients running with the same UID as the server. The client application will fail to initialize if the server is not running with the same UID.
Stream callbacks are not supported on pre-Volta MPS clients. Calling any stream callback APIs will return an error.
CUDA graphs with host nodes are not supported under MPS on pre-Volta MPS clients.
The amount of page-locked host memory that pre-Volta MPS client applications can allocate is limited by the size of the tmpfs filesystem (/dev/shm). Attempting to allocate more page-locked memory than the allowed size using any of relevant CUDA APIs will fail.
Terminating an MPS client without synchronizing with all outstanding GPU work (via Ctrl-C / program exception such as segfault / signals, etc.) can leave the MPS server and other MPS clients in an undefined state, which may result in hangs, unexpected failures, or corruptions.

Performance measurements for Jpeg2Jpeg application

For software testing we've utilized the following scenarious:

Source 24-bit RGB images with JPEG quality 90%, subsampling 4:2:0, restart interval 1
Initial image resolution: 1920×1080 (2K) or 1280×720 (1K)
2K resize: 1920×1080 to 480×270
1K resize: 1280×720 to 320×180
Output JPEG compression: quality 90%, subsampling 4:2:0, restart interval 10

Hardware and software

CPU Intel Core i7-5930K (Haswell-E, 6 cores, 3.5–3.7 GHz)
NVIDIA Quadro GV100
Linux Ubuntu 18.04 and CUDA-10.0
Fastvideo SDK 0.14.2.4

These are main components of Jpeg2Jpeg software

Server is responsible for image processing on GPU.
Client is responsible for image read from disk, image send for processing, storing the processed images after convertion. We need at least two Clients per Server to hide load/store operations.

If we are working with CUDA MPS activated, then the total number of processes in Jpeg2jpeg software is limited by the amount of available CPU cores.

To check CUDA MPS mode, we executed the following commands

nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

Then we started 2/4/6 daemons of JPEG Resize application on NVIDIA Quadro GV100 GPU.

We've also done the same without CUDA MPS to make a comparison.

How we measured the performance

To get reliable results which have good correspondence with JPEG Resize algorithm parameters, for each test we've utilized the same image and the same parameters for resizing and encoding. We've repeated each series 1,000 times and calculated average FPS (number of frames per second) for processing. Speedup is calculated as the current value of "FPS with MPS" divided to the best value from "FPS without MPS" column.

Jpeg2Jpeg performance with and without MPS for 1K JPEG Resize from 1280×720 to 320×180

Source image	Servers	Clients per Server	FPS without MPS	FPS with MPS	Speedup
1K	2	1	837	1198	1.4
1K	2	2	815	1633	2.0
1K	4	1	815	2326	2.9
1K	4	2	813	2581	3.2
1K	6	1	795	2857	3.4
1K	6	2	805	2871	3.4

Jpeg2jpeg performance with and without MPS for 2K JPEG Resize from 1920×1080 to 480×270

Source image	Servers	Clients per Server	FPS without MPS	FPS with MPS	Speedup
2K	2	1	769	1124	1.4
2K	2	2	762	1368	1.8
2K	4	1	761	1826	2.4
2K	4	2	748	1975	2.8
2K	6	1	769	2143	2.8
2K	6	2	696	2087	2.7

We see that performance saturation in that task could probably be connected with the number of utilized CPU cores. We will check the performance on multicore Intel Xeon CPU to find the solution with the best balance between CPU and GPU to achieve maximum acceleration for Jpeg2Jpeg application. This is essentially heterogeneous task and all hardware components should be carefully chosen and thoroughly tested.

Jpeg 2 Jpeg acceleration benchmarks for CUDA MPS on Linux

We've been able to boost the Jpeg2Jpeg software with CUDA MPS on Linux significantly. According to our time measurements, total performance for JPEG Resize application at CUDA MPS mode was increased by 2.8–3.4 times, which is difficult to believe. We have been able to accelerate the solution that was already well-optimized and it was one of the fastest on the market. For standard use cases on NVIDIA Quadro GV100 we've got benchmarks around 760–830 fps (images per second) and with CUDA MPS at the same test conditions and at the same hardware we've reached 2140–2870 fps.

Such an impressive performance boost is absolutely astonishing and we've checked that many times. It's working well and very fast. Moreover, we have fair chances to get even better acceleration by utilizing more powerful multicore CPU.

GPU and FPGA solutions for Jpeg2Jpeg applications were on par recently, but this is not the case anymore. Now NVIDIA GPU with Jpeg2Jpeg software from Fastvideo have left behind both CPU and FPGA solutions.

There is a use case where Jpeg2Jpeg transform on GPU could be much more efficient in comparison with CPU and FPGA solutions which are quite good if we need to create thumbnails from JPEG images. If we need to implement JPEG resize where image width and height should be changed to a small extent, then Jpeg2Jpeg perfromance on GPU will be much better then on CPU/FPGA.

References

Documentation for NVIDIA CUDA MPS
Web Resize on-the-fly: one thousand images per second on Tesla V100 GPU
JPEG Resize on-demand: FPGA vs GPU. Which is the fastest?
GTC China 2020 - Improve GPU utilization from system level (Test Case 3, slides 33-35)