In this article, we investigate the use of multi-core graphics processing units (GPUs) for video encoding and decoding. After an overview of video coding and GPUs, we review previous work on structuring video coding modules so that the massively parallel processing capability of GPUs can be harnessed. We also review previous work on partitioning the video decoding flow between the central processing unit (CPU) and the GPU. We then discuss in detail a GPU-based fast motion estimation scheme to illustrate some design considerations in using GPUs for video coding, and the tradeoff between speedup and rate-distortion performance. Our results highlight the importance of exposing as much data parallelism as possible when designing algorithms for GPUs.
Today, video coding has become a central technology in a wide range of applications, including digital TV, DVD, Internet streaming video, video conferencing, distance learning, and surveillance and security. A variety of video coding standards and algorithms have been developed (e.g., H.264/AVC, VC-1, MPEG-2, AVS) to address the requirements and operating characteristics of different applications. Given how widely video coding technologies are deployed, it is important to investigate efficient implementations of video coding systems on different computing platforms and processors.
Recently, Graphics Processing Units (GPUs) have emerged as co-processors for Central Processing Units (CPUs) to accelerate various numerical and signal processing applications. Modern GPUs may consist of hundreds of highly decoupled processing cores capable of achieving immense parallel computing performance. For example, the NVIDIA GeForce 8800 GTS processor has 96 individual stream processors, each running at 1.2 GHz. The stream processors can be grouped together to perform Single Instruction Multiple Data (SIMD) operations suitable for arithmetic-intensive applications. With advances in GPU programming tools such as thread computing and C programming interfaces, GPUs can be efficiently utilized to perform a variety of processing tasks in addition to conventional vertex and pixel operations.
With many personal computers (PCs) and game consoles equipped with multi-core GPUs capable of general-purpose computing, it is important to study how GPUs can be utilized to assist the main CPU in computation-intensive tasks such as video compression and decompression. In fact, as high-definition (HD) content becomes more popular, video coding will require more and more computing power. Leveraging the computing power of GPUs could therefore be a cost-effective approach to meeting the requirements of these applications. Note that with dozens of video coding standards available (e.g., H.264, MPEG-2, AVS, VC-1, WMV, DivX), it is advantageous to pursue a flexible solution based on software.
Focusing on software-based video coding applications running on PCs or game consoles equipped with both CPUs and GPUs, this article investigates how GPUs can be utilized to accelerate video encoding and decoding. Recent work has proposed applying multi-core GPUs and CPUs to various video and image processing applications; Table I summarizes some of this work. In this article, we survey prior work on video encoding and decoding to illustrate the challenges and advantages of GPU implementation. Specifically, we discuss previous work on GPU-based motion estimation, motion compensation and intra prediction. Our focus is on how the algorithms can be designed to harness the massively parallel processing capability of GPUs. In addition, we discuss previous work on partitioning the decoding flow between CPU and GPU. For completeness, we also report the speedup results from previous work; however, since GPU and multi-core software and hardware technologies have evolved dramatically over the last few years, some of these results may be outdated. After that, we investigate GPU-based fast motion estimation. We discuss strategies for breaking dependencies between different data units, and examine the tradeoff between speedup and coding efficiency.
The rest of this article is organized as follows. We first provide an overview of the state of the art in video coding and GPUs, and discuss the challenges of using GPUs to assist video coding. We then review previous work on GPU-accelerated video coding. After that, we study GPU-based fast motion estimation. Finally, we end with concluding remarks.
A. Video coding
The latest video coding standards have achieved state-of-the-art coding performance. For example, H.264/AVC, the latest international video coding standard approved by ITU-T and ISO/IEC, typically requires 60% or less of the bit rate of previous standards to achieve the same reconstruction quality. Other advanced video coding algorithms, such as AVS-Video, developed by the Audio and Video Coding Standard Working Group of China, and VC-1, initially developed by Microsoft, have also achieved competitive compression performance. In the following, we provide an overview of the H.264 video coding standard.
The H.264 video coding standard is based on the block-based hybrid video coding approach, which has been used since earlier video coding standards. To achieve compression, the coding algorithm exploits spatial correlation between neighboring pixels of the same picture, as well as temporal correlation between neighboring pictures in the input video sequence. Figure 1 depicts the encoder block diagram. The input picture is partitioned into blocks, and each block may undergo intra prediction, using neighboring reconstructed pixels in the same frame as the predictor. H.264 supports intra prediction block sizes of 16×16, 8×8 and 4×4, and allows different ways to construct the prediction samples from the adjacent reconstructed pixels. Alternatively, the input block may undergo inter prediction, using reconstructed blocks in the reference frames as the predictor. Inter prediction can be based on partition sizes of 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, or 4×4. The displacement between the current block and the reference block can be specified with up to quarter-pel accuracy, and is signaled by the motion vector and the reference picture index.
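To make the inter prediction step concrete, the following is a minimal sketch of integer-pel full-search block matching using the sum of absolute differences (SAD) as the matching cost. It is an illustration only, not the H.264 reference encoder or the method studied later in this article: the function names `sad` and `full_search`, the synthetic test frames, and the search range are all assumptions chosen for the example, and sub-pel refinement, multiple reference frames and rate-distortion costs are omitted.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by (dx, dy). Frames are lists of rows of integers."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Try every integer displacement within +/- search_range that keeps the
    reference block inside the frame; return (best_mv, best_sad)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate block would fall outside the frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Synthetic 16 x 16 frames (hypothetical test data): the current frame is the
# reference frame with its content shifted by 2 pixels in x and 1 pixel in y.
SIZE = 16
ref = [[(x * 7 + y * 13) % 256 for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[y + 1][x + 2] if y + 1 < SIZE and x + 2 < SIZE else 0
        for x in range(SIZE)] for y in range(SIZE)]

mv, cost = full_search(cur, ref, bx=4, by=4, n=4, search_range=3)
print(mv, cost)  # prints "(2, 1) 0"
```

Note that the SAD of each candidate displacement, and of each block, is computed independently of the others. This independence is the kind of data parallelism that a GPU implementation can exploit, for instance by assigning candidates or blocks to separate threads.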