Design and Implementation of MPEG4 Encoder Based on TMS320DM642

1 Introduction

In recent years, with the development of network and multimedia technologies, the demand for video communication has increased dramatically, and video compression coding is the key enabling technology. Literature [1] proposed a video coding scheme based on the TMS320DM642 DSP that implements the H.264 algorithm. Compared with H.264, MPEG4 has lower software and hardware development costs and is easier to implement, and it is currently the mainstream choice for video coding applications. This paper proposes an implementation of an MPEG4 video encoder based on the TMS320DM642 DSP, which can be used in remote video surveillance, video conferencing, and many other fields.
MPEG4 is an international video compression coding standard developed by the Moving Picture Experts Group (MPEG). It provides efficient compression algorithms and tools that adapt to different transmission bandwidths and aim to obtain the best image quality with the least amount of data. By analyzing the shape, motion, and texture of the image, MPEG4 uses algorithms such as DCT, quantization, and entropy coding to remove temporal and spatial redundancy from the image data. Its efficiency and broad applicability make it convenient for storing and transmitting video information.
MPEG4 defines different profiles and levels of encoders and bitstreams for applications with different bit rates, resolutions, quality requirements, and services. The Simple Profile provides encoding capabilities for rectangular video objects. This paper implements the Simple Profile of the MPEG4 video coding algorithm.
2 MPEG4 encoder hardware platform
The hardware platform for the MPEG4 encoder is based on the TMS320DM642 DSP, together with external SDRAM, FLASH, and other peripheral devices.
2.1 TMS320DM642 Features
The TMS320DM642 is a high-performance fixed-point digital signal processor based on the C64x core, developed by TI for multimedia applications. It has a clock frequency of 600 MHz and a peak processing capacity of 4800 MIPS. The DM642 supports the common fixed-point instruction set of the C6000 series DSPs and adds multimedia extension instructions, which make image processing algorithms easier and faster to execute. These features make the DM642 well suited to video processing and an ideal hardware platform for an MPEG4 video encoder.
2.2 Hardware System Structure
The hardware platform of the encoder is shown in Figure 1. The DM642 is the core of the system; it processes the video data at high speed and executes the MPEG4 encoding algorithm. The programmable video format conversion circuit preprocesses the raw input video and converts it into a digital video format that the encoder can accept. E2PROM and FLASH store the application program and initialization parameters, and SDRAM serves as off-chip memory that holds the video data being processed during encoding; all three are connected to the DM642 through the EMIF bus. Through the JTAG interface, CCS can be used for hardware and software emulation and debugging of the system. The real-time clock provides a time reference for the digital video.

3 MPEG4 encoder software implementation and optimization
3.1 MPEG4 software implementation
MPEG4 is an open standard that does not prescribe specific algorithms or procedures, so users can develop their own code according to their needs. We use the open-source XVID 1.1.0 code to implement the MPEG4 encoder. The XVID code implements the MPEG4 Simple Profile algorithm, does not require shape coding, and encodes only I-VOPs and P-VOPs. However, XVID was designed and developed for PC applications. To port it to the DSP, the code must be analyzed and modified to match the instruction architecture and features of the DSP.
The MPEG4 encoder implemented by the XVID code treats each frame of the original video as a video object and first decides whether to code it as an I frame or a P frame. An I frame is encoded from the image data of the entire frame. A P frame undergoes motion estimation and compensation, so that only the image residual and the motion vectors between the current frame and the reference frame are encoded. Each frame is divided into 16×16 macroblocks, each macroblock is further divided into 8×8 subblocks, and DCT, quantization, and VLC encoding are performed on macroblocks and subblocks. Because the image quality requirements are modest, we removed some XVID functions, such as GMC (Global Motion Compensation) and RVLC, which reduces both the amount of computation and the complexity of the code.
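The macroblock and subblock partitioning described above can be illustrated with a minimal C sketch. It is not taken from the XVID source: the names `mb_count` and `luma_block_ptr` are hypothetical, and the sketch assumes a luma plane stored row by row whose width and height are multiples of 16 (as is the case for QCIF, 176×144).

```c
#include <assert.h>
#include <stddef.h>

enum { MB_SIZE = 16, BLK_SIZE = 8 };

/* Number of 16x16 macroblocks in a luma plane.
 * Assumes both dimensions are multiples of 16. */
static int mb_count(int width, int height)
{
    assert(width % MB_SIZE == 0 && height % MB_SIZE == 0);
    return (width / MB_SIZE) * (height / MB_SIZE);
}

/* Address of the top-left pixel of 8x8 luma block `blk` (0..3, raster
 * order) inside macroblock (mb_x, mb_y). `stride` is the number of
 * bytes between vertically adjacent pixels of the plane. */
static const unsigned char *luma_block_ptr(const unsigned char *plane,
                                           int stride,
                                           int mb_x, int mb_y, int blk)
{
    int x = mb_x * MB_SIZE + (blk & 1) * BLK_SIZE;   /* blocks 1 and 3 are on the right */
    int y = mb_y * MB_SIZE + (blk >> 1) * BLK_SIZE;  /* blocks 2 and 3 are on the bottom */
    assert(blk >= 0 && blk < 4);
    return plane + (size_t)y * stride + x;
}
```

For a QCIF frame, `mb_count(176, 144)` gives 11 × 9 = 99 macroblocks, each containing four 8×8 luma blocks.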
3.2 Code Optimization
In order to improve the efficiency of code execution, the code must be optimized in combination with the characteristics of the DSP. The optimization is mainly divided into three levels:
3.2.1 Project level optimization
TI provides CCS, a powerful integrated development environment that includes a variety of efficient compilation tools. With appropriate compiler options (such as -o3 and -pm), the compiler can automatically restructure the code during compilation, reduce dependencies between instructions, increase instruction parallelism through software pipelining, improve loop performance, and optimize code size.

3.2.2 C language program level optimization
Using the profiler in CCS, we evaluated the C code to find the functions with the largest amount of computation, such as DCT, quantization, and motion estimation. Optimizing these parts of the code has the greatest effect on encoder performance. We adopted the following C-level optimization methods:
(1) Use C6000-specific keywords and intrinsics to rewrite the C code. For example, the keyword restrict tells the compiler that pointers do not alias, which removes assumed data dependencies and improves parallel execution. Intrinsics such as _add2() and _nassert() are special functions that map directly to inlined C6000 instructions or pass information to the optimizer, and they improve code execution efficiency on the DSP.
(2) Use word-wide access to short data. A single 32-bit integer access reads two 16-bit short values at a time and places them in the upper and lower 16-bit fields of a 32-bit register, which reduces the number of memory accesses and doubles the efficiency of reading data. Combined with intrinsics that operate on the upper and lower 16 bits of two registers at the same time, such as _add2() and _mpy2(), this can greatly improve code execution efficiency.
(3) Use loop unrolling. Nested loops are turned into shallower loops or even a single loop, which reduces loop nesting, eliminates redundant loop overhead, and increases the degree of parallel instruction execution.
(4) The DSP has no dedicated hardware division unit. Division is implemented by repeated subtraction, which is computationally expensive. Division operations should therefore be minimized, and divisions that cannot be eliminated should be replaced by shift operations where possible, which reduces computation time.
(5) Use the TI image library. TI provides the IMAGE library (IMGLIB), which includes many common image processing functions, such as the 8×8 subblock DCT (IMG_fdct_8x8) and SAD calculation (IMG_sad_8x8). These functions are already optimized, run very efficiently, and can be called directly from the program.
3.2.3 Assembly level optimization
Linear assembly is a programming language unique to the C6000 series of DSPs. It is similar to assembly, but the programmer does not need to specify details such as functional units, registers, or parallelism; the assembly optimizer determines these automatically from the code. We rewrote in linear assembly the key parts of the code that are computationally intensive and frequently called, such as the quantization, DCT, and SAD modules, which further optimizes the loop iterations and improves instruction parallelism. Table 2 compares the number of clock cycles consumed by several function modules before and after rewriting, measured on 3 frames of the foreman.qcif test sequence.
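As an illustration of the kind of kernel that is a candidate for such rewriting, the following is a plain C reference for an 8×8 SAD. It is a sketch, not the XVID or IMGLIB implementation; the name `sad_8x8_ref` and its parameter layout (both blocks addressed with the same `stride`) are assumptions made for this example. A reference of this form is also useful for checking that a hand-optimized version returns identical results.

```c
#include <assert.h>
#include <stdlib.h>

/* Sum of absolute differences between two 8x8 pixel blocks.
 * `cur` and `ref` point to the top-left pixel of each block, and
 * `stride` is the distance in bytes between consecutive rows. */
static unsigned sad_8x8_ref(const unsigned char *restrict cur,
                            const unsigned char *restrict ref,
                            int stride)
{
    unsigned sad = 0;
    assert(stride >= 8);
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            sad += (unsigned)abs(cur[x] - ref[x]);  /* operands promote to int */
        cur += stride;
        ref += stride;
    }
    return sad;
}
```

The inner loop has a fixed trip count of 8 and no dependencies between rows other than the accumulator, which is what makes this kind of kernel amenable to unrolling and software pipelining.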

3.3 Storage space configuration
The DSP's on-chip memory is limited, so the large amount of video data to be processed by the encoder (including the current frame and the reference frames) must be placed off-chip, and the CPU accesses off-chip memory more slowly than on-chip memory. Using the EDMA function of the DM642, while the CPU encodes the previous data, the next off-chip data is moved to on-chip memory in advance through an EDMA channel. The two work in parallel, which improves the efficiency of data transfer from off-chip to on-chip memory and reduces CPU wait time.
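The overlap of transfer and computation described above is commonly organized as double (ping-pong) buffering. The sketch below shows only the control flow and makes several assumptions that are not from the original design: the frame is fetched in strips of one macroblock row, `fetch_strip` is a hypothetical stand-in that uses a blocking `memcpy` where the DSP would submit an asynchronous EDMA transfer, and `process_strip` is a placeholder for the real encoding work.

```c
#include <assert.h>
#include <string.h>

enum { STRIP_BYTES = 176 * 16 };  /* one row of QCIF luma macroblocks */

/* Stand-in for an EDMA transfer from off-chip to on-chip memory.
 * Here it is a blocking copy; on the DSP it would be started and
 * then left to run in the background. */
static void fetch_strip(unsigned char *on_chip, const unsigned char *off_chip)
{
    memcpy(on_chip, off_chip, STRIP_BYTES);
}

/* Placeholder for the encoding work: sums the pixel values. */
static unsigned long process_strip(const unsigned char *strip)
{
    unsigned long acc = 0;
    for (int i = 0; i < STRIP_BYTES; i++)
        acc += strip[i];
    return acc;
}

/* Processes a frame strip by strip with two alternating buffers. */
static unsigned long process_frame_pingpong(const unsigned char *frame,
                                            int n_strips)
{
    static unsigned char buf[2][STRIP_BYTES];  /* would reside in on-chip RAM */
    unsigned long total = 0;

    assert(n_strips > 0);
    fetch_strip(buf[0], frame);  /* prime the first buffer */
    for (int s = 0; s < n_strips; s++) {
        int cur = s & 1;
        /* Start fetching the next strip into the other buffer. */
        if (s + 1 < n_strips)
            fetch_strip(buf[cur ^ 1], frame + (size_t)(s + 1) * STRIP_BYTES);
        /* With a real asynchronous transfer, this work would overlap
         * the fetch, and a wait for transfer completion would be
         * needed before the other buffer is used. */
        total += process_strip(buf[cur]);
    }
    return total;
}
```

Because the CPU only ever reads the buffer that has already been filled, the transfer into the other buffer can proceed without the CPU stalling on off-chip accesses.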
3.4 Experimental results
Encoder performance was tested by encoding standard QCIF format (176×144) test sequences: 300 frames of the news sequence, 150 frames of the suzie sequence, and 400 frames of the foreman sequence. Hardware emulation experiments were run in TI's integrated development environment CCS 2.0. With the target code rate set to 100 b/s, the results are shown in Table 3.

The test results show that the encoder's encoding speed exceeds 25 fps, which meets the requirements of real-time coding. If the transmission code rate is lowered, the encoding speed can be increased further. The results also show that the compression ratio differs between test sequences, which is caused by differences in image motion and background changes. For example, the suzie sequence has a uniform background and moderate motion, so its compression ratio is relatively high, whereas the background of the news sequence changes constantly, so its compression ratio is relatively low. Comparing the images before and after encoding shows no obvious distortion and no significant loss of image quality.
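The compression ratio discussed above is the raw size of the sequence divided by the size of the coded bitstream. A minimal sketch of that arithmetic follows; it assumes YUV 4:2:0 input (12 bits per pixel), which the text does not state explicitly, and the function names are illustrative. No values from Table 3 are used.

```c
#include <assert.h>

/* Bytes in one raw frame, assuming YUV 4:2:0 sampling: a full
 * resolution luma plane plus two chroma planes at quarter size. */
static unsigned long raw_frame_bytes(unsigned width, unsigned height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    return (unsigned long)width * height * 3 / 2;
}

/* Compression ratio = raw bytes of the whole sequence / coded bytes. */
static double compression_ratio(unsigned width, unsigned height,
                                unsigned frames, unsigned long coded_bytes)
{
    assert(coded_bytes > 0);
    return (double)(raw_frame_bytes(width, height) * frames)
           / (double)coded_bytes;
}
```

Under this assumption one raw QCIF frame occupies 176 × 144 × 1.5 = 38016 bytes, so a 150-frame sequence is 5702400 bytes before coding; a hypothetical coded size of 114048 bytes would then correspond to a compression ratio of 50.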
4 Conclusion
This paper discusses the implementation scheme and optimization methods of an MPEG4 encoder on the DM642 and implements the MPEG4 Simple Profile encoding algorithm. The experimental results show that the proposed scheme is easy to implement and practical, that the code optimization methods are effective, and that the performance tests give satisfactory results. On this basis, the implementation can be extended to more advanced MPEG4 profiles and the code optimization methods can be studied in more depth to meet higher application requirements.
