Method 1: Convert floating-point operations to fixed-point operations
The C6x DSP board does not support floating-point operations, but our original program code is written in floating point. It must therefore be converted to fixed-point arithmetic, and the converted code also runs much faster. We use the Q format to represent fractional values in fixed point. The related principles are described below.
A fixed-point DSP represents the fractional part of a number with a binary point at a fixed position, which limits the range of values it can hold. The Q format names the position of the binary point: each Q format places the point at a different bit and therefore gives a different integer range. In Q15, the MSB (most significant bit) is the sign bit, and each bit after the binary point has half the weight of the bit before it. Setting the sign bit to 0 and all remaining bits to 1 gives the largest positive number (7FFFH); setting the sign bit to 1 and all remaining bits to 0 gives the most negative number (8000H). Q15 therefore covers -1 to 0.9999695 (approximately 1). Shifting the binary point one place to the right enlarges the integer part: Q14 covers -2.0 to 1.9999390 (approximately 2). The larger range comes at the cost of precision.
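The Q15 rules above can be illustrated with a minimal C sketch. The names q15_t, float_to_q15, q15_to_float and q15_mul are hypothetical helpers, not part of the original program, and the code assumes that right-shifting a negative integer is an arithmetic shift, as it is on common compilers.

```c
#include <stdint.h>

typedef int16_t q15_t;   /* 1 sign bit, 15 fractional bits (assumed name) */

/* Convert a float to Q15, rounding to nearest and saturating to the
   representable range [-1, 1 - 2^-15]. */
static q15_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f)  return (q15_t)0x7FFF;
    if (scaled <= -32768.0f) return (q15_t)(-32768);
    return (q15_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Convert Q15 back to float, for checking results on a PC. */
static float q15_to_float(q15_t q)
{
    return (float)q / 32768.0f;
}

/* Q15 x Q15 -> Q15. The 32-bit product is in Q30, so it is rounded
   and shifted right by 15 bits. (-1) x (-1) would give +1, which does
   not fit in Q15, so the result is saturated to 7FFFH. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    p = (p + (1 << 14)) >> 15;
    if (p > 32767) p = 32767;
    return (q15_t)p;
}
```

For example, 0.5 is stored as 16384 (4000H), and multiplying 0.5 by 0.5 yields 8192 (2000H), which is 0.25 in Q15.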
Method 2: Build lookup tables
The original program reads the AAC file and, during decoding, also calls C routines to compute values such as sin, cos, and exp. To speed up execution, the results of these operations are precomputed and built into the program as tables, so no additional calculation is needed at run time.
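As a rough illustration of the table approach, the sketch below stores one period of a sine wave as 16 precomputed Q15 values. The table size, the Q15 scaling, and the names sin_table and sin_q15 are assumptions made for this example; the decoder's real tables would be generated offline at whatever resolution it needs.

```c
#include <stdint.h>

#define SIN_TABLE_SIZE 16   /* assumed size; must be a power of two */

/* round(32767 * sin(2*pi*i/16)), computed offline */
static const int16_t sin_table[SIN_TABLE_SIZE] = {
         0,  12539,  23170,  30273,
     32767,  30273,  23170,  12539,
         0, -12539, -23170, -30273,
    -32767, -30273, -23170, -12539
};

/* Look up sin() for a phase index; the index wraps around the table,
   so no call to the math library is made at run time. */
static int16_t sin_q15(unsigned phase)
{
    return sin_table[phase & (SIN_TABLE_SIZE - 1)];
}
```

The trade-off is memory for speed: a finer table costs more on-chip RAM but needs no interpolation.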
Method 3: Reduce the length of the program
1. Remove the debug code
While the original program was being debugged, a lot of error-checking code was added to it. Once debugging is finished and no more errors occur, this code can be removed. This shortens the program and reduces the number of clocks needed to execute it.
2. Remove the clock-counting code
The original program can count the number of clocks it takes to execute. This code can also be removed. If the clock count is needed, the C6x tool software can measure it, and it is more powerful.
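One common way to remove such code without deleting it from the source is conditional compilation. This is a swapped-in technique, not necessarily what the original program did, and the DEBUG switch, the DBG_PRINT macro and sum_samples are hypothetical names. The same kind of switch could wrap the clock-counting code.

```c
#include <stdio.h>

/* Build with -DDEBUG to keep the checks; without it the macro expands
   to nothing and costs no clocks at run time. */
#ifdef DEBUG
#define DBG_PRINT(...) fprintf(stderr, __VA_ARGS__)
#else
#define DBG_PRINT(...) ((void)0)
#endif

/* Hypothetical routine with a debug trace inside its loop. */
static int sum_samples(const short *x, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += x[i];
        DBG_PRINT("i=%d acc=%d\n", i, acc);
    }
    return acc;
}
```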
Method 4: Reduce I/O operations
The original decoder reads one part of the AAC file, decodes it, then reads the next part and decodes that. However, the C6x board is quite slow at reading files from the PC, so the read operations take up most of the time. The program is therefore changed to read the AAC file into the C6x memory first and then decode it. Alternatively, the AAC data can be stored as a table (about 1 MB), so that the DSP board does not run out of memory.
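The read-everything-first idea might look like the following sketch, written with standard C file I/O. The function name read_whole_file is hypothetical, and on the actual board the buffer would have to fit in DSP memory.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into a heap buffer with a single fread() call.
   Returns NULL on failure; the caller frees the buffer. */
static unsigned char *read_whole_file(const char *path, size_t *out_size)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long n = ftell(f);
    if (n < 0) { fclose(f); return NULL; }
    rewind(f);

    unsigned char *buf = malloc(n > 0 ? (size_t)n : 1);
    if (buf == NULL) { fclose(f); return NULL; }

    if (fread(buf, 1, (size_t)n, f) != (size_t)n) {
        free(buf);
        fclose(f);
        return NULL;
    }
    fclose(f);
    *out_size = (size_t)n;
    return buf;
}
```

The decoder then walks through the buffer in memory instead of issuing a slow host read for every block.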
Method 5: Reduce subroutine calls
When a subroutine is called, the contents of the registers must first be saved on the stack, and when the subroutine returns, the original register contents are restored from the stack. Some subroutines are very short and are called many times. They often finish in a few clocks but waste time accessing the stack. To reduce the number of clocks, simply write these short subroutines directly into the main program.
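Besides pasting the body into the caller by hand, C99 offers the inline keyword, which asks the compiler to do the same thing. The sketch below assumes the compiler honours the request; sat16 and add_sat are hypothetical examples of a short, frequently called routine and its caller.

```c
#include <stdint.h>

/* Short helper called once per sample: a good inlining candidate. */
static inline int16_t sat16(int32_t x)
{
    if (x > 32767)  return 32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}

/* With sat16 inlined, the loop body contains no call and no stack
   traffic for saving and restoring registers. */
static void add_sat(const int16_t *a, const int16_t *b, int16_t *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = sat16((int32_t)a[i] + b[i]);
}
```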
Method 6: Write assembly language by hand
Although the assembly code that the compiler generates from C executes correctly, it is not the most efficient way to write the code. To increase the efficiency of the program, some parts, for example short functions that are called many times, must be replaced with hand-written assembly.
Method 7: Use parallel processing
The C6x is a powerful processor with eight internal functional units that can execute different instructions, which means that up to eight instructions can be processed simultaneously. If we can exploit this parallelism, we can greatly shorten the execution time of the program and decode as efficiently as possible.
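At the C level, one way to expose work that can run in parallel is to split a computation into independent chains. The sketch below is an illustration under that assumption, not the original decoder code: a dot product with two accumulators that do not depend on each other. Whether the compiler actually schedules the two multiply-accumulates in parallel depends on the tools and options used.

```c
#include <stdint.h>

/* Dot product with two independent accumulators. acc0 and acc1 never
   depend on each other inside the loop, so the two multiply-adds of
   one iteration may be issued on different functional units. */
static int32_t dot_q15(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc0 = 0, acc1 = 0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += (int32_t)a[i]     * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
    }
    if (i < n)                      /* odd element left over */
        acc0 += (int32_t)a[i] * b[i];
    return acc0 + acc1;
}
```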
Finally, some things you need to know:
From my experience, the third level of optimization (-O3) is not very efficient. Other techniques, such as reading two adjacent 16-bit data items with one 32-bit read instruction, are described in the C optimization manual, but their efficiency is not high either. TI's marketing claims 80%, but when I tried it myself I found nothing of the sort; 65% is closer. If you want higher efficiency, you can only get it with assembly. Also look at how your C program is compiled: if it contains a lot of interrupts, the 6000 can be said to have no advantage. In addition, the profiler's data is not accurate; it is larger than the actual figure, and it is hard to say by how much. Finally, DSP initialization is particularly slow. This time should not be compared with the PC; compare only the core part.
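The idea of reading two adjacent 16-bit items with one 32-bit read can be sketched in portable C as below. This is only an illustration: memcpy stands in for the 32-bit load, the function name sum_pairs is hypothetical, and the code assumes the usual two's-complement conversion when a 16-bit half is cast back to int16_t. Summing both halves keeps the result independent of byte order.

```c
#include <stdint.h>
#include <string.h>

/* Sum 2 * n_pairs samples, fetching two 16-bit samples per 32-bit load. */
static int32_t sum_pairs(const int16_t *x, int n_pairs)
{
    int32_t acc = 0;
    for (int i = 0; i < n_pairs; i++) {
        uint32_t w;
        memcpy(&w, &x[2 * i], sizeof w);   /* one 32-bit read */
        acc += (int16_t)(w & 0xFFFFu);     /* one half of the word */
        acc += (int16_t)(w >> 16);         /* the other half */
    }
    return acc;
}
```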
About profile:
The C6x debug tool provides a profile interface. Figure 9 shows its most important windows. The upper-left window shows the C source we wrote, so we can see which step is currently being executed. The upper-right window shows the assembly code compiled from it, where we can also see the current step. The lower-left window is the command line, where we enter commands and messages are displayed. The profile window in the middle is the most important window in profile mode. It displays the following items:
Count: the number of times the function was called
Inclusive: the total number of execution clocks, including subroutines
Incl-Max: the maximum number of clocks for a single execution, including subroutines
Exclusive: the total number of execution clocks, excluding subroutines
Excl-Max: the maximum number of clocks for a single execution, excluding subroutines
In profile mode we can analyze how many times each function in the program is called, how many clocks it takes, and so on. From the results of this analysis we can see which function takes the most time and can be improved, and then optimize it.
Assembly code level optimization
If the performance requirements are still not met after the C code has been optimized, use the profile and clock tools to find the inefficient parts and rewrite them in linear assembly. The linear assembly is then processed by the assembly optimizer, which does the following with the input linear assembly code:
● Looks for CPU instructions that can be executed in parallel.
● Handles pipeline labels during software pipelining.
● Allocates registers.
● Assigns functional units.
The assembly optimizer provided by TI can achieve high efficiency and generally meets the performance requirements.
Problems in optimization
Optimization always involves changing the program, so problems often arise.
1) Verification of optimization results
After optimization it is often unclear whether the program still runs correctly, so this must be verified. The usual approach is to verify it with test sequences. A test sequence is a set of specially chosen data that accurately reflects the characteristics of a particular algorithm. Each data set in the test sequence contains input data and output data. The program processes the input data, and the result is compared with the output data to determine whether the program is correct. Test sequences are generally provided for common algorithms. For other algorithms there are none, and a test sequence must be constructed according to the characteristics of the algorithm. When constructing one, it is best to include several groups of sequences and to make the data reasonably long, so that the verification is more reliable.
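A test-sequence check of this kind can be sketched as follows. The types and names (test_vector_t, process_fn, verify, halve) are hypothetical, and halve merely stands in for whatever optimized routine is under test.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_LEN 64   /* assumed upper bound on vector length */

/* One entry of a test sequence: input data plus the expected output. */
typedef struct {
    const int16_t *input;
    const int16_t *expected;
    size_t         length;
} test_vector_t;

typedef void (*process_fn)(const int16_t *in, int16_t *out, size_t n);

/* Run fn on every vector; return 1 if all outputs match, else 0. */
static int verify(process_fn fn, const test_vector_t *vecs, size_t count)
{
    int16_t out[MAX_LEN];
    for (size_t v = 0; v < count; v++) {
        if (vecs[v].length > MAX_LEN) return 0;
        fn(vecs[v].input, out, vecs[v].length);
        for (size_t i = 0; i < vecs[v].length; i++)
            if (out[i] != vecs[v].expected[i]) return 0;
    }
    return 1;
}

/* Hypothetical routine under test. */
static void halve(const int16_t *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (int16_t)(in[i] / 2);
}
```

Running the same vectors through the unoptimized and the optimized version gives a quick regression check after each change.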
2) Memory overflow problem
The internal storage space of the C64x-series DSP is 1 MB. Program, data, and the CPU's level-2 cache share this space. When the program behaves abnormally, the cause is therefore likely to be a memory overflow, where one of them overwrites another. In the program design, avoid pointers where possible and pay attention to bounds checking.
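Bounds checking can be as simple as routing writes through a helper that refuses out-of-range indices. The sketch below uses a hypothetical function name, buf_write, and trades a few clocks per access for safety, so it is best suited to debug builds or non-critical paths.

```c
#include <stddef.h>
#include <stdint.h>

/* Write value at buf[index] only if the index is inside the buffer.
   Returns 0 on success and -1 if the write would overrun. */
static int buf_write(int16_t *buf, size_t capacity, size_t index, int16_t value)
{
    if (buf == NULL || index >= capacity)
        return -1;
    buf[index] = value;
    return 0;
}
```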
Some programming methods
When programming, the goal is always to meet the actual requirements. Besides optimization, an actual design can use other methods that exploit the characteristics of the DSP to improve the running performance of the program and meet the design requirements.
1) Put the program and frequently used data into the on-chip RAM
On-chip RAM works at the same clock frequency as the CPU and performs much better than off-chip RAM. Putting the program on chip can therefore greatly increase its speed. Putting frequently used data on chip also saves processing time.
2) Move data with DMA
The C64x chip has 1 MB of on-chip RAM, but this may still be insufficient for some large-scale image processing algorithms. In that case DMA is often used to move the data that is needed onto the chip and the data that is no longer needed off the chip, which can greatly increase the speed of the program.
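The usual structure for this is ping-pong (double) buffering: while one on-chip buffer is being processed, the next block is fetched into the other. The sketch below only shows that structure. dma_copy is a hypothetical stand-in implemented with memcpy, so nothing actually overlaps here; on real hardware it would be replaced by a DMA transfer started in the background, followed by a wait for completion before the buffers are swapped.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 64   /* assumed block size, in samples */

/* Hypothetical stand-in for a DMA transfer from off-chip memory. */
static void dma_copy(int16_t *dst, const int16_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

static int32_t sum_block(const int16_t *p, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) acc += p[i];
    return acc;
}

/* Process a large "off-chip" array block by block via two buffers. */
static int32_t sum_large(const int16_t *ext, size_t total)
{
    static int16_t ping[BLOCK], pong[BLOCK];   /* would live on chip */
    int16_t *cur = ping, *next = pong;
    int32_t acc = 0;
    size_t done = 0;
    size_t n = total < BLOCK ? total : BLOCK;

    if (total == 0) return 0;
    dma_copy(cur, ext, n);                     /* prime the first block */
    while (n > 0) {
        size_t remaining = total - done - n;
        size_t n_next = remaining < BLOCK ? remaining : BLOCK;
        if (n_next > 0)                        /* fetch the next block */
            dma_copy(next, ext + done + n, n_next);
        acc += sum_block(cur, n);              /* work on current block */
        done += n;
        int16_t *tmp = cur; cur = next; next = tmp;
        n = n_next;
    }
    return acc;
}
```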
3) Use of the cache
Enlarging the cache can significantly improve performance. However, in the C64x-series DSP the cache shares the on-chip RAM with program and data. Enlarging the cache therefore reduces the on-chip space that is actually available, which must be taken into account in the design.