## **UC Irvine** ## **ICS Technical Reports** #### **Title** Design exploration for pipelined IDCT #### **Permalink** https://escholarship.org/uc/item/22c156f2 #### **Authors** Gajski, Daniel D. Grun, Peter Pan, Wenwei <u>et al.</u> #### **Publication Date** 1996-09-12 Peer reviewed SL BAR Z 699 c3 no.9641 Notice: This Material may be protected by Copyright Law (Title 17 U.S.C.) ## **Design Exploration for Pipelined IDCT** Daniel D. Gajski Peter Grun Wenwei Pan Smita Bakshi Technical Report #96-41 September 12, 1996 Department of Information and Computer Science University of California, Irvine Irvine, CA 92697-3425 (714) 824-7063 > gajski@ics.uci.edu pgrun@ics.uci.edu wpan@ics.uci.edu sbakshi@ics.uci.edu #### Abstract ASICs for video compression systems have stringent timing requirements. For example, according to the MPEG standard, the throughput of the MPEG decoder is 30 frames per second. This performance cannot be achieved without efficient pipelining. In this report, we explore the pipelined designs for the Inverse Discrete Cosine Transform (IDCT) which is a critical part of the MPEG decoder. We also transform the algorithm to minimize the memory requirement. We have implemented both the original and memory-optimized algorithms at the RT level, using our realistic library. Notice: This Material may be protected by Copyright Law (Title 17 U.S.C.) # Contents | 1 Introduction | 3 | |-------------------------------------|---| | 2 Example Description | 3 | | 3 Library Components | 4 | | 4 Memory Optimization | 5 | | 5 Pipelining Exploration | 6 | | 5.1 Pipelining Levels. | | | 5.1.1 Original Algorithm | 7 | | 5.1.1.1 Process Pipelining | 7 | | 5.1.1.2 Loop Body Pipelining | 7 | | 5.1.1.3 Functional Unit Pipelining | 8 | | 5.1.2 Memory-Optimized Algorithm | 8 | | 5.1.2.1 First Loop Body Pipelining | 9 | | 5.1.2.2 Second Loop Body Pipelining | 9 | | 5.1.2.3 Functional Unit Pipelining | 9 | | 5.2. Performance Comparison | 9 | | 6 VHDL Model Hierarchy | 0 | | 7 Design Example | | | 7.1 Loop Body Datapath | 2 | | 7.2 IDCT Block Diagram | | | 8 Conclusions | | | 9 References | | | 10 Appendix | | | 10.1 IDCT System | | | 10.2 IDCT with RAM. | | | 10.3 IDCT Computation | 5 | | | | | List of Figures | | | 1 IDCT algorithm | 4 | | 2 Memory optimization technique | | | 3 Process pipelining | | | 4 Loop pipelining | | | 5 Functional unit pipelining | | | 6 First loop pipelining | | | 7 Second loop pipelining | | | 8 Functional unit pipelining | | | 9 Pipelining exploration | | | 10 VHDL description hierarchy | | | 11 Loop body datapath | | | 12 Block diagram of IDCT chip | | | | | | List of Tables | | | 1 Components library | 4 | | 2 Original algorithm pipelining | | | 3 Optimized algorithm pipelining | | ### **Design Exploration for Pipelined IDCT** Daniel D. Gajski, Peter Grun, Wenwei Pan, Smita Bakshi Department of Information and Computer Science University of California, Irvine, CA 92697-3425 #### **Abstract** ASICs for video compression systems have stringent timing requirements. For example, according to the MPEG standard, the throughput of the MPEG decoder is 30 frames per second. This performance cannot be achieved without efficient pipelining. In this report, we explore the pipelined designs for the Inverse Discrete Cosine Transform (IDCT) which is a critical part of the MPEG decoder. We also transform the algorithm to minimize the memory requirement. We have implemented both the original and memory-optimized algorithms at the RT level, using our realistic library. #### 1 Introduction In this report, we give a detailed example showing how to design pipelined digital systems from behavioral description. The example is a custom ASIC for Inverse Discrete Cosine Transform (IDCT), which is used in the MPEG decoder. Details of the design as well as the source listing of VHDL code are given in the following sections. According to [5], the MPEG chip has to decode 30 frames/second, where one frame consists of 720 x 480 pixels. A macroblock is said to cover 16 x 16 pixels, hence there are 1350 macroblocks/frame. Since one macroblock contains 6 blocks, we can derive the throughput of 4115.2 ns/block, where a block is the basic data unit computed in IDCT. Keeping this in mind, we give designs for two different algorithms: the first one is an original algorithm consisting of two loops, the second one is a transformed algorithm in which the two loops in the first algorithm are merged aiming at memory size optimization. By pipelining the designs for these two different algorithms in different ways, we obtain a variety of cost/performance trade-offs. Section 2 presents the specification of the IDCT example, section 3 describes the library components needed. In section 4 we show the memory optimization, and in section 5 we present the pipelining exploration, by comparing several design alternatives. After the conclusions in section 6, in the appendix we give the VHDL code for our fastest pipelined designs. ### 2 Example Description IDCT is often a critical part in both still and motion picture compression. It takes as input an N x N image block, and creates an output matrix which is subsequently used in the next stages of the computation. The definition of IDCT for a block of N x N image is: $$O[u, v] = \frac{1}{N^2} \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} I[m, n] \cos \frac{(2m+1) u\pi}{2N}$$ $$\cos \frac{(2n+1) v\pi}{2N}$$ where: u, v = discrete frequency variables $0 \le u, v \le N - 1$ . I[m,n] = input, representing the gray level of a pixel at position (m,n) of the N $\times$ N image $0 \le m, n \le N-1$ . O[u,v] = output, representing the coefficient of point(u,v) in spatial frequency domain. The mathematical definition for IDCT can be thought of as a pair of matrix multiplications. I is the input matrix, COS is a coefficient matrix, and O is the output matrix. In the floating point domain, we define a coefficients matrix C as: $$C[u,v] = \frac{1}{N}\cos\frac{(2n+1)u\pi}{2N}$$ Based on C, we define the integer coefficients matrix COS as: $$COS = round(factor * C)$$ and COST as its transposed matrix. Let I be the N x N input block of image, and O be the output N x N matrix. Then, $$O = COS \times I \times COS^{T}$$ $$Temp = I \times COS^{T}$$ $$O = COS \times Temp$$ In the MPEG standard, N is 8. Or. Figure 1 Shows the flow graph of the IDCT algorithm. ## **3 Library Components** During design space exploration, we use different components to implement the operations and storage elements from the IDCT algorithm, to obtain a variety of cost/performance trade-offs. In Table 1 we show the component library used. | No | Component | Delay (ns) | Cost (trans.) | |----|---------------------|------------|---------------| | 1 | 16 bit selector | 0.4 | 224 | | 2 | 32 bit selector | 0.4 | 448 | | 3 | 16 bit CLA<br>adder | 2.1 | 1074 | | 4 | 32 bit CLA<br>adder | 2.9 | 2148 | Table 1: Components library. | No | Component | Delay<br>(ns) | Cost (trans.) | |----|----------------------------------------------|---------------|---------------| | 5 | 8 bit booth<br>multiplier | 5.6 | 3562 | | 6 | 16 bit booth<br>multiplier | 8.8 | 11220 | | 7 | 8 bit booth<br>multiplier<br>(2 stage pipe) | 3.5 | 4210 | | 8 | 16 bit booth<br>multiplier<br>(2 stage pipe) | 5.4 | 12624 | | 9 | 8 bit booth<br>multiplier<br>(4 stage pipe) | 2.7 | 5218 | | 10 | 16 bit booth<br>multiplier<br>(4 stage pipe) | 3.5 | 15036 | | 11 | 8 bit register | 0.4 | 256 | | 12 | 9 bit register | 0.4 | 272 | | 13 | 16 bit register | 0.4 | 512 | | 14 | 32 bit register | 0.4 | 1024 | | 15 | 9 bit counter | 2.5 | 414 | | 16 | 64x16 RAM | 3.5 | 6144 | | 17 | 64x8 ROM | 3.5 | 2048 | Table 1: Components library. The components used in our design are from [3]. Due to technology improvement [6], the delay of an inverter is 0.1 ns which is 10% of the nominal delay of 1 ns of the inverter from [3]. Therefore, we scaled down the delays for all the components by a factor of 10. In column three we show the delays of the components after this consideration. We use the worse case delays for single signal change from any input to any output. For the pipelined components, the delay represents the delay of the longest stage. For the storage components it represents the average between the reading and writing time. In column four we show the cost of each component. We use the same estimation with [3] for the basic gates (nand, nor, inverter), which we use subsequently in the computation of the cost of the rest of components, in a bottom-up fashion. ## **4 Memory Optimization** As previously mentioned, the IDCT algorithm consists of two consecutive matrix multiplications. A new input sample of 64 bytes arrives every 4115 ns. The input is stored in a 64x8 input RAM. After computation, the output is stored in a 64x8 output RAM. Our goal is: given a component library, to design an IDCT chip which has the lowest area, while still satisfying the time constraint. Experiments have shown that in most video and speech processing applications, more than 50% of the chip area is occupied by memory units [4]. Therefore, by minimizing the storage requirements, we can obtain more area improvement than by other datapath components optimization. Hence, to drastically decrease the cost of the chip, we need to start by optimizing the memory usage. Starting from the initial behavioral description of the IDCT (see appendix), we show how to transform the algorithm in order to lower the memory requirements. As shown in section 2, the algorithm first multiplies the input matrix with the transposed cosine matrix generating a temporary matrix, and then multiplies the cosine matrix with the temporary to obtain the final result. Therefore, for this behavioral description, besides the input and output memories, we need another storage for the temporary result: Temp = $$I \times COS^T$$ O = $COS \times Temp$ One way to reduce the memory requirements is to eliminate the temporary memory from the computation. In order to do this, the order in which the operations in the algorithm are done has to be changed. In the initial description, the first matrix multiplication computes all the elements of the temporary matrix, and the second matrix multiplication uses them to create the output matrix. If we could schedule all the uses of each element from the temporary matrix immediately after they are created, we would need only one word as temporary storage instead of the whole matrix. In the initial algorithm, the creations and uses of the elements of Temp are interleaved, which implies that their lifetimes overlap. Thus, they cannot use the same memory location. On the other hand, if we manage to transform the algorithm so that the creations and uses of these elements are not interleaved, their lifetimes will not overlap, and only one memory location will be needed. Instead of creating the whole Temp matrix, and then using it, we create one element at a time, and use it wherever it is needed before creating a new element. Therefore the new element can share the same memory location with the previous one, and instead of the whole Temp matrix we only need one storage location. In Figure 2 a) we show the matrix multiplications for the original algorithm. Here we first compute all the elements from the Temp matrix, and then use them column by column in the second matrix multiplication. In Figure 2 b) we show the optimized version of the algorithm. Here we first compute one element of the Temp matrix, and then use it immediately in the second matrix multiplication to accumulate the partial sums for the corresponding column from the O matrix. In the appendix we show the behavioral VHDL description of both the original and the optimized algorithms. These two algorithms represent the starting point in our design process. In the next sections we show several pipelined implementations of these algorithms, generating different cost/performance trade-offs. ## 5 Pipelining exploration Most digital circuits today operate under certain timing constraints. For example, according to the MPEG standard, the IDCT stage should be done within 4115 ns. Pipelining is an efficient way to greatly improve the performance of a digital system, without significantly increasing the cost. To explore a large variety of cost/performance trade-offs, different levels of pipelining can be used. Intuitively, if the design is more pipelined, a performance gain is obtained, against an increase in complexity and a small degradation of the cost of the design. Any computation composed of a set of tasks which operate in a sequential order on data arriving repetitively, can be pipelined by considering each task a different stage. Depending on what we consider as the computation and it's constituent tasks, we obtain pipelining at different levels. The original IDCT algorithm is composed of two loops representing the two matrix multiplications. The memory-optimized algorithm is composed of a single loop consisting of two inner loops. In the following we present the different levels of pipelining for each of these algorithms. ``` for i:=0 to 7 loop for j:=0 to 7 loop for k:=0 to 7 loop A := I(i,k); B := COS(j,k); P := A * B: SUM := SUM + P; Temp(i,j) := SUM; end loop; end loop; end loop; stage 1 stage 2 for i:=0 to 7 loop for j:=0 to 7 loop for k:=0 to 7 loop A := COS(i,k); B := Temp(k,j); P := A * B; SUM := SUM + P; O(i,j) := SUM; end loop; end loop; end loop; Figure 3: Process pipelining ``` ### 5.1 Pipelining Levels. #### 5.1.1 Original algorithm. #### 5.1.1.1 Process pipelining If we consider the whole IDCT algorithm as a computation, and the two matrix multiplications as the constituent tasks, we obtain the first level of pipelining. We call this process pipelining. The two matrix multiplications represent the two stages of the pipeline, as shown in Figure 3. #### 5.1.1.2 Loop body pipelining Each matrix multiplication consists of a loop. If we consider the body of this loop as a computation, and the different operations (such as ``` for i:=0 to 7 loop for j:=0 to 7 loop for k:=0 to 7 loop A := I(i,k); B := COS(j,k); stage 1 ...... P := A * B; stage 2 ...... SUM := SUM + P; stage 3 ...... Temp(i,j) := SUM; stage 4 end loop; end loop; end loop; (a) for i:=0 to 7 loop for j:=0 to 7 loop for k:=0 to 7 loop A := COS(i,k); B := Temp(k,j); stage 1 P := A * B: ...... SUM := SUM + P; stage 3 ..... O(i,j) := SUM; stage 4 end loop; end loop; end loop; (b) ``` Figure 4: Loop body pipelining for a) the first matrix multiplication and b) the second matrix multiplication the memory read, memory write, multiplication and addition) as it's constituent tasks, we obtain the second level pipelining. We call this loop body pipelining. Figure 4 shows the pipe stages in the loop body pipelining. #### 5.1.1.3 Functional unit pipelining If we consider only one operation, such as the multiplication, as a computation, we can divide it into different tasks, and consider each task as a different pipe stage. This results in the third level pipelining, which we call functional unit pipelining. ``` for i:=0 to 7 loop for j:=0 to 7 loop for k:=0 to 7 loop A := I(i,k); B := COS(j,k); stage 2 SUM = SUM + P: Temp(i,j) := SUM; end loop; end loop; end loop; (a) for i:=0 to 7 loop for j:=0 to 7 loop for k:=0 to 7 loop A := COS(i,k); B := Temp(k,j); stage 1 SUM := SUM + P; O(i,j) := SUM; end loop; end loop; end loop; (b) ``` Figure 5: Functional unit pipelining for a) first matrix multiplication and b) second matrix multiplication Figure 5 shows the functional unit pipelining. In this case the multiplier is divided into 4 stages, while the other operations are non-pipelined. #### 5.1.2 Memory-optimized algorithm. The memory-optimized algorithm is comprised of an outer loop, which in turn consists of two inner loops. If we consider the whole algorithm as a computation, we should divide it into smaller tasks. Since there is only one outer loop in the algorithm, we cannot split it into different parts without changing the specification. Therefore, we cannot pipeline it at the process level. On the other hand, if we consider the body of this loop as a computation and the two inner loops as the constituent tasks, we obtain a first loop body pipelining. By considering the body of each of the innermost loops as a computation comprised of the tasks of memory read, multiplication, addition and memory write, a second loop body pipelining is obtained. As in the case of the original algorithm, each operation, such as multiplication, can be con- ``` for i:=0 to 7 loop for j:=0 to 7 loop for k:=0 to 7 loop end loop; stage 1 for k:=0 to 7 loop end loop; end loop; end loop; end loop; end loop; end loop; ``` sidered as a computation consisting of several parts. Each part is a different task, representing the stages of the pipeline. #### 5.1.2.1 First loop body pipelining. By pipelining the body of the j loop we obtain a two-stage pipeline, as shown in Figure 6. In this case, each stage consists of an inner loop executing 8 times. #### 5.1.2.2 Second loop body pipelining. ``` for i:=0 to 7 loop for j:=0 to 7 loop for k:=0 to 7 loop A := I(i,k); B := COS(j,k); stage 1 P := A * B: stage 2 ..... SUM := SUM + P; stage 3 end loop; Temp := Sum; for k:=0 to 7 loop C := O(k,j); D := COS(k,i); stage 1 stage 2 P := D * Temp; stage 3 SUM := C + P; O(k,j) := SUM; stage 4 end loop; end loop; end loop; Figure 7: Second loop body pipelining for the two innermost loops. ``` Figure 7 shows the innermost loop pipelining. Here we consider the body of the k loops as the computation to be pipelined, and the memory read, multiplication, addition and memory write tasks represent the pipe stages. ### 5.1.2.3 Functional unit pipelining. The functional units implementing operations from the algorithm can be also pipelined. The computation can be divided in several tasks, each task to represent a pipe stage, as shown in Figure 8. ## 5.2 Performance Comparison. In a time constrained design flow, the goal is to obtain the lowest cost implementation while still satisfying the time constraint. In the IDCT example the computation has to be executed in 4115 ns. By pipelining the design, we can increase the performance, without a significant impact on the cost. In the previous section we presented the different pipelining options we have for the IDCT algorithm. In the final design, these options can be combined in any way, to obtain different cost/performance trade-offs. For example, we show a comparison between several alternatives. | No | Pipelining | Throughput (ns) | Cost (trans.) | |-----|-------------------------------|-----------------|---------------| | 1.1 | non-pipelined | 36044.8 | 22198 | | 1.2 | functional unit<br>pipelining | 25088 | 26014 | Table 2: Original algorithm pipelining | No | Pipelining | Throughput (ns) | Cost (trans.) | |-----|---------------------------------------------------|-----------------|---------------| | 1.3 | process<br>pipelining | 18022.4 | 41486 | | 1.4 | loop body pipe-<br>lining | 9046 | 24550 | | 1.5 | process + loop +<br>functional unit<br>pipelining | 1813 | 53822 | Table 2: Original algorithm pipelining | No | Pipelining | Throughput (ns) | Cost (trans.) | |-----|----------------------------------------------------------------|-----------------|---------------| | 2.1 | non-pipelined | 40550 | 16054 | | 2.2 | functional unit<br>pipelining | 23296 | 19870 | | 2.3 | first loop body<br>pipelining | 18022 | 29646 | | 2.4 | second loop<br>body pipelining | 11827 | 17590 | | 2.5 | first loop +<br>second loop +<br>functional unit<br>pipelining | 1844 | 42350 | Table 3: Optimized algorithm pipelining Tables 2 and 3 show the performance and cost of different implementations using the two algorithms for different combinations of pipelining. We present the non-pipelined (1.1, 2.1), partially pipelined (1.2, 1.3, 1.4, 2.2, 2.3, 2.4) and completely pipelined (1.5, 2.5) designs. In Figure 9 we present the design alternatives in the cost/throughput space. The points denoted 1.1-1.5 represent different implementations of the original algorithm, whereas the points 2.1-2.5 represent implementations of the memory-optimized algorithm. By comparing the points on each curve, we see that the more pipelined the design, the higher the performance. The points 1.5 and 2.5 represent the most highly pipelined versions, combining all the three pipelining levels. As expected, they have the highest performance and highest cost. The position of the points 1.2, 1.3, 1.4 shows the relationship between the different pipelining alternatives. The most performance gain is obtained by the loop body pipelining, whereas the lowest gain is generated by the functional unit pipelining. The reason that the point 1.4 has lower throughput but lower cost than 1.3 is because in design 1.4 there is no processing pipelining which means sharing the same datapath for the inner loop body is obvious and considered in the estimation. A similar relationship is obtained for the memory-optimized algorithm. By comparing the two curves representing the original and the memory-optimized algorithm, we see that the overall cost of the optimized algorithm is generally lower than the cost of the original one. ## 6 VHDL Models Hierarchy. All the VHDL models are developed hierarchically in a bottom up fashion, as shown in Figure 10. (1). The 1st level of hierarchy consists of the basic gates, *muxes* and *flip flops*. All the VHDL models in this level have only behavioral description. All the higher level components are composed of these basic entities. The delay information for these gates and flip flops are obtained from [3] and scaled down by 10. This is because in [3] we use a calibrated method by which we assume the delay of inverter is 1 ns while all other gates and flip flops are correlated to this delay. For here, we have assumed the delay of inverter is 0.1 ns due to current technology in order to satisfy the time constraint of the MPEG decoder. Therefore, all the delays in this report are scaled down by 10. - (2). The 2nd level of hierarchy consists of the 16-bit and 32-bit adders, selectors, multipliers, and registers. They appear as RT level components in the datapath. All the VHDL models in this level have both the behavioral and structural description. - (3). Another level of hierarchy is the memory. The 64x8 ROM is used for storing the COS matrix. The 64x8 RAM and 64x32 RAM are used as input and output buffers for the IDCT as interface with other stages in the MPEG decoder. In the implementation of the original algorithm, 64x16 RAM is also used as a buffer between the two matrix multiplications. - (4). The 4th level of hierarchy consists of the datapath and controller. The datapath model is an RT level structural model. The controller is behavioral model consisting of next-state logic, output logic and state registers. - (5). The 5th level of hierarchy comprises the IDCT computation along with the input and output memories (the IDCT with RAM entity). It also includes the Test Bench entity, - (6). The 6th level of hierarchy simply incorporates the IDCT with RAM model and the testbench to be simulated. ### 7 Design Example. In this section, we present a design example for original IDCT algorithm using process pipelining, loop body pipelining and functional unit pipelining together. Figure 3, 4, 5 shows the VHDL code for each of these pipelining techniques. #### 7.1 Loop body datapath Figure 11 shows the datapath design for the loop body of matrix multiplication. It is obvious from Figure 4 that the two matrix multiplications in IDCT have the same datapath structure. #### 7.2 IDCT Block Diagram Figure 12 shows the block diagram of the IDCT chip working in a pipelined MPEG decoder. The In\_RAM and the Out\_RAM are used as interface with other stages of pipelin- ing in MPEG decoder. Both the loop1 datapath and loop2 datapath can use the loop body datapath in Figure 11. The controllers for the two matrix multiplications are designed to be the same for simplicity. The ROM stores the COS matrix. The RAM1 and RAM2 are used to store the Temp matrix. In the operation mode of process pipelining, they alternate the read/write functionality. The Mem-swap logic is used to achieve this. The counter1 and counter2 and their associated logic are used to control the precise memory addressing. #### 8 Conclusions. This report presents techniques for pipelining exploration and memory optimization. We first lower the total cost of the design by transforming the algorithm. Given a timing constraint, we traverse the design space by using different pipelining alternatives. Finally we compare the different designs created, and show what are the implications of the early design decisions on the final implementation. We obtain a large spectrum of cost/performance trade-offs, providing a good starting point for the next levels of synthesis. #### 9 References. - [1] D. D. Gajski, "Principles of Digital Design", Prentice Hall 1996. - [2] D. D. Gajski, N. Dutt, A. Wu, and S. Lin, "High Level Synthesis: Introduction to Chip and System Design", Kluwer Academic Publishers, 1992. - [3] W. Pan, P. Grun, D.D. Gajski, "Behavioral Exploration with RTL Library", Technical Report, UCI ICS #96-34, July 29, 1996. - [4] F. Catthoor, W. Geurts, H. De Man, "Loop Transformation Methodology for Fixed - Rate Video, Image and Telecom Processing Applications", Proc. Int. Conf. on Applic. Spec. Array Processors, San Francisco, CA, Aug. 1994. - [5] A.B. Thordarson, "Comparison of Manual and Antomatic Behavioral Synthesis on MPEG algorithm", M.S. thesis, UCI-ICS, 1995. - [6] LCB 500K, Preliminary Design Manual, LSI Logic, June 1995 ## 10 Appendix. library ieee; end component; ### 10.1 IDCT System. ``` -- represents the whole system which includes ``` - -- the test bench, the input and output memory, - -- and the IDCT computation. ``` use ieee.std_logic_1164.all; use ieee.std_logic_misc.all; use ieee.std_logic_arith.all; entity idct_system is end idct_system; architecture test of idct_system is component idct_with_ram port ( clk:in std_logic; rd_addr:in std_logic_vector(5 downto 0); start:in std_logic_vector(5 downto 0); wr_addr:in std_logic_vector(7 downto 0); done:out std_logic; rd_data:out std_logic_vector(31 downto 0)); ``` ``` signal clk, start, done : std_logic; signal wr_addr,rd_addr : std_logic_vector(5 downto 0); signal wr_data : std_logic_vector(7 downto 0); signal rd_data : std_logic_vector(31 downto 0); begin ``` u1: idct\_with\_ram port map(clk,rd\_addr,start, ``` wr_addr,wr_data,done,rd_data); for i in 40 to 63 loop process wr_addr <=conv_std_logic_vector(i,6); variable clk_value : std_logic := '1'; wr_data <= conv_std_logic_vector(255,8); wait for 0 ns; end loop; clk_value := not clk_value; clk <= clk_value; wait for 2 ns; handshaking end process; wait until start = '1'; start <= '0' after 0 ns, '1' after 5 ns, '0' after 10 ns; handshaking process wait until done = '1'; type rf is array (0 to 7, 0 to 7) of integer; wait until clk = '1'; variable result : rf := ( (88710930, -18305430, 22913790, -1664130, verify 14721150, 3968310, 10368810, 7296570). (-5478165, 1130415, -1414995, 102765, - for i in 0 to 7 loop 909075, -245055, -640305, -450585), for j in 0 to 7 loop (68742135, -14184885, 17755905, -1289535, rd_addr <= conv_std_logic_vector(i,3) & 11407425, 3075045, 8034795, 5654115 ), conv_std_logic_vector(j,3); (-5654880, 1166880, -1460640, 106080, - wait for 4 ns; 938400, -252960, -660960, -465120 ), data := conv_integer(signed(rd_data)); (-7422030, 1531530, -1917090, 139230, - assert (data = result(i, j)) report "error" 1231650, -332010, -867510, -610470 ), severity warning; (-530145, 109395, -136935, 9945, - end loop; 87975, -23715, -61965, -43605), end loop; (25623675, -5287425, 6618525, -480675, 4252125, 1146225, 2994975, 2107575 ), (12723480, -2625480, 3286440, -238680, end process; 2111400, 569160, 1487160, 1046520 ) end test; variable data: integer; begin configuration cfg_idct_system of idct_system is for test feed data -- sandwich case end for; end cfg_idct_system; for i in 0 to 23 loop wr_addr <=conv_std_logic_vector(i,6); wr_data <= conv_std_logic_vector(255,8); 10.2 IDCT with RAM. wait for 0 ns; end loop; -- includes the IDCT computation and the for i in 24 to 39 loop -- input and output memories. wr_addr <=conv_std_logic_vector(i,6); wr_data <= conv_std_logic_vector(0,8); library ieee; wait for 0 ns: use ieee.std_logic_1164.all; end loop; use ieee.std_logic_misc.all; ``` ``` use ieee.std_logic_arith.all; use ieee.std_logic_components.all; m2: ram2 port map ( rd_addr=>rd_addr, rd_addr2=>addr3, entity idct with ram is wr_addr=>addr2,wr_data=>data2, port ( clk: in std logic; rd data=>rd data, rd data2=>data3): rd_addr: in std logic vector(5 downto 0); m1: ram start: in std logic; port map (rd_addr=>addr1, wr_addr=>wr_addr. wr_addr: in std_logic_vector(5 downto 0); wr_data=>wr_data, rd_data=>data1); wr_data: in std_logic_vector(7 downto 0); idct_i : idct done : out std_logic; port map (clk=>clk, rd data=>data1, rd data2=> rd data: out std logic vector(31 downto 0)); data3, start=>start, done=>done. end idct with ram; rd_addr=>addr1, rd_addr2=>addr3, wr_addr=>addr2, wr_data=>data2); architecture schematic of idct_with_ram is end schematic; data3: std_logic_vector(31 downto 0); signal 10.3 IDCT computation. signal addr3: std logic vector(5 downto 0); signal data2 : std_logic_vector(31 downto 0); -- includes the IDCT computation data1 : std_logic_vector(15 downto 0); signal addr2: std_logic_vector(5 downto 0); signal signal addr1: std_logic_vector(5 downto 0); 10.3.1 IDCT original algorithm behavior. component ram2 port (rd_addr: in std_logic_vector(5 downto 0); rd_addr2: in std_logic_vector(5 downto 0); library ieee; wr_addr: in std_logic_vector(5 downto 0); use ieee.std_logic_1164.all; wr_data: in std_logic_vector(31 downto 0); use ieee.std_logic_misc.all; rd_data: out std_logic_vector(31 downto 0); use ieee.std logic arith.all; rd_data2 : out std_logic_vector(31 downto 0) use ieee.std logic components.all; end component; entity idct is component ram clk: in std_logic; port ( port (rd_addr: in std_logic_vector(5 downto 0); rd_data: in std_logic_vector(15 downto 0); wr_addr: in std_logic_vector(5 downto 0); rd_data2: in std_logic_vector(31 downto 0); wr_data: in std logic vector(7 downto 0); start : in std_logic; rd_data: out std_logic_vector(15 downto 0) done : out std_logic; rd_addr: out std_logic_vector(5 downto 0); end component; rd_addr2: out std_logic_vector(5 downto 0); wr_addr: out std_logic_vector(5 downto 0); component idct wr_data : out std_logic_vector(31 downto 0)); clk: in std logic; port ( end idct; rd_data: in std_logic_vector(15 downto 0); rd_data2: in std_logic_vector(31 downto 0); start : in std_logic; architecture behavioral 1 of idct is done : out std_logic; begin rd_addr: out std_logic_vector(5 downto 0); process rd_addr2: out std_logic_vector(5 downto 0); type rf is array (0 to 7, 0 to 7) of integer; wr_addr: out std_logic_vector(5 downto 0); variable cosblock: rf; wr_data: out std_logic_vector(31 downto 0)); variable tempblock: rf; end component; variable a, b, p, sum: integer; for all: idct use entity work.idct(schematic); begin begin ``` ``` b := cosblock(i, k); p := a * b; --intialize parameter matrix cosblock := ( if (k = 0) then (125, 122, 115, 103, 88, 69, 47, 24), sum := p; (125, 103, 47, -24, -88, -122, -115, -69), else (125, 69, -47, -122, -88, 24, 115, 103), sum := sum + p; (125, 24, -115, -69, 88, 103, -47, -122). end if: (125, -24, -115, 69, 88, -103, -47, 122), if (k = 7) then (125, -69, -47, 122, -88, -24, 115, -103), -- m2(i,j) := sum; (125, -103, 47, 24, -88, 122, -115, 69), wr_addr <= conv_std_logic_vector(i,3) & (125, -122, 115, -103, 88, -69, 47, -24) conv_std_logic_vector(j,3); ): wr_data <= conv_std_logic_vector(sum,32); wait for 0 ns: -- handshaking end if: end loop; wait until start = '1'; end loop; done <= '0': end loop; --matrix multiplication 1 --handshaking for i in 0 to 7 loop for j in 0 to 7 loop wait until clk = '1'; for k in 0 to 7 loop done <= '1'; -a := m1(i,k); rd_addr <= conv_std_logic_vector(i,3) & end process; conv_std_logic_vector(k,3); wait for 4 ns; end behavioral1; a := conv_integer(unsigned(rd_data)); b := cosblock(j, k); IDCT Memory-Optimized 10.3.2 p := a * b; Algorithm Behavior. if (k = 0) then sum := p; library ieee; use ieee.std_logic_1164.all; else sum := sum + p; use ieee.std_logic_misc.all; end if; use ieee.std_logic_arith.all; use ieee.std_logic_components.all; if (k = 7) then tempblock( i, j ) := sum; end if; end loop; entity idct is end loop; port ( clk: in std_logic; end loop; rd_data: in std_logic_vector(15 downto 0); rd_data2: in std_logic_vector(31 downto 0); start : in std_logic; -- matrix multiplication 2 done : out std_logic; rd_addr: out std_logic_vector(5 downto 0); for i in 0 to 7 loop rd_addr2: out std_logic_vector(5 downto 0); for j in 0 to 7 loop wr_addr : out std_logic_vector(5 downto 0); for k in 0 to 7 loop wr_data: out std_logic_vector(31 downto 0)); a := tempblock( k, j ); end idct; ``` ``` wait for 4 ns; architecture behavioral2 of idet is c := conv_integer(signed(rd_data2)); begin end if; process d := cos(k,i): type rf is array(0 to 7,0 to 7) of integer; p := d * temp; if (i=0) then variable i,j,k: integer; sum := p; variable a,b,c,d,p,s,sum,temp: integer; else variable cos: rf; sum := c + p; begin end if; cos := ( -- m2(k,j) := sum; (125, 122, 115, 103, 88, 69, 47, 24), wr_addr <= conv_std_logic_vector(k,3) & (125, 103, 47, -24, -88, -122, -115, -69), conv_std_logic_vector(j,3); (125, 69, -47, -122, -88, 24, 115, 103), wr_data <= conv_std_logic_vector(sum, 32); (125, 24, -115, -69, 88, 103, -47, -122), (125, -24, -115, 69, 88, -103, -47, 122), wait for 4 ns; (125, -69, -47, 122, -88, -24, 115, -103), end loop; (125, -103, 47, 24, -88, 122, -115, 69), end loop; (125, -122, 115, -103, 88, -69, 47, -24) end loop; ); --handshaking -- handshaking ----- wait until clk = '1'; wait until start = '1'; done <= '1'; done <= '0'; end process; end behavioral2; -- matrix 1 and matrix 2 for i in 0 to 7 loop 10.3.3 IDCT Original Algorithm for j in 0 to 7 loop for k in 0 to 7 loop Behavior of Pipelined Design. -- a := m1(i,k);-- reading m1 rd_addr <= conv_std_logic_vector(i,3) & conv_std_logic_vector(k,3); library ieee; use ieee.std_logic_1164.all; wait for 4 ns; use ieee.std_logic_misc.all; a := conv_integer(unsigned(rd_data)); use ieee.std_logic_arith.all; b := cos(j,k); use ieee.std_logic_components.all; p := a * b; if (k=0) then -- includes the controller and datapath processes for s := p; matrix multiplication 1 and 2. else s := s + p; entity idct is end if; port ( clk: in std_logic; end loop; rd_data: in std_logic_vector(7 downto 0); temp := s; rd_data2 : in std_logic_vector(31 downto 0); for k in 0 to 7 loop start: in std logic; if (i/=0) then done : out std_logic; -c := m2(k,j); rd_addr : out std_logic_vector(5 downto 0); rd_addr2 <= conv_std_logic_vector(k,3) & rd_addr2: out std_logic_vector(5 downto 0); conv_std_logic_vector(j,3); ``` ``` wr_addr : out std_logic_vector(5 downto 0); vardone1 := '0'; wr_data: out std_logic_vector(31 downto 0)); when 1 => end idct; nextstate1 := 2; when 2 => architecture behavioral1_pipe15 of idct is nextstate1 := 3: type rf is array (0 to 7, 0 to 7) of integer; when 3 => signal cosblock: rf; nextstate1 := 4; signal tempblock1, tempblock2: rf; when 4 \Rightarrow signal state1, state2: integer:=0; nextstate1 := 5; signal a, b, p1,p2,p3,p4, sum: integer; when 5 \Rightarrow signal a 2, b 2, p1 2,p2 2,p3 2,p4 2, sum 2: inte- nextstate1 := 6; when 6 => signal count, count2: unsigned (9 downto 0):= nextstate1 := 7; "000000000"; when 7 => signal done1, done2: std logic; if (varcount < 512) then signal memswap : std_logic := '0'; nextstate1 := 7; nextstate1 := 8; begin end if: when 8 => nextstate1 := 9; --intialize parameter matrix when 9 \Rightarrow nextstate1 := 10; cosblock <= ( (125, 122, 115, 103, 88, 69, 47, 24), when 10 \Rightarrow (125, 103, 47, -24, -88, -122, -115, -69), nextstate1 := 11; (125, 69, -47, -122, -88, 24, 115, 103), when 11 => (125, 24, -115, -69, 88, 103, -47, -122), nextstate1 := 12; (125, -24, -115, 69, 88, -103, -47, 122), when 12 => (125, -69, -47, 122, -88, -24, 115, -103), nextstate1 := 13; (125, -103, 47, 24, -88, 122, -115, 69), when 13 => (125, -122, 115, -103, 88, -69, 47, -24) nextstate1 := 0; vardone1 := '1'; ); when others => end case: state1 <= curstate1: -- matrix multiplication 1 _____ count <= varcount; if (curstate 1 > 0) then varcount := varcount + 1; -- controller process for matrix multiplication 1 end if: curstate1 := nextstate1; process(clk) variable varcount: unsigned (9 downto 0):= done1 <= vardone1; "0000000000"; end if; variable curstate1, nextstate1: integer := 0; end process; variable vardone1 : std_logic; if (clk='1' and clk'event) then -- datapath process for matrix multiplication 1 case curstate1 is variable s1: integer := 0; when 0 => if start = '1' then variable i, k, j, i2, k2, j2: integer range 0 to 7; nextstate1 := 1; variable varcount, varcount2: unsigned (9 varcount := "0000000000"; downto 0) := "0000000000"; else begin nextstate1 := 0; end if: -- getting the correct indexes for memory fetch ``` ``` -- and memory write when 5 \Rightarrow wait until clk'event and clk = '1'; -- executing pipe stages 1,2,3,4,5 s1 := state1; rd_addr <= conv_std_logic_vector(i,3) & varcount := count; conv_std_logic_vector(k,3); varcount2 := varcount - 6; wait for 4 ns; i := conv_integer( varcount( 8 downto 6 ) ); a <= conv_integer(unsigned(rd_data)); j := conv_integer( varcount( 5 downto 3 )); b \le \cosh(i, k); k := conv_integer( varcount( 2 downto 0 ) ); p1 \le a * b; i2 := conv_integer( varcount2( 8 downto 6 ) ); p2 \le p1; i2 := conv_integer( varcount2( 5 downto 3 )); p3 \le p2; k2 := conv_integer( varcount2( 2 downto 0 )); p4 \le p3; -- executing pipe stages 1,2,3,4,5,6 case s1 is when 6 => when 0 \Rightarrow rd_addr <= conv_std_logic_vector(i,3) & conv_std_logic_vector(k,3); when 1 => wait for 4 ns; -- executing pipe stages 1 a <= conv_integer(unsigned(rd_data)); -a := m1(i,k); b \le \cosh(j, k); rd_addr <= conv_std_logic_vector(i,3) & p1 \le a * b; conv_std_logic_vector(k,3); p2 \le p1; wait for 4 ns; p3 \le p2; a <= conv_integer(unsigned(rd_data)); p4 \le p3: b \le \cosh(j, k); if (k = 5) then sum \le p4; when 2 \Rightarrow else -- executing pipe stages 1,2 sum \le sum + p4; rd_addr <= conv_std_logic_vector(i,3) & end if; conv_std_logic_vector(k,3); wait for 4 ns; when 7 => a <= conv_integer(unsigned(rd_data)); -- executing pipe stages 1,2,3,4,5,6,7 b \le \cosh(j, k); rd_addr <= conv_std_logic_vector(i,3) p1 \le a * b; & conv_std_logic_vector(k,3); wait for 4 ns: when 3 => a <= conv_integer(unsigned(rd_data)); -- executing pipe stages 1,2,3 b \le \cosh(j, k); rd_addr <= conv_std_logic_vector(i,3) & p1 \le a * b; conv_std_logic_vector(k,3); p2 \le p1; wait for 4 ns; p3 \le p2; a <= conv_integer(unsigned(rd_data)); p4 \le p3; b \le \cosh(j, k); if (k = 5) then p1 \le a * b; sum \le p4; p2 \le p1; else sum \le sum + p4; when 4 => end if: -- executing pipe stages 1,2,3,4 if (k = 5 \text{ and varcount } /= 5) then rd_addr <= conv_std_logic_vector(i,3) & if(memswap = '0') then conv_std_logic_vector(k,3); tempblock1(i2, j2) \le sum; wait for 4 ns; elsif(memswap = '1') then a <= conv_integer(unsigned(rd_data)); tempblock2(i2, j2) \le sum; b \le \cosh (i, k); end if; p1 \le a * b; end if: p2 \le p1; p3 \le p2; when 8 => ``` ``` -- executing pipe stages 2,3,4,5,6,7 -- executing pipe stages 5,6,7 p1 \le a * b; p4 \le p3; p2 \le p1; if (k = 5) then p3 \le p2; sum <= p4; p4 \le p3; else if (k = 5) then sum \le sum + p4; sum \le p4; end if: else if (k = 5 \text{ and varcount } /= 5) then sum \le sum + p4; if(memswap = '0') then end if; tempblock1(i2, j2) <= sum; if (k = 5 \text{ and varcount } /= 5) then elsif(memswap = '1') then if(memswap = '0') then tempblock2(i2, j2) <= sum; tempblock1(i2, j2) <= sum; end if: elsif(memswap = '1') then end if: tempblock2(i2, j2) <= sum; end if: when 12 => end if: -- executing pipe stages 6,7 if (k = 5) then when 9 => sum \le p4: -- executing pipe stages 3,4,5,6,7 else p2 \le p1; sum \le sum + p4; p3 \le p2; end if; p4 \le p3; if (k = 5 \text{ and varcount } /= 5) then if (k = 5) then if(memswap = '0') then sum <= p4; tempblock1(i2, i2) <= sum; elsif(memswap = '1') then sum \le sum + p4; tempblock2(i2, j2) <= sum; end if: end if; if (k = 5 \text{ and varcount } /= 5) then end if; if(memswap = '0') then tempblock1(i2, j2) <= sum; when 13 => elsif(memswap = '1') then -- executing pipe stages 7 tempblock2(i2, j2) <= sum; if (k = 5 \text{ and varcount } /= 5) then end if: if(memswap = '0') then end if: tempblock1(i2, j2) <= sum; elsif(memswap = '1') then when 10 => tempblock2(i2, j2) <= sum; -- executing pipe stages 4,5,6,7 end if; p3 \le p2; end if: p4 \le p3; when others => if (k = 5) then end case; sum \le p4; end process; else sum \le sum + p4; -- process for controlling the temp memory end if; if (k = 5 \text{ and varcount } /= 5) then -- swapping if(memswap = '0') then process tempblock1(i2, j2) <= sum; variable countnum: unsigned (9 downto 0):= elsif(memswap = '1') then "0000000000"; tempblock2(i2, j2) <= sum; variable swap : std_logic := '0'; end if: begin end if; wait on count; countnum := count; when 11 => if (countnum = 518) then ``` ``` swap := not swap; nextstate2 := 11: end if: when 11 => memswap <= swap; nextstate2 := 12: end process; when 12 => nextstate2 := 13: when 13 => nextstate2 := 0; --matrix multiplication 2 vardone2 := '1': when others => end case: -- controller process for matrix multiplication 2 state2 <= curstate2; -- synchronized with matrix multiplication 1 by count2 <= varcount; -- waiting the signal Done1 if(curstate2 > 0) then process(clk) varcount := varcount + 1; variable varcount : unsigned (9 downto 0) := end if: "000000000"; curstate2 := nextstate2: variable curstate2, nextstate2: integer := 0; done2 <= vardone2: variable vardone2 : std_logic; end if: end process; if (clk='1' and clk'event) then case curstate2 is -- datapath process for matrix multiplication 2 when 0 => -- pipelining execution is the same as matrix if done1 = '1' then -- multiplication 1 nextstate2 := 1; varcount := "0000000000"; process vardone2 := '0': variable s1 : integer := 0; else variable i, k, j, i2, k2, j2: integer range 0 to 7; nextstate2 := 0; variable varcount, varcount2: unsigned (9 end if: downto 0) := "0000000000"; vardone2 := '0'; begin when 1 => wait until clk'event and clk = '1': nextstate2 := 2: s1 := state2: when 2 \Rightarrow varcount := count2; nextstate2 := 3; varcount2 := varcount - 6; when 3 => i := conv_integer( varcount( 8 downto 6 ) ); nextstate2 := 4; j := conv_integer( varcount( 5 downto 3 )); when 4 \Rightarrow k := conv_integer( varcount( 2 downto 0 )); nextstate2 := 5: i2 := conv_integer( varcount2( 8 downto 6 ) ); when 5 => j2 := conv_integer( varcount2( 5 downto 3 )); nextstate2 := 6; k2 := conv_integer( varcount2( 2 downto 0 ) ); when 6 => nextstate2 := 7; case s1 is when 7 => when 0 => if (varcount < 512) then nextstate2 := 7; when 1 => else if (memswap = '1') then nextstate2 := 8; a_2 \le tempblock1(k, j); end if; elsif (memswap = '0') then when 8 => a_2 \le tempblock2(k, j); nextstate2 := 9; end if: when 9 => b_2 \le \cosh(i, k); nextstate2 := 10; when 10 => when 2 => ``` ``` if (memswap = '1') then else a_2 \le tempblock1(k, j); sum_2 \le sum_2 + p4_2; elsif (memswap = '0') then end if; a 2 \le \text{tempblock2}(k, i); end if: when 7 => b 2 \le \cosh \operatorname{lock}(i, k); if (memswap = '1') then p1_2 \le a_2 * b_2; a_2 \le tempblock1(k, j); elsif (memswap = '0') then when 3 => a 2 \le \text{tempblock2}(k, i): if (memswap = '1') then end if: a_2 \le tempblock1(k, j); b 2 \le \cosh \operatorname{lock}(i, k); elsif (memswap = '0') then p1_2 \le a_2 * b_2; a_2 \le tempblock2(k, j); p2_2 \le p1_2; end if: p3_2 \le p2_2; b_2 \le \cosh(i, k); p4_2 \le p3_2; p1_2 \le a_2 * b_2; if (k = 5) then p2_2 \le p1_2; sum 2 <= p4 2: when 4 => sum_2 \le sum_2 + p4_2; if (memswap = '1') then end if; a_2 \le tempblock1(k, j); if (k = 5 \text{ and varcount } /= 5) then elsif (memswap = '0') then wr_addr <= conv_std logic vec- tor(i2,3) & conv_std_logic_vector(j2,3); a_2 \le tempblock2(k, j); end if: wr_data <= conv_std_logic_vector(- b_2 \le \cosh(i, k); sum_2,32); p1_2 \le a_2 * b_2; wait for 4 ns; p2_2 \le p1_2; end if: p3_2 \le p2_2; when 8 => when 5 => p1_2 \le a_2 * b_2; if (memswap = '1') then p2_2 \le p1_2; a_2 \le tempblock1(k, j); p3_2 \le p2_2; elsif (memswap = '0') then p4_2 \le p3_2; a_2 \le tempblock2(k, j); if (k = 5) then end if; sum_2 \le p4_2; b_2 \le \cosh(i, k); else p1_2 \le a_2 * b_2; sum_2 \le sum_2 + p4_2; p2_2 \le p1_2; p3_2 \le p2_2; if (k = 5 \text{ and varcount } /= 5) then p4_2 \le p3_2; wr_addr <= conv_std_logic_vec- tor(i2,3) & conv_std_logic_vector(j2,3); when 6 => wr_data <= conv_std_logic_vector(- if (memswap = '1') then sum_2,32); a 2 \le \text{tempblock1}(k, j); wait for 4 ns; elsif (memswap = '0') then end if: a_2 \le tempblock2(k, j); when 9 \Rightarrow end if; b_2 \le \cosh(i, k); p2_2 \le p1_2; p3_2 <= p2_2; p1_2 \le a_2 * b_2; p2_2 \le p1_2; p4_2 \le p3_2; p3_2 \le p2_2; if (k = 5) then p4_2 \le p3_2; sum_2 \le p4_2; if (k = 5) then else sum_2 \le p4_2; sum_2 \le sum_2 + p4_2; ``` ``` when 13 => end if; if (k = 5 \text{ and varcount } /= 5) then if (k = 5 \text{ and varcount } /= 5) then wr_addr <= conv_std_logic_vec- wr_addr <= conv_std_logic_vec- tor(i2,3) & conv_std_logic_vector(j2,3); tor(i2,3) & conv_std_logic_vector(j2,3); wr_data <= conv_std_logic_vector(- wr data <= conv std logic_vector(- sum_2,32); sum_2,32); wait for 4 ns: wait for 4 ns; end if: end if: when others => when 10 => end case; end process; p3_2 \le p2_2; p4_2 \le p3_2; if (k = 5) then done <= done2; sum_2 \le p4_2; end behavioral1_pipe15; sum 2 \le sum_2 + p4_2; end if; if (k = 5 \text{ and varcount } /= 5) then 10.3.4 IDCT Optimized Algorithm wr_addr <= conv_std_logic_vec- tor(i2,3) & conv_std_logic_vector(j2,3); Behavior of Pipelined Design. wr_data <= conv_std_logic_vector(- sum_2,32); wait for 4 ns; end if: library ieee; use ieee.std_logic_1164.all; when 11 => use ieee.std_logic_misc.all; p4 \ 2 \le p3 \ 2; use ieee.std_logic_arith.all; if (k = 5) then use ieee.std_logic_components.all; sum_2 \le p4_2; sum_2 \le sum_2 + p4_2; end if; entity idct is if (k = 5 \text{ and varcount } /= 5) then clk: in std_logic; port ( wr_addr <= conv_std_logic_vec- rd_data: in std_logic_vector(15 downto 0); tor(i2,3) & conv_std_logic_vector(j2,3); rd_data2: in std_logic_vector(31 downto 0); wr_data <= conv_std_logic_vector(- start: in std_logic; sum_2,32); done : out std_logic; wait for 4 ns; rd_addr: out std_logic_vector(5 downto 0); end if: rd_addr2: out std_logic_vector(5 downto 0); wr_addr : out std_logic_vector(5 downto 0); when 12 => wr_data: out std_logic_vector(31 downto 0)); if (k = 5) then end idct; sum_2 \le p4_2; sum_2 \le sum_2 + p4_2; architecture behavioral2_pipe2 of idct is end if: type rf is array(0 to 7,0 to 7) of integer; if (k = 5 \text{ and varcount } /= 5) then wr addr <= conv_std_logic_vec- signal a,b,c,c1,c2,c3,c4,d,p1,p2,p3,p4,p5,p6,p7,p8, tor(i2,3) & conv_std_logic_vector(j2,3); s,sum,temp,temp1: integer:=0; wr_data <= conv_std_logic_vector(- signal i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13, sum_2,32); j1,j2,j3,j4,j5,j6,j7,j8,j9,j10,j11,j12,j13, wait for 4 ns; k1,k2,k3,k4,k5,k6,k7,k8,k9,k10,k11,k12,k13: end if: integer; ``` ``` signal cos: rf; constant period: time := 4 ns; -- pipeline contrller 1 signal en: std_logic_vector(0to process(clk) 12):="0000000000000"; variable 1: integer := 0; signal start1, start2: std_logic := '0'; begin signal state1, state2, state: integer:=0; if clk='1' then begin case state1 is cos <= ( when 0 => (125, 122, 115, 103, 88, 69, 47, if start1 = '1' then (125, 103, 47, -24, -88, -122, -115, -69), state1 <= 1: (125, 69, -47, -122, -88, 24, 115, 103), en(5) \le en(4) after period; (125, 24, -115, -69, 88, 103, -47, -122), en(4) <= en(3) after period; (125, -24, -115, 69, 88, -103, -47, 122), en(3) \le en(2) after period; (125, -69, -47, 122, -88, -24, 115, -103), en(2) \le en(1) after period; (125, -103, 47, 24, -88, 122, -115, 69), en(1) \le en(0) after period; (125, -122, 115, -103, 88, -69, 47, -24) en(0) <= '1' after period; ); il <= 1/64 after period; j1 <= (1 mod 64) / 8 after period; -- top level controller (controlling the two pipeline k1 <= 1 mod 8 after period; -- controllers) 1 := 1+1: else process(clk) en(5) \le en(4) after period; variable l:integer:=0; en(4) \le en(3) after period; begin en(3) <= en(2) after period; if clk='1' then en(2) \le en(1) after period; case state is en(1) \le en(0) after period; when 0 => en(0) \le '0' after period; if start='1' then 1 := 0; i1 <= 1/64 after period; state <= 1; start1 <='1': j1 \le (1 \mod 64) / 8 after period; start2 <='0'; k1 <= 1 mod 8 after period; 1 := 1; end if: start1 <= '0'; when 1 => start2 <= '0'; if 1<512 then 1 := 0; en(5) \le en(4) after period; end if; en(4) \le en(3) after period; en(3) \le en(2) after period; when 1 => en(2) \le en(1) after period; if I<13 then en(1) \le en(0) after period; start1 <= '0'; en(0) <= '1' after period; start2 <= '0'; i1 <= 1/64 after period; 1 := 1 + 1; j1 \le (1 \mod 64) / 8 after period; k1 <= 1 mod 8 after period; start1 <='0'; 1 := 1+1; start2 <='1'; else state \leq 0; en(5) \le en(4) after period; end if; en(4) \le en(3) after period; en(3) \le en(2) after period; when others => en(2) \le en(1) after period; end case: en(1) \le en(0) after period; end if; en(0) <= '0' after period; end process; 1 := 0; il <= 1/64 after period; ``` ``` i1 \le (1 \mod 64) / 8 after period; k7 <= 1 mod 8 after period; 1 := 1+1; k1 <= 1 mod 8 after period; else state 1 \le 0; en(12) \le en(11) after period; end if; when others => en(11) \le en(10) after period; en(10) \le en(9) after period; end case: en(9) \le en(8) after period; end if; en(8) \le en(7) after period; end process; en(7) \le en(6) after period; en(6) <= '0' after period; -- pipeline controller 2 1 := 0; process(clk) i7 \le 1/64 after period; variable 1: integer := 0; i7 <= (1 mod 64) / 8 after period; begin if clk = '1' then k7 <= 1 mod 8 after period; state 2 \le 0; case state2 is when 0 \Rightarrow end if: if start2 = '1' then when others => state2 \le 1; en(12) \le en(11) after period; end case: en(11) \le en(10) after period; end if: en(10) \le en(9) after period; end process; en(9) \le en(8) after period; en(8) \le en(7) after period; en(7) \le en(6) after period; en(6) <= '1' after period; -- matrix 1 i7 <= 1/64 after period; i7 <= (1 mod 64) / 8 after period; k7 \le 1 \mod 8 after period; process begin 1 := 1+1; wait on clk; else if (clk='1') and (en(0) ='1') then en(12) \le en(11) after period; en(11) \le en(10) after period; i2 <= i1 after period; j2 <= j1 after period; en(10) \le en(9) after period; en(9) \le en(8) after period; k2 <= k1 after period; b \le \cos(i1,k1) after period; en(8) \le en(7) after period; -- a <= m1(i,k);-- reading m1 en(7) \le en(6) after period; rd_addr <= conv_std_logic_vector(i1,3) & en(6) <= '0' after period; 1 := 0; conv_std_logic_vector(k1,3); i7 <= 1/64 after period; i7 <= (1 mod 64) / 8 after period; wait for period; a <= conv_integer(unsigned(rd_data)); k7 <= 1 mod 8 after period; end if: end if: end process; when 1 => if 1<512 then process(clk) en(12) \le en(11) after period; begin if (clk='1') and (en(1) ='1') then en(11) <= en(10) after period; p1 <= a * b after period; en(10) \le en(9) after period; i3 <= i2 after period; en(9) \le en(8) after period; i3 <= i2 after period; en(8) \le en(7) after period; k3 <= k2 after period; en(7) \le en(6) after period; en(6) <= '1' after period; end if: end process; i7 \le 1/64 after period; i7 \le (1 \mod 64) / 8 after period; ``` ``` process(clk) k8 <= k7 after period; begin temp1 <= temp after period; if (clk='1') and (en(2) ='1') then d \le \cos(k7,i7) after period; p2 <= p1 after period; if (i7/=0) then i4 <= i3 after period; -c \le m2(k,j); i4 <= i3 after period: rd_addr2 <= conv_std_logic_vector(k7,3) k4 <= k3 after period; & conv_std_logic_vector(j7,3); end if; end process; wait for period; c <= conv_integer(signed(rd_data2)); process(clk) end if: begin end if; if (clk='1') and (en(3) ='1') then end process; p3 <= p2 after period; i5 <= i4 after period; process(clk) j5 <= j4 after period; begin k5 <= k4 after period; if (clk='1') and (en(7)='1') then end if: p5 <= d * temp1 after period; end process; i9 <= i8 after period; i9 <= i8 after period; process(clk) k9 <= k8 after period; begin c1 <= c after period; if (clk='1') and (en(4) ='1') then end if: p4 <= p3 after period; end process; i6 <= i5 after period: j6 <= j5 after period; process(clk) k6 <= k5 after period; begin end if: if (clk='1') and (en(8) ='1') then end process; p6 <= p5 after period; i10 <= i9 after period; process(clk) i10 <= i9 after period; begin k10 <= k9 after period; if (clk='1') and (en(5) ='1') then c2 <= c1 after period; if (k6=0) then end if; s <= p4 after period; end process; else s \le s + p4 after period; process(clk) if (k6=7) then begin temp \leq s + p4 after period; if (clk='1') and (en(9) ='1') then end if: p7 <= p6 after period; end if: ill <= il0 after period; end if; j11 <= j10 after period; end process; k11 <= k10 after period; c3 <= c2 after period; end if: -- matrix 2 end process; process(clk) process begin begin if (clk='1') and (en(10) ='1') then wait on clk; p8 <= p7 after period; if (clk='1') and (en(6) ='1') then i12 <= i11 after period; i8 <= i7 after period; j12 <= j11 after period; j8 <= j7 after period; k12 <= k11 after period; ``` ``` c4 <= c3 after period; end if; end process; process(clk) begin if (clk='1') and (en(11) ='1') then if (i12=0) then sum <= p8 after period; else sum <= c4 + p8 after period; end if; i13 <= i12 after period; j13 <= j12 after period; k13 <= k12 after period; end if: end process; process(clk) begin if (clk='1') and (start='1') then done <= '0'; end if; if (clk='1') and (en(12)='1') then -- m2(k,j) \le sum; wr_addr <= conv_std_logic_vector(k13,3) & conv_std_logic_vector(j13,3) after period; wr_data <= conv_std_logic_vector(sum, 32) after period; if (i13=7) and (j13=7) and (k13=7) then done <='1' after 2*period, '0' after 3 * period; end if; end if; end process; end behavioral2_pipe2; ``` Page From: "J. DeWayne Green" <dgreen@binky.ICS.UCI.EDU> To: christy@binky.ICS.UCI.EDU Cc: dgreen@binky.ICS.UCI.EDU, pazzani@binky.ICS.UCI.EDU Subject: New Confect 14.06630, 0700 Date: Thu, 19 Sep 1996 14:06:39 -0700 Christy, Juancho and I just made a new conference room (CS 430 C) which you can now begin to schedule as needed. Check it out for size, seating, etc. Let me know if you have any questions. DeW J. DeWayne Green Phone: (714) 824-7403 Fax: (714) 824-3976 E-mail: jdgreen@uci.edu Administrative Officer Information and Computer Science (ICS) University of California, Irvine URL: www.ics.uci.edu/~dgreen From: Padhraic Smyth <smyth@galway.ICS.UCI.EDU> To: Christina Crandall <christy@binky.ICS.UCI.EDU> Subject: Re: visitor on Oct. 25th Date: Thu, 19 Sep 1996 07:49:42 -0700 In email message <9609181346.aa26729@paris.ics.uci.edu>, Christina Crandall >Hi, >I'm sorry that it took so long for me to >reply to your email regarding your visitor >on the 25th of October. I have been out >of the office for about the last week and >a half. no problem >The 432 & 438 conference room has already >been reserved for you ( I believe that >Bernie took care of that), for 10:30 ? > I will be happy >to make up an announcement of the talk and >to distribute it. I will also make a schedule >sign-up sheet for people to meet with your >visitor. thanks >Would you like refreshments to be served >before the talk? If so please let me know. Yes, just coffee would be fine. >Also, does Carla Brodley need a parking permit >for the 25th? Yes. Do you want her mailing address to send it to her? Padhraic From: "Isaac D. Scherson" <isaac@mars.ICS.UCI.EDU> To: ICS Front Office <foffice@binky.ICS.UCI.EDU> Cc: christy@binky.ICS.UCI.EDU Subject: Re: Books published in 1995 or 1996 Date: Wed, 21 Aug 1996 13:11:56 -0700 In message <9608200940.aa13384@paris.ics.uci.edu> you write: > Has anyone besides Lubomir Bic, Jonathan Grudin, > Rick Selby, or Padraic Smyth authored a book > published in 1995 or 1996? If you have please > let me know. We would like to include a paragraph > on the books published within that time frame in > the Computing Research Review. > > Thank you, > Christy My IEEE book (on display by the business office) is in its second edition and was published within the period you request. isaac 1 From: "J. Chris Leiker" <chris@concorde.ICS.UCI.EDU> To: Christina Crandall <christy@binky.ICS.UCI.EDU> Subject: Re: emergency phone list Date: Wed, 04 Sep 1996 11:02:33 –0700 Hi Christy: Sorry for the delay on this, but here's the info you requested. Name: Chris Leiker address: 4 Willowood Laguna Hills, CA 92656 (714) 362-4636 Emergency contact: Dennis Leiker, husband, (619) 536-3873/pager It is ok to release this info to ICS faculty & staff ONLY in case of emergency. I would not like this released to other UCI employees, though. Thanks, Chris