Problem
Devise a minimum-time algorithm to multiply two 64 x 64 matrices, A = (0,) and 8 = (EV, on an SIMD machine consisting of 64 PEs with local memory. The 64 PEs are interconnected by a 2D 8 x 8 torus with bidirectional links.
(a) Show the initial distribution of the input matrix elements (a4 and (b) on the PE memories.
(b) Specify the SIMD instructions needed to carry out the matrix multiplication. Assume that each PE can perform one multiply, one odd, or one shift (shifting data to one of its four neighbors) operation per cycle. You should first compute all the multiply and odd operations on local data before starting to route data to neighboring PEs. The SIMD shift operations can be either east, west, south. or north with wraparound connections on the COILS. (c) Estimate the total number of SIMD instruction cycles needed to compute the matrix multiplication. The time includes all arithmetic and data-routing operations. The final product elements C = A x 8 = (cf) end up in various PE memories without duplication.
(c) Estimate the total number of SIMD instruction cycles needed to compute the matrix multiplication. The time includes all arithmetic and data-routing operations. The final product elements C = A x 8 = (c1) end up in various PE memories without duplication.
(d) Suppose data duplication is allowed initially by loading the same data element into multiple PE memories. Devise a new algorithm to further reduce the SIMD execution time. The initial data duplication time, using either data broadcast Instructions or data touting (shifting) instructions, must be counted. Again, each result element cl ends up in only one PE memory.