Question 1: Vector architectures exploit data-level parallelism to achieve significant speedup. For programmers, the general tendency is to make the problem or data bigger. For example, programmers ten years ago might have wanted to model a map with a 1000 x 1000 single-precision floating-point array, but might now want to do this with a 5000 x 5000 double-precision floating-point array. Evidently, there is abundant data-level parallelism to exploit. Give some reasons why computer architects do not build a super-big vector machine (in terms of the number and the length of vector registers) to take advantage of this opportunity.
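To put the growth in Question 1 into numbers, the short sketch below computes the memory footprint of the two arrays. It assumes 4 bytes per single-precision element and 8 bytes per double-precision element; the helper name `array_bytes` is hypothetical and used only for illustration.

```python
def array_bytes(rows, cols, bytes_per_element):
    """Return the storage needed for a dense rows x cols array, in bytes."""
    return rows * cols * bytes_per_element


# 1000 x 1000 single-precision array (assumed 4 bytes per element)
old_bytes = array_bytes(1000, 1000, 4)
# 5000 x 5000 double-precision array (assumed 8 bytes per element)
new_bytes = array_bytes(5000, 5000, 8)

print(old_bytes)               # 4000000
print(new_bytes)               # 200000000
print(new_bytes // old_bytes)  # 50
```

The data set grows by a factor of 50, which is the scale a "super-big" vector register file would have to keep up with.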
Question 2: What are the merits and demerits of fine-grained multithreading, coarse-grained multithreading and simultaneous multithreading?
Question 3: Consider two multiprocessor machines with the following configurations:
a) Machine 1, a NUMA machine with two processors, each with 512 MB of local memory, a local memory access latency of 20 cycles per word, and a remote memory access latency of 60 cycles per word.
b) Machine 2, a UMA machine with two processors and 1 GB of shared memory with an access latency of 40 cycles per word.
Assume that an application has two threads running on the two processors, and that each thread must access an entire array of 4096 words. Is it possible to partition this array across the local memories of the NUMA machine so that the application runs faster on it than on the UMA machine? If so, specify the partitioning. If not, by how many cycles must the UMA memory latency be worsened for some partitioning on the NUMA machine to run faster than the UMA machine? Assume that memory operations dominate the execution time.
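One way to check an answer to Question 3 is to evaluate every possible split of the array. The sketch below is not an official solution. It assumes that the two threads run concurrently, that each thread reads every word exactly once, that there is no memory contention, and that the run time is that of the slower thread. The function names `numa_time` and `uma_time` are hypothetical.

```python
def numa_time(words_on_p0, total_words=4096, local_lat=20, remote_lat=60):
    """Cycles until both threads finish on the NUMA machine.

    words_on_p0 words of the array are placed in processor 0's local
    memory; the rest are placed in processor 1's local memory.
    Assumption: run time equals the slower thread's total access time.
    """
    words_on_p1 = total_words - words_on_p0
    thread0 = local_lat * words_on_p0 + remote_lat * words_on_p1
    thread1 = local_lat * words_on_p1 + remote_lat * words_on_p0
    return max(thread0, thread1)


def uma_time(total_words=4096, latency=40):
    """Cycles for one thread to read the whole array on the UMA machine."""
    return latency * total_words


# Try every split from 0 to 4096 words on processor 0 and keep the fastest.
best_split = min(range(4096 + 1), key=numa_time)

print(best_split)             # 2048
print(numa_time(best_split))  # 163840
print(uma_time())             # 163840
```

Comparing the last two totals settles the first part of the question under these assumptions. Calling `uma_time(latency=41)` and so on shows how the comparison changes as the UMA latency is worsened.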