High-Performance Computer Architecture Midterm Exam
Problem 1 - Amdahl's Law
We are optimizing a program which spends 25% of its time in third-party libraries. We do not have source code for those libraries and, as a result, cannot apply any optimizations to them. What speedup do we need on the remaining 75% of the program to achieve an overall speedup of 2 on the entire program?
Problem 2 - Processor Performance Equation
We are working on a new processor, Techium 5, which is going to replace the old Techium 4. Techium 5 has an improved ISA that results in executing only half as many instructions as Techium 4 for the same program. Additionally, the CPI on Techium 5 is reduced by 40% and clock frequency is increased by 25% compared to Techium 4. What is the overall speedup of moving from Techium 4 to Techium 5?
Problem 3 - Branch Prediction
The accuracy of a predictor is the fraction of all predictions that turn out to be correct. Asymptotic accuracy of a predictor is its accuracy when the time over which we observe its behavior approaches infinity. A processor is executing a long-running program in which the most frequently occurring branch is having the following repeating sequence of outcomes:
T, T, NT, T, NT, T, T, NT, T, NT, etc.
(twice T, once NT, once T, once NT, then repeat)
A) What is the asymptotic accuracy for a one-bit predictor on this branch?
B) What is the asymptotic accuracy of a two-bit counter predictor on this branch? (Assume that we are using the using the normal 0 ↔ 1 ↔ 2 ↔ 3 counter for the 2-bit predictor)
C) What is the asymptotic accuracy on this branch for a local history predictor with a two-bit history and (four) one-bit counters?
D) Can we design a local history predictor with K bits of history and one-bit counters that would have asymptotic accuracy of 1 (always correct after the initial warm-up period) on this branch? If yes, how many bits of history do we need? If no, why not?
Problem 4 - Predication
Consider the code fragment shown below. We show source code, the actual instructions the compiler generates without using predication, and the instructions generated when using conditional-move predication. The predicated code is not complete - make sure you complete the last instruction (MOVN) in the predicated code before you answer the rest of this question.
Source
|
MIPS (No predication)
|
MIPS (predication)
|
if cond then
a= a - 1;
b = b + 1;
endif
|
BEQZ R1, End
ADDI R2, R2, -1
ADDI R3, R3, 1
End:
|
ADDI R8, R2, -1
ADDI R9, R3, 1
MOVN R2, R8, R1 ; if R1!=0 then R2 = R8
MOVN R _____, R _____, R _____
|
Without branch mispredictions, the processor achieves the CPI of 1, i.e. it completes one instruction every cycle. A branch misprediction results in the one cycle to complete the branch instruction, plus 10 wasted cycles.
Suppose that "cond" is true 90% of the time, i.e. out of 1000 times that this code is executes, R1 is zero 100 times and is non-zero 900 times. Recall that "BEQZ" stands for "branch on equal to zero" and that the "N" in MOVN stands for "non-zero".
- With a perfect branch predictor (no branch misprediction penalty ever), what is the average execution time of this code fragment without predication? What is the average execution time with predication? Which would you use? Note: You can compute the average execution time of this fragment by computing how long it takes to execute the 1000 execution of the fragment as illustrated above, then dividing that number by 1000.
- Suppose now that we have a branch predictor whose overall accuracy is 50%. What is now the average execution time with predication and without predication? Which would you use?
- The predicated code performs better than the non-predicated code when the overall branch predictor accuracy is below ______ percent. Show your work.
Problem 5 -
A one-wide processor uses out-of order execution with an ROB. The processor has two unified reservation stations (RS0 andRS1) and four ROB entries (ROB0 through ROB3). There are three fully pipelined execution units, one for branches and two for all other instructions. Execution takes two cycles for any instruction. Every instruction spends a cycle for a CDB broadcast, even those instructions that don't produce any values. A reservation station is freed in the cycle in which its instruction begins execution, and another instruction can use that RS in the next cycle after that. Similarly, a ROB entry can be reused in the next cycle after the one in which it was freed. An instruction that is waiting for a result to be produced can begin execution in the next cycle after the one in which its last missing operand is broadcast. Assume that all branches are correctly predicted, that the ROB is empty at the beginning of the starting cycle (cycle 1), and that the following instructions are already fetched, decoded, and waiting in a buffer to be issued:
ADD R1, R2, R2
ADD R2, R1, R3
SUB R3, R0, R4
BNE R1, R2, Label1
ADD R4, R1, R4
BNE R3, R0, Label2
ADD R3, R3, R4
Note that this is not a listing of a program, this is a list of instructions in the pre-issue buffer, in the order in which they appear in that buffer. In other words, branches have already been (correctly) predicted and these are the instructions that have been fetched and decoded.
A) Which instruction is the last one to commit? Why?
B) What is the first cycle in which a branch is committed? Why?
C) What is the first cycle in which no instruction is issued? Why?
D) Which tag (name) is broadcast on the CDB when the last ADD is executed? Why?
Problem 6 - Tomasulo's Algorithm
A processor uses Tomasulo's algorithm in its floating-point unit, and is executing the sequence of instructions given in the first table on this page. You don't know what the execution times are for various instructions, but you do know the following
- There are 3 reservation stations, 2 for the ADD unit and 1 for the MUL unit. If multiple reservation stations of the right kind are free, the processor uses the first free one (the RS with the lowest number).
- When a reservation station is freed in some cycle, it cannot be used by another instruction in the same cycle in which it was freed.
- There is a cycle (let's call it Cycle X) during which the processor's CDB is used for the very first time, and the result that is broadcast on the CDB during cycle X is a result of an ADD instruction.
A) At the end of cycle X, for each instruction determine whether or not it has issued, began execution, and whether its result has been written. To help you out, we have already filled out this data for the first instruction. Note: "Execute" means that the instruction has been sent from the RS to the execution unit, but such an instruction may or may not have completed execution.
|
Instruction Status
|
|
Issue?
|
Execute?
|
Wr. Result?
|
MUL F1, F2, F3
|
Yes
|
Yes
|
No
|
ADD F1, F0, F1
|
|
|
|
ADD F2, F3, F2
|
|
|
|
ADD F3, F1, F4
|
|
|
|
ADD F0, F2, F3
|
|
|
|
B) What is the content of the reservation stations at the end of cycle X?
Reservation Stations
|
Name
|
Busy?
|
Op
|
Vj
|
Vk
|
Qj
|
Qk
|
Add1
|
|
|
|
|
|
|
Add2
|
|
|
|
|
|
|
Mul1
|
|
|
|
|
|
|
C) At the beginning of cycle X, what is the content of the processor's RAT?
Register
|
F0
|
F1
|
F2
|
F3
|
F4
|
Mapping
|
|
|
|
|
|
Problem 7 - Compiler Optimizations and VLIW
We have a VLIW processor with 16 registers and the following units:
1) A branch unit (for branch and jump operations), with a one-cycle latency.
2) A load/store (LW/SW operations) unit, with a six-cycle latency.
3) Two arithmetic units for all other operations, with a two-cycle latency.
Each instruction specifies what each of these four units should be doing in each cycle. The processor does not check for dependences nor does it stall to avoid violating program dependences - the program must be written so that all dependences are satisfied. For example, when a load-containing instruction is executed, the next five instructions cannot use that value. If we execute the following loop (each row represents one VLIW instruction), which performs element-wise addition of two 60-element vectors (whose address are in R0 in R1) into a third (address is in R2), where the end address of the result vector is in R3, and each element is a 32-bit integer:
Label
|
Branch Unit
|
LD/ST Unit
|
Arith. Unit 1
|
Arith. Unit 2
|
Loop:
|
NOP
|
LW R4, 0(R0)
|
NOP
|
NOP
|
|
NOP
|
LW R5, 0(R1)
|
ADDI R4, R4, 4
|
ADDI R2, R2, 4
|
|
NOP
|
NOP
|
ADDI R5, R5, 4
|
NOP
|
|
NOP
|
NOP
|
NOP
|
NOP
|
|
NOP
|
NOP
|
NOP
|
NOP
|
|
NOP
|
NOP
|
NOP
|
NOP
|
|
NOP
|
NOP
|
NOP
|
NOP
|
|
NOP
|
NOP
|
ADD R6, R4, R5
|
NOP
|
|
NOP
|
NOP
|
NOP
|
NOP
|
|
BNE R2, R3, Loop
|
SW R6, -4(R2)
|
NOP
|
NOP
|
Note how we increment the vector points while waiting for loads to complete. Also note how each operation reads its source registers in the first cycle of its execution, so we can modify source registers in subsequent cycles. Each operation also writes its result registers in the last cycle, so we can read the old value of the register until then. Overall, this code takes ten cycles per element, for a total of 600 cycles.
A) If we unroll this loop twice and re-schedule the code in each (new) iteration, what does the code look like (use only as many instruction slots in the table below as you need) and how many cycles does the entire 60-iteration (20 iterations after unrolling) loop take now?
B) If we apply software pipelining to split the loop (the one from page 6, not your unrolled loop from page 7) into two stages (one loads , the other adds and stores) and then re-schedule the code in each iteration, what does the code look like and how many cycles does the entire loop take now?
Your code for the SW-pipelined loop (show prologue and epilogue code, too).