Assuming this loop is executed many times, what is the steady-state CPI of the loop on the scalar pipeline discussed in class, with forwarding, branches resolved in the ID stage, and no branch delay slot?
loop: lw   $6, 4000($7)
      add  $9, $6, $3
      or   $5, $9, $6
      lw   $2, 2000($5)
      add  $3, $9, $2
      subi $5, $5, 12
      sw   $9, 2000($3)
      bne  $9, $0, loop
Now, assuming the same machine but with a branch delay slot, rearrange this code to improve performance and give me the new CPI.
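
For the first part, one way to check a hand count is a small stall counter. The sketch below is not the class's pipeline definition. It assumes a classic five-stage pipeline (IF, ID, EX, MEM, WB) with full forwarding, a one-cycle load-use stall, a branch that reads its operands in ID, and one bubble after every taken branch when there is no delay slot. The names `parse`, `cycles_per_iteration`, and `branch_penalty` are made up for this illustration, and stores are treated like ALU instructions that need all their operands in EX, which is a simplification.

```python
import re

# Loop body from the question, one instruction per entry.
LOOP = [
    "lw $6, 4000($7)",
    "add $9, $6, $3",
    "or $5, $9, $6",
    "lw $2, 2000($5)",
    "add $3, $9, $2",
    "subi $5, $5, 12",
    "sw $9, 2000($3)",
    "bne $9, $0, loop",
]

BRANCHES = ("bne", "beq")


def parse(instr):
    """Return (opcode, destination register or None, set of source registers)."""
    op, rest = instr.split(None, 1)
    regs = re.findall(r"\$\d+", rest)
    if op == "sw" or op in BRANCHES:
        return op, None, set(regs)  # stores and branches write no register
    return op, regs[0], set(regs[1:])


def cycles_per_iteration(loop, branch_penalty=1, iterations=3):
    """Steady-state cycles per loop iteration under the assumed pipeline model."""
    ready_ex = {}  # register -> earliest ID cycle of a consumer that reads it in EX
    ready_id = {}  # register -> earliest ID cycle of a branch that reads it in ID
    cycle = 0      # ID cycle of the previously issued instruction
    starts = []    # ID cycle of the first instruction of each iteration
    for _ in range(iterations):
        for k, instr in enumerate(loop):
            op, dest, srcs = parse(instr)
            ready = ready_id if op in BRANCHES else ready_ex
            # Issue in order, waiting until every source can be forwarded.
            cycle = max([cycle + 1] + [ready.get(r, 0) for r in srcs])
            if k == 0:
                starts.append(cycle)
            if dest is not None and dest != "$0":
                is_load = op == "lw"
                # Assumption: ALU results forward from the end of EX,
                # load results forward from the end of MEM.
                ready_ex[dest] = cycle + (2 if is_load else 1)
                ready_id[dest] = cycle + (3 if is_load else 2)
            if op in BRANCHES:
                cycle += branch_penalty  # assumed bubble after a taken branch
    return starts[-1] - starts[-2]


cycles = cycles_per_iteration(LOOP)
cpi = cycles / len(LOOP)
print(cycles, cpi)  # 11 1.375 under the assumptions above
```

Under these assumptions the 11 cycles break down as 8 instructions, 2 load-use stalls (after each `lw`), and 1 branch bubble. For the second part, a reordered list (with the delay-slot instruction placed right after `bne`) can be passed with `branch_penalty=0` to compare candidates; the result is only as good as the model matches the pipeline from class.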