Suppose you need to speed up a matrix multiply by maximizing its parallel execution. List any other ideas you have for improving the inner loop.
For example, could the nest of three loops be restructured?
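One possible restructuring, sketched here only as an illustration, is to interchange the two inner loops (from i-j-k order to i-k-j order). Assuming the matrices are stored row by row, the innermost loop then walks a row of B and a row of C with unit stride, and its iterations are independent of one another, which is the shape that vectorizing compilers and parallel hardware tend to handle well. The function names `matmul_ijk` and `matmul_ikj` are hypothetical, and pure Python is used just to show the loop structure; any speed benefit would show up in a compiled language.

```python
def matmul_ijk(A, B):
    """Textbook order: the inner loop over k strides down a column of B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = 0.0
            for k in range(m):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C


def matmul_ikj(A, B):
    """Interchanged order: the inner loop over j walks rows of B and C."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        row_c = C[i]
        for k in range(m):
            a_ik = A[i][k]      # loaded once, reused across the whole inner loop
            row_b = B[k]
            for j in range(p):  # independent iterations, unit stride
                row_c[j] += a_ik * row_b[j]
    return C
```

Both functions compute the same product; only the order in which the multiply-add operations are issued differs.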
Read about "tiling" (AKA "blocking") in the literature; e.g. here: htt df
Then outline, in a brief paragraph, the principle of what can be done.
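As a rough sketch of the tiling principle: the three loops are split into outer loops that step over tiles and inner loops that work inside one tile, so that a small sub-block of each matrix is reused many times while it is still in cache. Distinct tiles of C do not depend on each other, so they could also be handed to different threads. The function name `matmul_tiled` and the parameter `tile` are hypothetical, and the tile size of 2 is chosen only to keep the example small; in practice it would be tuned to the cache size.

```python
def matmul_tiled(A, B, tile=2):
    """Blocked matrix multiply: process tile x tile sub-blocks at a time."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    # Outer loops step from tile to tile.
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # Inner loops stay inside one tile; min() handles
                # matrix sizes that are not a multiple of the tile size.
                for i in range(ii, min(ii + tile, n)):
                    row_c = C[i]
                    for k in range(kk, min(kk + tile, m)):
                        a_ik = A[i][k]
                        row_b = B[k]
                        for j in range(jj, min(jj + tile, p)):
                            row_c[j] += a_ik * row_b[j]
    return C
```

The result is the same as for the untiled loop nest; the sketch changes only the order of the work, which is what improves data reuse in a compiled implementation.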