Matrix multiplication on hypercubes using full bandwidth and constant storage
Abstract
For matrix multiplication on hypercube multiprocessors with the product matrix accumulated in place a processor must receive about P2/√N elements of each input operand, with operands of size P×P distributed evenly over N processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to P2/√ log N for each input operand. We present a two-level partitioning of the matrices and an algorithm for the matrix multiplication with optimal data motion and constant storage. The algorithm has sequential arithmetic complexity 2P3, and parallel arithmetic complexity 2P3/N. The algorithm has been implemented on the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured about 1.6 Gflops, which would scale up to about 13 Gflops for a 64K full machine.