Analyzing and improving performance scalability of commercial server workloads on a chip multiprocessor
Abstract
A chip multiprocessor (CMP) with many low-performance cores can achieve high performance or high performance per watt for commercial server applications. However, the large number of hardware threads on such a CMP makes it significantly harder for application developers to write scalable applications. Many papers have assessed the architectural characteristics and performance scalability of these systems, and some have identified lock contention as one of the scalability bottlenecks. Few studies, however, have resolved these problems, analyzed their causes, and compared the architectural characteristics before and after the scalability limitations were addressed. We analyzed and resolved some of the problems limiting the scalability of three commercial server applications running on 64 hardware threads. We also compared the architectural characteristics affected by the scalability enhancements before and after the enhancements were applied, in support of the development of new processors. We addressed the lock contention with changes in the Java code. Our enhancements improved the performance scalability by up to 132%. We show that although the causes of lock contention lie in different software layers, they share certain similarities and can be organized into three categories. Our comparisons reveal that the CPI and data TLB miss rates decrease, whereas the L2 data cache miss rates, L2 instruction cache miss rates, and memory traffic increase. These results suggest that the performance scalability problems of an application must be addressed before the architectural characteristics of a CMP can be measured accurately. © 2009 IEEE.