Publication
SC 2011
Conference paper
Poster: Scalable infrastructure to support supercomputer resiliency-aware applications and load balancing
Abstract
High performance computing systems display increasing complexity and component counts. This trend exposes weaknesses in the underlying clustering infrastructure needed for continuous availability, maximal utilization, and efficient administration of such systems. To mitigate the problem, we present a highly scalable clustering infrastructure, based on peer-to-peer technologies, for supporting resiliency-aware applications as well as efficient monitoring and load balancing. Supported services include Membership, Publish-subscribe messaging, Convergecast, Attribute replication, and a DHT. We present a preliminary evaluation performed on an IBM BlueGene/P, demonstrating scalability up to ~256K nodes.