Distributed and fault-tolerant execution framework for transaction processing
Abstract
There is a growing need for efficient distributed computing for transaction processing. One of the key requirements for runtime systems in distributed environments is fault tolerance. Such a system needs to preserve the data consistency at transaction boundaries so as to resume the ongoing tasks from checkpoints with consistent data for any component failure. Another key requirement is that the system needs to be lightweight enough in normal execution to provide scalable performance. This paper presents the design and implementation of a new fault tolerant execution framework that addresses both of these requirements. We replicate each partition of the distributed persistent data on three nodes (triplet) with two different types of backups, one using warm replication and the other using cold replication. For node failures, the system is automatically recoverable unless all three nodes in any triplet fail at the same time. The system tolerates simultaneous two-node failures in any triplet most of the cases. We obtained a new trade-off in that 43% performance improvements can be achieved by slightly compromising the system availability. Copyright 2011 ACM.