Managing risk in multi-node automation of endpoint management
Abstract
Endpoint management, including patching, health checking, configuration etc., is a key function for data center and cloud management. Managing multiple nodes through automation tools or scripts significantly increases efficiency. However, the risks of adverse impact due to excessive privilege or human error may propagate to a large pool of endpoints and lead to massive service disruptions and SLA (Service Level Agreement) violations. In this paper, we present a system that proactively and systematically manages the risk throughout the lifecycle phases of automation. We present a prototype implementation consisting of an authorization mechanism that guarantees the right level of eligibility and privilege of accessing the automation content (during the deployment stage), and an execution validator that controls the risk of human error which may cause massive damage to the infrastructure (during execution of the automation content). Our current implementation has been deployed to more than a dozen customer environments and achieved an efficiency gain of 58% with high execution accuracy. © 2014 IEEE.