By Suresh Gowda, ICP-ACC, CSM, CSPO, SPC4, AWS-CSA-Associate
Senior Consultant, Impact Makers
How to quickly recover from service outages
Service-Oriented Architecture (SOA), microservices, and cloud best practices can result in loosely coupled, complex systems with many potential points of failure. At Impact Makers, we see many customers dealing with service outages that are difficult to diagnose and, more importantly, difficult to recover from quickly. Service outages have significant financial impact; it is therefore incumbent upon software architects and engineers to prove that the systems we build and deploy make it easy to detect and recover from failures.
Chaos Engineering & how it can help
Chaos Engineering is a disciplined approach to identifying system failures before they become full-blown outages. It involves hypothesizing failure scenarios, then deliberately sabotaging components of complex systems to cause service outages and remediate them.
Chaos Engineering’s end goal is to perform these acts of deliberate sabotage in production environments, to assure that the system avoids service outages in the very environment in which critical services are provided. Unfortunately, no matter how hard we try, it is nearly impossible to reproduce production environments and conditions in the lower environments typically available for validating and testing solutions.
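As a rough illustration of the hypothesize, sabotage, and remediate cycle described above, here is a minimal Python sketch. It is not a real chaos tool: run_experiment and the three callables it takes are hypothetical hooks, and the two-replica "service" is a toy stand-in for a real system.

```python
def run_experiment(steady_state_ok, inject_failure, rollback):
    """Run one chaos experiment and report whether the hypothesis held.

    All three arguments are caller-supplied callables (hypothetical hooks,
    not part of any real chaos engineering library).
    """
    # Never start an experiment against a system that is already unhealthy.
    if not steady_state_ok():
        return "skipped: system not in steady state"

    inject_failure()
    try:
        # The hypothesis: the system stays healthy despite the failure.
        held = steady_state_ok()
    finally:
        # Always undo the sabotage, even if the health check itself raises.
        rollback()

    return "hypothesis held" if held else "weakness found"


# Toy stand-in for a real system: a service backed by two replicas.
replicas = {"a": True, "b": True}


def healthy():
    # The toy service counts as healthy while at least one replica is up.
    return any(replicas.values())


def kill_a():
    replicas["a"] = False


def restore_a():
    replicas["a"] = True


result = run_experiment(healthy, kill_a, restore_a)
print(result)  # prints "hypothesis held"
```

In a real exercise the health check would query monitoring, and the injection and rollback steps would act on actual infrastructure. The point of the sketch is only the ordering: verify steady state, inject one failure, observe, and always restore.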
Why a careful, planned approach yields resilient systems
Of course, this doesn’t mean that you jump right in and chaotically induce component failures in production systems. A carefully planned approach to identifying failure scenarios, potential solutions, implementation, and then deliberate testing is the best way to build and maintain resilient systems.
Too many organizations begin their foray into Chaos Engineering with extremely aggressive hypotheses that result in a wide blast radius. When the extent of impact is that large, it is likely to create barriers to exploring further scenarios.
Right-sizing your hypotheses to minimize the blast radius, and creating scenarios to match, is a science that comes with experience. Decompose an initial hypothesis into small experiments that minimize the impact and the affected areas. Success requires extensive knowledge of all components involved in resiliency design. Expertise in all aspects of operational systems design prepares you for chaos activities that demonstrate the value of Chaos Engineering without causing harm to the organization.
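One way to picture this decomposition is the following minimal Python sketch. It is an illustration under stated assumptions rather than a prescribed method: the Experiment class, the plan_experiments function, the max_fraction cap, and the component and instance names are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Experiment:
    """One small, narrowly scoped experiment (hypothetical structure)."""
    hypothesis: str
    targets: list


def plan_experiments(components, max_fraction=0.1, seed=None):
    """Split a broad hypothesis into one small experiment per component.

    components maps a component name to a list of its instance ids.
    max_fraction is the share of each component's instances to target.
    """
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    rng = random.Random(seed)
    plan = []
    for name, instances in components.items():
        if not instances:
            continue
        # Target at least one instance so the experiment is not empty.
        # For very small components this can exceed max_fraction.
        count = max(1, int(len(instances) * max_fraction))
        targets = rng.sample(instances, count)
        plan.append(Experiment(
            hypothesis=(f"Service stays healthy when {count} of "
                        f"{len(instances)} {name} instance(s) fail"),
            targets=targets,
        ))
    return plan


fleet = {
    "web": [f"web-{i}" for i in range(20)],
    "db": ["db-0", "db-1"],
}
plan = plan_experiments(fleet, max_fraction=0.25, seed=1)
for experiment in plan:
    print(experiment.hypothesis, experiment.targets)
```

Each resulting experiment touches a single component and only a fraction of its instances, so a failed hypothesis affects a small, known set of targets instead of the whole system.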
Learn more: The Impact Makers Solution
Impact Makers’ Assess/Identify/Implement/Evaluate methodology is well-suited for introducing Chaos Engineering to organizations. An initial assessment phase involves conducting an inventory of systems, components, and architectures to establish the context in which systems are deployed. Subsequent steps take your organization through a managed process of preparing systems for deliberate sabotage in production, conducted as a “Game Day” exercise, to assure that systems behave as desired. The end goal is to ensure easy detection of, and faster recovery from, your organization’s future service outages.
To learn more, contact us.