Avoiding Disasters with Chaos Engineering

The cost of an outage to a business is very high. This isn’t news to anyone, but in today’s world of instant access and immediate gratification, it’s truer now than ever. Businesses today are expected to be accessible at all times to consumers.

As a result, businesses need to build resilient infrastructure and applications. Doing so helps you prevent avoidable outages before they become disasters.

While the need for resiliency is well known, newer techniques for achieving it are becoming the norm. One example is chaos engineering.

What is chaos engineering?

Chaos engineering’s roots can be traced back to Netflix’s Chaos Monkey tool, which was designed to test Netflix systems’ stability after moving to the cloud. To do this, Netflix would intentionally inject outages into the systems to see how the overall environment would react. The idea was that causing failures could help the IT organization to identify weaknesses in the system to ensure that they would not be issues in the future.

Chaos engineering grew from there. But just as Chaos Monkey was never only about chaos, chaos engineering is more scientific than its name suggests.

Chaos engineering is about conducting carefully designed experiments that test the hypotheses an IT organization holds about its applications and infrastructure. The idea is not to take down the entire system. Even so, the goal of chaos engineering is the same as the goal of Chaos Monkey: ensuring resiliency.

When is chaos engineering used?

It’s important to run experiments early. Ensuring applications and infrastructure are resilient before they are in production is vital. However, while chaos engineering can begin before an application or infrastructure goes live, it should continue to be used after environments go live.

This is because applications, infrastructure, and their interconnections are constantly changing: new applications are released, patches are applied, and environments remain fluid. Most applications have many touchpoints, both internal and external, and data is often shared between them. Any of these changes can alter the way an application works, which means experiments need to continue.

It is important to take a calculated approach to the experiments at all times, but especially when the applications and infrastructure are live. Chaos engineering experiments generally begin with a small footprint, in a contained environment. As resiliency is proven on a smaller scale, the size and severity of the experiments can grow to help determine where the breaking points lie, so that those breaking points can be eliminated or a contingency plan can be created.
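The "start small, then grow" approach can be sketched in a few lines of Python. This is a minimal illustration, not a real chaos tool: the function name `find_breaking_point`, the `survives` callback, and the default fractions are all hypothetical, and `survives` stands in for whatever check your team uses to decide that the system stayed healthy while a given fraction of it was disrupted.

```python
from typing import Callable, Optional, Sequence


def find_breaking_point(
    survives: Callable[[float], bool],
    fractions: Sequence[float] = (0.01, 0.05, 0.25, 0.5, 1.0),
) -> Optional[float]:
    """Run the same experiment at growing scope and stop at the first failure.

    `survives(fraction)` is a hypothetical callback: it disrupts the given
    fraction of the environment and returns True if the system stayed healthy.
    Returns the first fraction at which the system broke, or None if it
    survived every step.
    """
    for fraction in sorted(fractions):
        if not survives(fraction):
            # Stop escalating as soon as a weakness is found.
            return fraction
    return None


# Toy stand-in for a real system: it tolerates losing up to 30% of capacity.
def toy_system_survives(fraction: float) -> bool:
    return fraction <= 0.3


breaking_point = find_breaking_point(toy_system_survives)
print(breaking_point)  # 0.5, the first tested step above the 30% tolerance
```

Stopping at the first failure keeps the impact of each run as small as possible, which matters most when the environment is live.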

How does chaos engineering help avert disasters?

As highlighted above, an outage can be extremely costly for a business, and there are many points at which an application or infrastructure can falter. You must therefore account for everything you can to keep applications and infrastructure up.

Through chaos engineering, you run experiments that purposefully probe for where failures may occur. These experiments are informed by the assumptions you made when you set up an application or built an infrastructure. Those assumptions become the hypotheses of your chaos engineering experiments, which can be as simple as shutting down a server or finding out what happens when the CPU utilization on a server increases rapidly.
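To make the idea of "assumption as hypothesis" concrete, here is a minimal sketch of the simplest experiment mentioned above: shutting down one server. Everything in it is a toy, in-memory model, and the names `Server`, `Cluster`, and `run_experiment` are hypothetical. The hypothesis under test is "requests still succeed when one server is down."

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    servers: List[Server]

    def handle_request(self) -> bool:
        """In this toy model, a request succeeds if any server is healthy."""
        return any(s.healthy for s in self.servers)


def run_experiment(cluster: Cluster, victim_name: str) -> bool:
    """Shut down one server and report whether the hypothesis held."""
    # Confirm the steady state first; a failing baseline tells us nothing.
    if not cluster.handle_request():
        raise RuntimeError("steady state not met; aborting experiment")

    victim = next((s for s in cluster.servers if s.name == victim_name), None)
    if victim is None:
        raise ValueError(f"no server named {victim_name!r}")

    victim.healthy = False  # inject the failure
    try:
        return cluster.handle_request()
    finally:
        victim.healthy = True  # always restore the server afterwards


redundant = Cluster([Server("a"), Server("b")])
single = Cluster([Server("a")])
print(run_experiment(redundant, "a"))  # True: the second server covers
print(run_experiment(single, "a"))     # False: a single point of failure
```

The second run is the useful one: the experiment disproves the assumption and exposes a single point of failure before a real outage does.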

The idea is to identify faults and, where possible, build solutions so that they don’t result in outages in the future. In many cases, this may involve reaching out to vendors and partners to address findings on their end. These fixes may be difficult, but it’s important to find the faults before they become costly.

Taking a long-term view

At Perficient, we believe chaos engineering is important. In an ideal world, teams should have people dedicated to running these experiments. Businesses are slowly seeing this too, and the practice is becoming more commonplace. Building more resilient applications and infrastructure is good business, and worth the additional time and cost in the long run.
