How to Keep an IT Mistake From Leading to a Cascade of Errors

Don’t let your IT organization fall victim to a runaway “error string.” Preventative planning can help you stop a possible cascade of flaws.

John Edwards, Technology Journalist & Author

May 24, 2023

Image: antique domino set tumbling on a marble chess board (mccool via Alamy Stock)

An “error string” describes a software scenario in which one negative result leads to another incorrect outcome, which then leads to a runaway string of inaccuracies. It’s a problem that more than a few IT leaders have already experienced.

“When one mistake can make it past your checks and balances system, it can easily lead to more mistakes being made,” warns Troy Portillo, director of operations for Studypool, an online learning platform.

Whenever a mistake is detected, the best policy is to fix it immediately, Portillo advises. “The longer you wait once a mistake has been brought to your attention, the higher the probability it will lead to a cascade of mistakes,” he says. “No system is perfect but acting quickly once you’re alerted to an issue is the best way to avoid future mistakes stemming from the original error.”

Root Causes

IT functions something like a circuit in which all the breakers in a line must be shut before a current can flow through it, explains Tommy Gardner, CTO at HP Federal. “Often, multiple errors have to occur sequentially prior to … a dramatic loss of data,” he says.

A cascading series of errors can occur when essential preventative steps are overlooked. “For example, continuity of operations can cause the IT team to focus on a perceived threat,” Gardner says. “With the time and attention of the staff diverted, perhaps the routine updates aren’t done.”

Perils of Poor Design

Error strings can often be traced back to poor system design. “If a system isn’t designed to properly handle errors between subsystems, a single mistake can cause the system to fail, leading to a series of errors,” explains Tom Chisholm, principal training solutions engineer at software developer Perforce.

“For example, if a web application doesn’t handle database connection errors correctly, a single error can cause the entire application to crash if other parts of the system aren’t resilient to such a failure,” Chisholm says. Poorly written database queries can also initiate a failure chain. “This can lead to database deadlocks, which in turn can lead to cascading failures in the frontend,” he notes.
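Chisholm's database example can be illustrated with a minimal Python sketch. It assumes a hypothetical web page whose optional recommendations panel reads from a SQLite file; the function names, table, and column are illustrative and do not come from any system described in the article. The point is only that the database failure is caught at the subsystem boundary, so the page degrades instead of crashing.

```python
import logging
import sqlite3

logger = logging.getLogger("homepage")


def fetch_recommendations(db_path: str, user_id: int) -> list[str]:
    """Return recommendation titles, or an empty list if the database fails."""
    conn = None
    try:
        # mode=ro opens read-only, so a missing file raises instead of
        # silently creating an empty database.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True, timeout=1.0)
        rows = conn.execute(
            "SELECT title FROM recommendations WHERE user_id = ?", (user_id,)
        ).fetchall()
        return [title for (title,) in rows]
    except sqlite3.Error as exc:
        # Contain the failure here: log it and hand back a safe default.
        logger.warning("recommendations unavailable: %s", exc)
        return []
    finally:
        if conn is not None:
            conn.close()


def render_home_page(db_path: str, user_id: int) -> str:
    """Render the page; the recommendations panel is optional."""
    titles = fetch_recommendations(db_path, user_id)
    if titles:
        return "Recommended for you: " + ", ".join(titles)
    return "Welcome back!"
```

With the database file missing, `render_home_page` still returns the plain greeting. Whether an empty result is an acceptable fallback depends on the feature; for data the page cannot do without, surfacing a clear error may be the better choice.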

Preventative Measures

Gardner believes that building a proactive defense is the best way to prevent a potential catastrophe. “IT teams should think through multiple problems at once, understand the limits and constraints of their system, and build structured protocols to adhere to on an ongoing basis,” he says. Gardner suggests training team members on IT best practices so small user errors don’t morph into larger problems.

Gardner also advises independently testing software, as well as scheduling regular code updates, to ensure that neither open-source nor proprietary software is harboring vulnerabilities.

Chisholm agrees. He suggests thoroughly testing software before launching it. “Find the single points of failure in a controlled environment before you find them in production,” he recommends.

Meanwhile, the best way to prevent or recover from cascading failures is to build fault tolerance into every subsystem and to periodically test for fault tolerance, Chisholm says.
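One common way to build that kind of tolerance into a call between subsystems is a bounded retry with a fallback value, exercised by a test double that injects faults on purpose. The Python sketch below is an illustration under stated assumptions, not a technique attributed to Chisholm; `call_with_retry` and `FlakyService` are hypothetical names.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    fallback: T,
    attempts: int = 3,
    base_delay: float = 0.01,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try operation up to `attempts` times, then return `fallback`."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            # Back off exponentially between tries, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback


class FlakyService:
    """Fault-injection double: fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures_left = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("injected fault")
        return "ok"
```

Running the retry helper against `FlakyService(2)` and `FlakyService(5)` in a scheduled test covers both the recovery path and the fallback path, which is one modest way to test fault tolerance periodically rather than discover its absence in production.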

Chisholm also recommends using monitoring tools to keep an eye on system health and performance. “Be proactive in addressing any issues that arise,” he states. “Additionally, regularly reviewing logs and metrics can help identify potential problems before they become major issues.”
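As a rough illustration of metric-based monitoring, the Python sketch below tracks the error rate over the most recent requests and flags when it crosses a threshold. The class name, window size, and threshold are assumptions chosen for the example; a production setup would normally rely on a dedicated monitoring tool rather than hand-rolled code.

```python
from collections import deque


class ErrorRateMonitor:
    """Track success/failure of the most recent calls in a sliding window."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        # deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Alert only when the recent error rate exceeds the threshold.
        return self.error_rate() > self.threshold
```

A caller would invoke `record()` after each request and check `should_alert()` on a timer, so a rising error rate is noticed while it is still a single fault and not yet a cascade.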

Breaking the String

Despite careful planning, a project can still fall victim to sequential errors. “System knowledge is the best way to avoid a string of errors,” Gardner counsels. Creating a break anywhere in the process can stop an error string in its tracks.
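In software, one widely used way to create such a break is the circuit breaker pattern: after repeated failures, calls to the troubled dependency are skipped for a cool-down period so the fault cannot keep propagating. The Python sketch below is a simplified illustration of that pattern, not something Gardner describes; the class names, thresholds, and injectable clock are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the dependency.
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers catch `CircuitOpenError` and fall back to a default or a cached value. Failing fast this way keeps a struggling subsystem from being flooded with requests while it recovers.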

Practice makes perfect, usually. Gardner suggests presenting IT team members with intentionally wild error string scenarios, and then challenging them to create effective ways of stopping them. “This can be a fun tabletop exercise and, if you can create collaboration across the security and IT teams, you’re better positioned to avoid vulnerabilities and product functionality issues,” he says.

Keep Calm and Carry On

Escaping from a cascading string of failures requires an unemotional and reasoned response. “Often, by the time you have a cascade, it’s too late to handle it quickly and gracefully,” Chisholm observes. “Moreover, you’re likely to be operating in panic mode, and rash attempts to halt the cascade may end up just making it worse.”

Chisholm’s advice: Step away, get some coffee or another beverage of choice, and breathe deeply. “Then slowly and rationally evaluate the cause of the cascading failures.”

Perhaps most important, once operations have returned to normal, is investigating exactly what went wrong. “Analyze the failure, and lack of fault tolerance that led to disaster, and update your systems so similar failures won’t take you down again,” Chisholm suggests.

What to Read Next:

Stress-Test Your Software to Prevent a Southwest-Type Calamity

Digital Twin Technology: Revolutionizing Product Development

Internal Network Security Mistakes to Avoid

About the Author(s)

John Edwards

Technology Journalist & Author

John Edwards is a veteran business technology journalist. His work has appeared in The New York Times, The Washington Post, and numerous business and technology publications, including Computerworld, CFO Magazine, IBM Data Management Magazine, RFID Journal, and Electronic Design. He has also written columns for The Economist's Business Intelligence Unit and PricewaterhouseCoopers' Communications Direct. John has authored several books on business technology topics. His work began appearing online as early as 1983. Throughout the 1980s and 90s, he wrote daily news and feature articles for both the CompuServe and Prodigy online services. His "Behind the Screens" commentaries made him the world's first known professional blogger.
