Complex Systems Fail in Complex Ways

5 min read · Jan 17, 2025

In case you’ve not come up for air in the last 36 or so hours, CapitalOne is having a bad few days. I’m not going to retell the story, which is hopefully resolved by the time you read this, but something sticks out in my mind that I don’t hear discussed often enough when these types of incidents happen.

The Outrage

The outrage you see on the socials when a large company has an outage is inevitable. People will resort to shaming and, as one of my good friends likes to say, “talking out of their a**” when they understand very little about some of these highly complex systems. Complex systems like, for example, banking.

As someone who used to receive physical checks that I had to take to the bank to deposit, and then wait a few days to get physical cash in return, I’m still in awe of how far technology has come. I also understand, more than the average person on the street, just how complex the systems and processes are that take a paycheck from your employer and make it show up in your account where you’re able to debit against it. That’s magic. There aren’t just dozens of computer-based processes involved; in some cases there are thousands.

So when people make these outrageous “How dare they have an outage” comments, I both roll my eyes and perk up my ears. I always wonder if there’s something truly novel going on, or if someone tripped over a power cord and blacked out a data center that was pivotal for routing packets from point A to point X. More often than not, people who freak out have no idea how complex things are, and maybe they don’t care.

Actual Reality — Yeah, it’s “Complicated”

I’ve worked inside the banking system. I’ve done security for a company that was a big chunk of the process from when you swiped a credit card, to where a product magically showed up at your door. Sometimes the answers are simple — “The circuit that supports traffic from this city to that one was cut by a guy named Mike with a backhoe.” Sometimes though, and more often than not, the answer is “it’s complicated”.

I’d love it if we lived in a simple world. Company X makes a poor decision, and it results in outage Y. When that happens, sure, let’s go pummel them on social media until they learn their lesson (I’m kidding, obviously…). But the world just isn’t that simple. With hundreds of systems and processes, and thousands of different pieces of technology that all have to work just right to do what we want, failures are rarely simple. Luckily there is technology out there to help us troubleshoot and pinpoint issues quickly (disclosure: I happen to work for a company that does this). But if your network or applications aren’t instrumented in a way that makes this possible, you’re left guessing and checking. Oh, and then there are those 12+ hour conference bridges with dozens of people who all “try something real quick”, the result of which is a new problem. We fumble and stumble our way to a fix, and in some really scary cases things “just start working” and we’re not sure why. I’ve been a part of that too.

But complex systems like banking, logistics, healthcare, manufacturing, and retail supply chains can fail in minute or massive ways. Each time, we work hard to fix the problem. Afterwards, once we’ve had a few hours of sleep and some time to think, we go back, work out what lessons we should learn, and apply them to our principles and practices so we improve. Well, some of us do that. Some of us out there just shrug our shoulders, tell people “it’s complicated, sorry, we couldn’t have done anything about it” and move on.

Plan, Anticipate, Respond, Learn

The take-away from any big failure is the way we can improve. Here’s what you should be doing, or at least how I see it.

Plan

Set your organization, your infrastructure, and your applications up to be as resilient as you can. Resilience is important, but it is not a magic wand that makes outages go away. Plan for the day when things will go horribly sideways so that you have the telemetry, staff, and processes available to figure out where things broke, how, and what you’re going to do to remedy the situation.

Anticipate

Now, try your best to look at available telemetry and get ahead of the outage. If you start seeing “Zero Window” responses climbing for any particular endpoint of an application you can be fairly sure that call to the helpdesk telling you the “application is down” is coming. If you can anticipate outages and failures you’re not bulletproof, but you can tell your customers and partners that you’re on it before they even notice the issue. That’s worth its weight in gold in most situations. That capability to anticipate also requires you to do a lot of built-in information gathering, profiling, and analysis — stuff you can’t easily do after it’s all built and launched.
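To make the “Zero Window” idea a little more concrete, here is a minimal sketch of what getting ahead of the outage might look like. It assumes you already collect a count of TCP zero-window events per endpoint per time interval from your monitoring tooling. The function name, the endpoint names, the sample numbers, and the thresholds are all made up for illustration, so treat this as a starting point rather than a recommendation.

```python
from statistics import mean


def climbing_endpoints(counts_by_endpoint, recent=3, factor=2.0, min_events=5):
    """Return endpoints whose recent zero-window counts sit well above baseline.

    counts_by_endpoint maps an endpoint name to a list of per-interval
    zero-window counts, oldest first. The thresholds are illustrative
    assumptions, not tuned values.
    """
    flagged = []
    for endpoint, counts in counts_by_endpoint.items():
        if len(counts) <= recent:
            continue  # not enough history to form a baseline
        baseline = mean(counts[:-recent])   # older intervals
        latest = mean(counts[-recent:])     # most recent intervals
        # Flag only when the recent average is both non-trivial and
        # clearly above what this endpoint normally sees.
        if latest >= min_events and latest > factor * max(baseline, 1.0):
            flagged.append(endpoint)
    return sorted(flagged)


# Hypothetical sample data: one endpoint is climbing, one is steady.
samples = {
    "payments-api": [1, 0, 2, 1, 6, 9, 14],
    "login-api": [2, 1, 2, 2, 1, 2, 2],
}
print(climbing_endpoints(samples))
```

Comparing each endpoint against its own history, rather than one global threshold, is a deliberate choice here: a busy endpoint and a quiet one will have very different ideas of “normal”, and what you want to hear about is the change.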

Respond

When it goes south, whether it’s a catastrophic failure or a slow-down with minor impact, be responsive and decisive. Take action that remedies the situation, and do it as quickly as possible with sensitivity to all of the people who are going to freak out. Even a great response can be completely undone by poor PR. Remember that. Response is as much about how fast and accurate your fix is as it is about how well you explain it to those impacted.

Learn

If you learn nothing from an outage, no matter how big or small, you’re wasting an opportunity. There is always something to learn. Even if the outage is at a third party that you have no control over, there is still something you can learn and apply internally to be better. Never waste an opportunity to learn, see things from a different angle, and gain some new insights. As little kids many of us wanted to be Sherlock Holmes, and then we got into IT. Well…don’t give up on your dreams, go figure out that root cause and learn from it!

Sh!t happens, it’ll happen again

If you’re one of those people who make comments like “I’m never using that company again because they had an outage” — we probably won’t see eye to eye. Everything fails, on a long enough timeline, in certain high-stress situations. Nothing is perfect. Your bank will have a bad day. Or two. But they’ll make it right, and the next time something like what just happened presents itself they’ll be ready for it if they learned something.

Your car, your relationship, your dog, and your TV: all of these things will fail or have issues at some point. But the issue or outage isn’t the end of the relationship. It’s just another bump on the road of life, and remember, these things are complex. It’s happened before, it’ll happen again, and you’ll live through that one too.

Written by Rafal Los
