Code Black: navigating organisational overload

Introduction

“Code Black” indicates a situation where the influx of problems exceeds what the organisation can handle. Sometimes, disaster strikes, and you just have to ride it out. It’s interesting to take a closer look at what happens in such situations. In my career, I’ve been part of multiple teams handling large disruptions, mainly at telco-style organisations with telco-sized challenges.

The techies’ response

One thing that always struck me is how the techies become hyper-focused, working as fast as they can to solve the issues at hand. Unexpected brilliance emerges out of nowhere, and existing interpersonal issues dissolve magically. Still, that may not be enough. If the team cannot see the solution or lacks insight, they risk running in circles. This is when an outside perspective becomes crucial.

Management’s role—and its limits

Interestingly, management often believes they can provide that outside perspective. However, their view is clouded by politics, economics, panicking customers, and other non-technical considerations. They suddenly require a lot of information and start distracting the people actually working on a solution, rather than limiting themselves to bringing coffee and pizza and keeping other disruptions at bay. As a result, the Operations team’s door is often locked from the inside to keep management out.

What’s needed is someone with a laser-sharp focus on the technical issue at hand. Fortunately, your management likely hired someone specifically for this job: the architect/designer of the solution (which is why I firmly believe you should always have your designers in-house).

Triage: the first stage of Code Black

Just like in the TV series, the first stage in a Code Black situation is to introduce a triage system: what needs to be done now, what can wait, and what is beyond saving. In IT, this translates to decisions about how long you will continue trying to solve the problem(s), which issues are parked (most of the management-related issues mentioned above), and when you need to decide that the problems may no longer be fixable within the agreed timelines, requiring the Disaster Recovery (DR) scenario to be invoked.
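The triage rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Issue` fields, the `triage` function, and the idea of comparing elapsed time plus an estimated fix time against the agreed limit are hypothetical simplifications, not a description of how any particular organisation does it.

```python
from dataclasses import dataclass
from enum import Enum


class Triage(Enum):
    NOW = "work on it now"
    PARKED = "can wait"
    INVOKE_DR = "not fixable within the agreed timeline: invoke DR"


@dataclass
class Issue:
    name: str
    blocks_service: bool        # hypothetical flag: does this keep the business offline?
    estimated_fix_hours: float  # hypothetical estimate supplied by the engineers


def triage(issue: Issue, hours_elapsed: float, agreed_limit_hours: float) -> Triage:
    """Classify one issue: park it, work on it, or escalate to DR."""
    if not issue.blocks_service:
        # Management-style questions and side issues are parked.
        return Triage.PARKED
    if hours_elapsed + issue.estimated_fix_hours > agreed_limit_hours:
        # The fix would land after the agreed deadline.
        return Triage.INVOKE_DR
    return Triage.NOW
```

In practice the estimate is the weak spot: as noted further on, engineers tend to believe they are always one step away from the fix, so the deadline rather than the estimate should carry the weight.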

The trade-off: Recovery vs. Disaster Recovery

Your best people will be working on the recovery after a disaster. However, it is the same group of people that is needed to start up DR. So, invoking the DR scenario implies taking away crucial resources from the problem at hand in order to get DR up and running. The direct result is a significant slowdown of the existing recovery process in favour of bringing the business back online.

DR also implies opening up significant financial resources to start procuring new equipment (if you can get it). Even so, the loss of revenue, the loss of goodwill, and the payment of damages may quickly outweigh the direct cost of ‘stuff’.

In addition, DR may also imply clearing out existing non-crucial systems to make room for the DR solution. This will disrupt other parts of your company, such as development and innovation.

The decision-making process

Who gets to make that decision is interesting in itself. Direct management is afraid of the financial impact on their budget. Engineers always think they are close to the solution and want to try just one more thing. Most of them intuitively feel and understand the impact of going into DR. Most of the time, it is the outsider who makes the call. Sometimes that is the architect, and sometimes upper management decides to bite the bullet.

Now what?

As soon as the decision is made, you need a crystal-clear run book on what to do next. DR is not another troubleshooting exercise; it is a military-style operation where everyone knows exactly what needs to be done, who is doing what, and in which order. Deliberations on financial options, risks, and possibilities have no place here. That was done when the DR scenario was drafted. Now, action is needed—swift and decisive action. Collateral damage is expected, accepted beforehand, and should not bother anyone.
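To make "who is doing what, and in which order" concrete, a run book can be written down as data rather than prose. The sketch below is a minimal illustration: the step names, owners, and dependencies are invented for the example, and `execution_order` is a hypothetical helper. It uses the standard-library `graphlib` module (Python 3.9+) to derive an order in which every step comes after the steps it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical run book: step -> (owner, steps that must be finished first).
RUN_BOOK = {
    "declare DR": ("incident lead", []),
    "free up capacity": ("platform team", ["declare DR"]),
    "restore data": ("storage team", ["free up capacity"]),
    "start services": ("operations", ["restore data"]),
    "redirect traffic": ("network team", ["start services"]),
}


def execution_order(run_book):
    """Return (step, owner) pairs in an order that respects all dependencies."""
    graph = {step: set(deps) for step, (_owner, deps) in run_book.items()}
    ordered_steps = TopologicalSorter(graph).static_order()
    return [(step, run_book[step][0]) for step in ordered_steps]


for step, owner in execution_order(RUN_BOOK):
    print(f"{owner}: {step}")
```

Writing the run book this way forces the ordering and ownership questions to be answered when the DR scenario is drafted, not during the crisis.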

Crisis management and leadership

DR is crisis management. This is generally not the type of leadership your existing management team can provide. These people are hired to make conscious choices, limit risk, and optimise the status quo. Crisis management requires a completely different leadership style.

As a result, DR needs to be started in complete parallel to the continuing troubleshooting process and positioned directly under top management. The existing management team can continue troubleshooting, although with a smaller and less experienced team. After all, they may actually solve the problem, at which point the DR scenario can be abandoned.

Where is the technology discussion in this?

Although it may look like a technical problem that needs to be solved, this is neither the time nor the place for a discussion about technology choices. In a DR scenario, the existing solution is made to function again. After all, it worked up to that moment.

That does not mean a technology discussion is irrelevant; it just does not belong here and now. That is something for another blog.