Act 1
The Setup
Imagine being audited by the IRS. Every minute, every day, 365 days a year. Stress builds and anxiety deepens, relieved (but only momentarily) when daily reports come back free of incident.
That’s what it is like to work in Production Support. Audits of one sort or another (formal and otherwise) and the incident reports that sprout from them have become the new normal in the age of “everything tech.” In our world, incidents mean smart phone apps that don’t work, super-slow websites, social media platforms that are down, and more. And our “auditors” number in the thousands, maybe even millions (if we’re “lucky”).
In our media-blitz environment, incidents get noticed, reported, and shared. These “disasters” – like being unable to upload a picture of your dog rolling over – are treated like breaking news. The size of our audience and the extent of the inconvenience these incidents create are what drive the swiftness and severity of the backlash.
Rarely do such incidents rise to the level of unrecoverable catastrophe. Or so you’d think. But keep in mind: the only thing that has advanced as far and as fast as technology is user expectation. An incident is not narrowly defined as the outright failure of a product or service; it also includes any reduction (however small) in the quality of that product or service. In other words, the frustration end users experience qualifies as an incident.
Incidents may now be a “normal part of life in the age of tech,” but that doesn’t mean we should lower our standards to accommodate this new normal. Most of us agree that we are capable of learning from our mistakes and becoming more adept at avoiding them in the future.
But I’m getting ahead of myself. Let’s examine a handful of fictional scenarios to get a sense of the incidents that can arise, and then evaluate how they were handled.
Act 2
Scenarios, Confusion, and Restoration
If you’ve ever fallen victim to a favorite smart phone app that simply refuses to respond, you’re in very good company. Despite the vast population of similarly situated victims, your own response is often one of personal affront: “Why does this happen only to me?” When the dust settles and you consider the situation more rationally, you likely begin to wonder what should be happening during this time of inconvenience. (Let’s assume the unresponsive app limits your ability to communicate in an emergency, rather than your ability to play a video game.)
Based on my years of experience, I can assure you that incidents of this type and complexity generate significant confusion on the part of the production support folks working behind the scenes to recover your app’s lost functionality.
Why would there be confusion? Let’s review a list of thoughts and questions that usually barrage any team during an incident response.
1. What happened?
2. How could this have happened?
3. Who’s responsible?
4. What can we do to make sure it doesn’t happen again?
5. How quickly can we get an incident report out for this?
6. What do we have to do to get our app back online as quickly as possible?
At the start of the inquiry, only one of the above truly matters, and only that response is relevant during an active incident: number 6, “What do we have to do to get our app back online as quickly as possible?” It’s the only question that supports the app’s sole reason for existence: to let people communicate digitally by text or voice.
Especially in an emergency, the app must be available. And when it’s not, our users feel isolated, frustrated, and angry. In our production support roles, we endure similar feelings because we’ve failed to deliver on the app’s original intent.
What can we learn?
Let’s continue our review with a few more fictional incidents to see what we can learn from the information that was shared with the public about each one.
- A large and prestigious web blog hosting company suffered a complete outage. Fortunately, communications and status updates were timely and transparent throughout the event. Additionally, the events leading up to the outage were disclosed in a public blog post.
  - Incident-creating action – a software change was developed and deployed to all hosted sites at once.
  - Insights – knowing that the software change was deployed to every site at once reveals several planning deficiencies:
    - The software change was not adequately tested.
    - The software change should have been rolled out to a subset of sites first (see the sketch after this list).
    - The rollback plan was inadequate; following the incident, there was a long delay in restoring all sites to their pre-incident state.
- A large search engine company suffered a complete outage of all API functionality. All sites, applications, and clients calling or depending on this API were impacted.
  - Incident-creating action – the change was pushed out using automation.
  - Observation – the rollback process took a long time to complete.
  - Insights – the incident report was shared publicly and revealed the following:
    - The software change was not adequately tested.
    - The rollback plan was inadequate.
- A very large company’s cloud-based desktop computing applications became unavailable.
  - Incident-creating action – the sudden outage resulted from the release of new security features.
  - Insights – while no formal incident report was released, knowledge of the security update was enough to extrapolate the following:
    - The software change was not adequately tested.
- A large online video streaming provider experienced an outage in several key regions.
  - Incident-creating action – an infrastructure change was underway in a number of regions.
  - Insights – a formal incident report disclosed that the pre-approved response plan was not adhered to.
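
A recurring thread in these incidents is that a change went everywhere at once, leaving no small blast radius to contain the damage. As a minimal sketch of the alternative (assuming hypothetical deploy_to and healthy helpers that stand in for whatever your platform actually provides), a phased rollout might look something like this:

```python
import time

# Hypothetical helpers -- stand-ins for whatever your platform provides.
def deploy_to(site: str, version: str) -> None:
    print(f"deploying {version} to {site}")

def healthy(site: str) -> bool:
    # In practice: check error rates, latency, and synthetic probes.
    return True

def phased_rollout(sites: list[str], version: str, batch_sizes=(1, 5, 25)) -> bool:
    """Roll a change out in progressively larger batches.

    Halts (and reports failure) as soon as any batch looks unhealthy,
    so most sites never receive a bad change.
    """
    remaining = list(sites)
    for size in batch_sizes:
        batch, remaining = remaining[:size], remaining[size:]
        for site in batch:
            deploy_to(site, version)
        time.sleep(1)  # soak time; minutes or hours in real life
        if not all(healthy(site) for site in batch):
            print(f"batch of {len(batch)} sites unhealthy -- halting rollout")
            return False
        if not remaining:
            break
    # Everything left gets the change only after every earlier batch stayed healthy.
    for site in remaining:
        deploy_to(site, version)
    return True

if __name__ == "__main__":
    phased_rollout([f"site-{n}" for n in range(40)], "v2.0.1")
```

The specific batch sizes don’t matter; what matters is that the bulk of the fleet never sees a change until a smaller group has already proven it healthy.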
Act 3
Magical Planning vs. Experiential Wisdom
In each of the incidents described above, the investigation led to the assignment of a root cause. In many cases, that root cause suggests a single individual’s action or inaction was responsible for the incident. (As an aside, remember that assigning responsibility should not be an inquiry’s first step. As we noted earlier, determining what we need to do to get our app back online is the priority.)
That said, as we examine the commonalities across these incidents, it becomes clear that our assignment of a root cause jumped the gun – the sole exception being the lack of adherence to the response plan. Even in that instance, the pre-approved plan should not have granted any individual the administrative credentials to implement the changes in the first place. Had those change permissions been configured correctly, the ability to ignore the plan would have been significantly diminished.
The common thread here is a general lack of planning. I call it “magical planning,” which means that teams are hoping for the best while failing to anticipate or prepare for the worst. Why is magical planning such a popular approach to implementing change? Simple. It’s easy, and it doesn’t cost a thing (at least not up front). In fact, it’s not really planning at all. With magical planning, you’re merely hoping that magic will fill in the performance gaps that will inevitably occur. But when it’s not your lucky day, the gaps remain unfilled.
Experiential wisdom delivers a better approach:
| Magical Planning | Experiential Wisdom |
| --- | --- |
| Inadequate pre-change testing | Anticipate incidents, document the response process, implement training |
| Inadequate scoping of the change | A phased-in approach to a subset of users to mitigate the impact |
| Inadequate post-change back-out plans | A thorough back-out plan that restores service quickly (sketched below) |
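
To make the back-out row of the table concrete, here is a minimal sketch of a scripted back-out, assuming hypothetical current_version, set_version, and healthy helpers (none of them come from the incidents above). The idea is to record the known-good version before the change so that restoration is a single rehearsed step rather than an improvisation:

```python
# Hypothetical helpers -- replace with your deployment tooling.
_deployed = {"version": "v1.9.8"}  # pretend state store

def current_version() -> str:
    return _deployed["version"]

def set_version(version: str) -> None:
    print(f"switching traffic to {version}")
    _deployed["version"] = version

def healthy() -> bool:
    # In practice: compare error rates and latency against pre-change baselines.
    return False  # simulate a bad release so the back-out path runs

def change_with_backout(new_version: str) -> bool:
    """Apply a change, but record the known-good version first so that
    backing out is one quick, rehearsed step instead of an improvisation."""
    known_good = current_version()
    set_version(new_version)
    if healthy():
        return True
    print(f"{new_version} unhealthy -- restoring {known_good}")
    set_version(known_good)  # service restoration comes first
    return False             # root-cause analysis can happen afterwards

if __name__ == "__main__":
    change_with_backout("v2.0.0")
```

Note that the script restores service first and leaves the question of why for later, which is exactly the priority argued for above.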
Earlier I outlined six questions an organization might ask during an incident. I’m confident there are other workable responses, but of the six I offered, only one was truly useful in the moment: the one aimed at service restoration (getting back online as soon as possible).
If you are involved in an incident response, be sure to show compassion to your incident responders. Your shared objective is service restoration. And remember, when you find yourself in the throes of a service disruption, set aside your eagerness to identify the root cause, determine who’s at fault, or work out how to prevent future disruptions. None of those responses will help your customers in the short term.
Service disruptions just like the ones we described above are preventable. Research suggests that strong, documented processes – supported by regular training and situational practice – could eliminate 80 percent of today’s IT incidents. Craftsmen and women rarely blame their tools; in other words, architecture and equipment may play a role in the flaw or the solution, but the source of the problem frequently lies elsewhere.
When you plan with honesty and inclusiveness, your teams will be prepared even for the unanticipated incident. Share your war stories so that others can learn from earlier mistakes. In the heat of the moment, focus on service recovery to ensure you are delivering available and performant systems for your customers.
There’s no way to avoid all incidents; admittedly, they have become a “normal part of life in the age of tech.” But we can resist this new normal by crafting incident response processes, documenting them, and training on them. If we learn from such incidents – whether catastrophic or not – our customers’ reactions will shift from frustration to gratitude.