Handling Major Incidents

I’ve blogged previously about Incident Management, and no discussion of Incident Management would be complete without mentioning Major Incidents.

First of all, let me offer a definition: a Major Incident is any Incident that has a significant or substantial effect on part or all of the business.

Leaving aside the issue of VIPs (a stuck keyboard belonging to your Chief Executive causing panic on the help desk), we can say that Major Incidents usually affect significant numbers of employees and involve important enterprise-level services.

So how do we manage Major Incidents? The answer is ‘we do all the stuff that we normally do’, plus some other things. So we make sure that we log the Incident properly, use the CMDB as required, and make the best initial assignment that we can. If you are a Serio user, presumably you’ve set up a Broadcast Alert. Broadcast Alerts were specifically designed for Major Incidents, and can be used to send emails about important Incidents to lots of people. You might also use the Serio Text Message Gateway to send text (SMS) messages.

Coming back to the ‘other things’ I mentioned above, this is where your Incident Manager (you have one, right?) takes a lead. What follows could, with a little effort, form the basis of a Major Incident Procedure.

  1. If you can, make a rough estimate of how long the Incident will last, or more accurately, how long the missing service will be unavailable. You might be reluctant to do this, as people have a tendency to hold you to rough estimates at times of stress, but you should do it anyway.
  2. Inform the key stakeholders. By this I mean do more than send them an automated email: use the CMDB to identify the affected parties, and let them know about the Incident. Don’t assume they know. Give them your estimate from step 1. This way, they will know whether it is worth starting manual procedures, and it will help them deal with their customers.
  3. If you are a Serio user, post the Incident to the Service Status website, as that is what it is for. Post updates on the Incident there during the day; your customers will appreciate it.
  4. Inform your Problem Manager (you’ve got one, right?). I’ve blogged about Problem Management before here and here (and other places), and we have a Problem Management white paper on that subject for download.
  5. Once the Incident is resolved, perform a review. Analyse the Incident from different perspectives, which should include:
  • Could the Incident have been avoided in the first place?
  • What was the estimated cost to the business of the Incident?
  • How well did we perform in restoring the missing service to users?
  • Did we communicate effectively, both between ourselves and our customers?
  • How well did our internal documentation perform – for instance, our recovery documentation?
  6. Report your findings clearly, with recommendations for the future.
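
If you want to track the procedure above in a script or small tool, it could be sketched roughly as follows. This is only an illustration, not part of Serio: the class name MajorIncident, the step labels and the method names are all hypothetical, chosen to mirror the six steps in the list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical labels paraphrasing the six steps of the procedure above.
PROCEDURE_STEPS = [
    "estimate_outage_duration",
    "inform_key_stakeholders",
    "post_to_service_status_site",
    "inform_problem_manager",
    "perform_review",
    "report_findings",
]


@dataclass
class MajorIncident:
    """Illustrative record of a Major Incident and its checklist progress."""

    summary: str
    started_at: datetime
    estimated_outage: Optional[timedelta] = None  # the rough estimate from step 1
    completed_steps: List[str] = field(default_factory=list)

    def complete(self, step: str) -> None:
        """Mark a procedure step as done; reject labels not in the checklist."""
        if step not in PROCEDURE_STEPS:
            raise ValueError(f"unknown step: {step}")
        if step not in self.completed_steps:
            self.completed_steps.append(step)

    def outstanding(self) -> List[str]:
        """Return the steps still to do, in procedure order."""
        return [s for s in PROCEDURE_STEPS if s not in self.completed_steps]

    def expected_restoration(self) -> Optional[datetime]:
        """Time to quote to stakeholders, or None if no estimate exists yet."""
        if self.estimated_outage is None:
            return None
        return self.started_at + self.estimated_outage
```

The Incident Manager could then call outstanding() at any point to see which steps remain, and quote expected_restoration() to stakeholders when carrying out step 2.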