[System|Toolbox] Tools for the Art of System Administration
Policies and Procedures, Part 8 - Incident Management Process
Chris Campbell
Thursday September 08, 2011 01:00 AM
In part seven of this series we completed a three-article cycle on the lifecycle of a server and the processes that accompany each phase. Here in part eight we begin looking at production processes, starting with the Incident Management process.

Why an Incident Management Process?

                When an emergency occurs, it is hard to know what to do first.  On the other hand, it is impossible to write a process that accounts for every emergency.  There are many process standards, such as ITIL and Six Sigma, but they can be difficult to adopt and, particularly in small and medium-sized organizations, difficult to adapt to the environment.  What we will review here is a basic ITIL-based adaptation designed for moderately sized organizations.

Without an Incident Management Process:

                Without a standardized process, the system administration team simply goes ahead and deals with what appears to be a service disruption.  Say services aren't running on a critical application server; the team is notified and responds.  Restarting the service doesn't work, so the team tries a reboot.  Meanwhile, the application administrator has received a call about the application being down, has already connected to the server, and is trying to fix the issue with a vendor patch.  The reboot corrupts the patch install and leaves the application in limbo between the patched and unpatched versions, so the server has to be reinstalled.  The need for redundancy becomes apparent, but ultimately the system administrator gets in trouble for not knowing what the application administrator failed to communicate - that he was working on the system.  Sure, the application administrator should have said something - but he has the ever-wily response, "Well, they should have asked..."  An Incident Management process would have avoided this confusion.

Process:

                The process depends on your organization.  I'd recommend steps similar to the following.

                First, set a date with everyone involved to review a standard set of response procedures, along with the incident review meetings and reports that accompany them (examples below).  For these processes to work, they must become standard policy, followed by all relevant groups within the organization.

                This policy should define the remediation steps and decision point actions for the following components constituting the service disruption and incident response life cycle:

-         Incident Discovery

-         Incident Evaluation and Restoration Planning

-         Internal Support Staff Notification

-         Client Notification

-         Incident Documentation

-         Incident Escalation

-         Incident Resolution

-         General Communications Steps

-         Service Disruption Report

-         Monthly Incident Review Meeting and Process Modification

 

Service Disruption Policy at a Glance

 

                These procedures and processes should evolve as capabilities are added and modifications are made.  For the moment, however, the following steps are required when an incident develops.  Because every event is different, and because there is a natural conflict between taking time to document and communicate an issue and taking time to solve it, individual judgment must be applied to each incident to decide what happens when.

 

1.       Begin filling out the Incident Log with the date and time of events up to that point.

2.       Notify the user support group of the situation.

3.       Assess the situation and begin information gathering and troubleshooting.

4.       Bring on other internal resources if necessary including the system technical contacts.  Hand off the issue to the application support group if the issue appears to be application based.

5.       Update the Application Status Web Page or Phone Line (if you have one) with information on the systems affected. 

6.       Notify the next level of management.

7.       If the outage persists, hold a status review conference with the system owner, management, and technicians.

8.       Attend monthly meetings to review incidents and make process changes.

9.       Issue a monthly report for the head of the organization.

 

Defining a Service Disruption

 

                For the purposes of these procedures, a service disruption is any planned or unplanned systems downtime that lasts longer than 15 minutes or has an immediate or serious impact on an organization-wide system or IT-based service during usage hours.
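The definition above can be expressed as a simple check.  This is only a sketch of one reading of it (downtime over 15 minutes, or a serious impact during usage hours); the function and parameter names are hypothetical and not part of any existing tool.

```python
from datetime import timedelta

# The 15-minute threshold comes from the definition above;
# every name in this sketch is illustrative.
DISRUPTION_THRESHOLD = timedelta(minutes=15)


def is_service_disruption(downtime, serious_impact, during_usage_hours):
    """Return True if an event meets the service disruption definition.

    downtime: how long the system was (or is expected to be) down.
    serious_impact: True if there is an immediate or serious impact on an
        organization-wide system or IT-based service.
    during_usage_hours: True if the event falls within usage hours.
    """
    # Planned or unplanned, anything longer than the threshold qualifies.
    if downtime > DISRUPTION_THRESHOLD:
        return True
    # Shorter events qualify only if they hit hard during usage hours.
    return bool(serious_impact and during_usage_hours)
```

Under this reading, a five-minute email outage in the middle of the working day still counts, while a five-minute outage at 3 AM does not.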

 

Typical incidents could include:

-         An organization-wide service such as email, domain or the network becoming unavailable.

-         Losing a departmental service such as PeopleSoft.

-         Server or communications issues that result in severely degraded application services.

-         The detection of a large-scale network attack, virus infestation, or other disruptive event.

 

                Although departmental application downtime may have less effect than an organization-wide systems disruption, the procedures in this guide should still be followed.

 

Roles and Responsibilities

 

                Each service disruption is assigned to an Incident Manager for the duration of the event.  This role can be reassigned as circumstances warrant (for example, from the server team to the application administrator).

 

                In addition to the Incident Manager, a Communications Manager will also be assigned if escalation warrants.  These parallel efforts reflect the need of the Incident Manager to stay focused on resolving the incident.

 

Incident Manager - This role is responsible for maintaining ownership of all aspects of the event.  This includes assessment, troubleshooting, escalation, logging, resolving, and closing out the event.  By default, these duties will be performed by the discovering on-call technician, but can be reassigned by management.

 

Communications Manager - This individual is responsible for communicating all aspects of the incident to individuals inside and outside of the IT department.  This includes other support staff, IT organizations and the customer community.  In some cases this role is handled by the Incident Manager if it is a relatively simple and short-lived disruption.

 

                This individual works with the Incident Manager, obtaining and posting status updates and estimated time of service restoration.  Communications include but are not limited to:

-         Individual phone calls

-         Coordinating conference calls

-         Posting information on the web

-         Email broadcasts

-         Voice-mail broadcasts

 

Unplanned Service Disruption Lifecycle

 

Incident Discovery

 

                The initial detection of incidents can originate from a variety of sources: IT department staff might be paged, the user support group might notice the problem, or a user of the system might report it.

Ideally, notification will first come as a predictive failure via an event management system such as Nagios or Altiris.   Regardless of the source, all incidents must start a standard process that will focus on resolving and documenting the problem throughout the life of the event.

 

It is the responsibility of the IT department employee who discovers or is notified of a service disruption to assume the role of Incident Manager and initiate the following procedures:

 

-         Note the date and time of the initial discovery

-         Verify the nature and extent of the service disruption.  For example, is it limited to a single user or workstation or a global disruption?  Is the system up but the network down?  Who is affected?

-         Once an assessment has been made, directly notify the user support group supervisors and the management of the server team.

-         Begin filling out the Service Disruption Log

*Note: A voice mail or e-mail message is not considered direct communication.  The discovering employee should speak directly with the next person in the chain or escalate it until an individual is notified in person.
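The direct-communication rule in the note above amounts to walking an escalation chain until a person actually answers.  The following is a minimal sketch; the chain, the contact names, and the reach_in_person callback are all hypothetical placeholders for whatever contact method your organization uses.

```python
def notify_directly(chain, reach_in_person):
    """Walk the escalation chain until someone is reached directly.

    chain: ordered list of contact names, nearest level first.
    reach_in_person: callable taking a contact name and returning True only
        if that person was actually spoken to.  Leaving a voice mail or
        sending an e-mail should return False.
    Returns the first contact reached, or None if nobody was reached.
    """
    for contact in chain:
        if reach_in_person(contact):
            return contact
    return None


# Illustrative usage: the supervisor does not pick up, the manager does.
answered = {"user support supervisor": False, "server team manager": True}
reached = notify_directly(list(answered), answered.get)
```

A None result means the whole chain was exhausted, which is itself worth recording in the log.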

 

 

 

Incident Evaluation and Restoration Planning

                The aim of the incident evaluation and restoration planning process is to identify how the incident can be understood and resolved as quickly as possible.  The Evaluation and Restoration form covers the following:

 

-         State the "problem" as known at this time.

-         Contact other appropriate staff, communicate the problem, and solicit input.

-         Assign and contact the Communications Manager.

-         Break down the incident, detailing components, interfaces, and likely causes.  Verify or rule out causes.

-         Break out and assign subtasks.  Correlate probable options and begin applying remediation.

-         Log all activities and communications throughout the event.

 

Internal Support Staff Notification

 

                Using a contact list of individuals by functional area of responsibility, initiate contacts via pager, cell phone, or other interactive means.  Set up and schedule conference calls if necessary.  Define requirements for either on-site or off-site presence.

 

Client Notification

 

                Either the Incident Manager or the Communications Manager will contact the system owner defined in the service SLA.  Together they will define and craft the communications strategy for alerting the users of the system.

 

                Information will be posted on the System Availability website and either voice mail or email communications will be discussed with management.  In the case of significant systems downtime, a predefined calling tree dialing system should be activated. 

 

Incident Documentation

 

                Using the forms found in the Appendix, the Incident Manager will keep documentation and logs up to date.
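If the log is kept electronically rather than on paper, it can be as simple as a list of timestamped entries.  The sketch below is one possible shape, not a prescribed format; the class and field names are assumptions chosen to mirror the Incident Log form (date/time, and event or people contacted).

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class LogEntry:
    when: datetime
    event: str  # event or people contacted


@dataclass
class IncidentLog:
    filled_out_by: str
    entries: list = field(default_factory=list)

    def record(self, event, when=None):
        """Append an entry, stamped with the current time unless `when` is given.

        Passing `when` allows back-filling events that happened before
        logging started.
        """
        entry = LogEntry(when or datetime.now(), event)
        self.entries.append(entry)
        return entry

    def timeline(self):
        """Return entries in chronological order, regardless of entry order."""
        return sorted(self.entries, key=lambda e: e.when)
```

Allowing an explicit timestamp matters because the first step of the policy is to fill in the events that occurred up to that point, which by definition are logged after the fact.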

 

Incident Escalation

 

                If management escalation is required during the disruption, the Incident Manager will either contact the next management level or ask the Communications Manager to do so.  Situations where upper management learns of a serious disruption from sources outside their organization are to be avoided.

 

Incident Resolution

 

                The Incident Manager and response team should hold a review meeting at regular intervals to discuss progress, update the restoration plan, and document the process.  The interval between review meetings should vary depending on the amount of activity occurring. For example, during an intense phase of investigation, the meetings should be more frequent than during a later period when the staff is simply waiting for confirmation that the resolution actions have been successful.

Regular management updates should be provided to keep all parties informed, discuss progress, and decide if escalation is required. The Incident Manager should facilitate all management team progress meetings.

 

                When service has been restored, the Communications Manager is responsible for posting this information on Pulse and following up with anyone contacted during the event.

 

General Communication Steps

 

                The objective of the communications plan in Appendix A is to provide direction to the Communications Manager.  It should cover:

-         Who needs what level and frequency of information?

-         Contact details for all parties affected and in the management chain.

-         The different types of updates that will be required for the different groups affected by and working on the incident.

-         How often each type of update is required.

-         Who is authorized to sign-off on content?

-         The number and membership of the conference calls.
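The questions above (who gets what level of detail, how often, and who signs off) map naturally onto a small table that the Communications Manager can check against the clock.  This is a sketch under assumed names; the audiences, intervals, and approvers shown in any usage are examples, not recommendations from the plan itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Audience:
    name: str
    detail_level: str        # e.g. "summary" or "technical"
    update_every: timedelta  # how often this group expects an update
    approver: str            # who is authorized to sign off on content


def audiences_due(plan, last_sent, now):
    """Return the names of audiences whose next update is due at `now`.

    plan: list of Audience entries.
    last_sent: dict mapping audience name to the datetime of its last update.
        An audience with no entry has never been updated and is always due.
    """
    due = []
    for audience in plan:
        previous = last_sent.get(audience.name)
        if previous is None or now - previous >= audience.update_every:
            due.append(audience.name)
    return due
```

Keeping the approver alongside each audience means the person sending an update does not have to hunt for who can sign off in the middle of an outage.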

 

 

Service Disruption Report

                The incident owner will complete the Service Disruption Report form within two business days of the restoration of services.  The report will include the affected service, the nature of the disruption, the reason for the disruption, the length of time between onset and resolution, and any recommendations for future improvement.
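Two of the values in this report are simple date arithmetic: the report deadline and the outage duration.  A minimal sketch follows, assuming that business days are Monday through Friday and ignoring holidays; the function names are hypothetical.

```python
from datetime import date, datetime, timedelta


def add_business_days(start, days):
    """Return the date `days` business days after `start`.

    Weekends are skipped; holidays are not considered in this sketch.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def report_due(restored_at):
    """Date by which the Service Disruption Report should be completed."""
    return add_business_days(restored_at.date(), 2)


def outage_duration(went_down, restored_at):
    """Length of time between onset and resolution."""
    return restored_at - went_down
```

For a service restored on a Thursday, the weekend is skipped and the report falls due the following Monday.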

 

Incident Review and Process Modification

 

                Monthly meetings will be held to review service disruptions and how they were responded to and resolved.  All the documented incidents of the previous month will be discussed and the following agenda items will be addressed:

 

-         If possible, the Incident Manager reviews the incident and the major events logged.

-         Discuss the success of the overall response, the issues affecting system recovery, the quality of communications, and a general summary of what was learned.

-         Discuss why the incident occurred in the first place. What were the root causes? Is a recurrence likely and, if so, what can be done to prevent it? Document any actions identified for follow-up.

-         Discuss whether the issue was detected soon enough. Could it have been recognized sooner?

-         Underlying issues and corrective actions identified during the major incident review should be recorded and tracked by problem management. It is important that problem management have the authority and management backing to ensure that the corrective actions are progressed by all parties.

-         A management report will be developed to provide a summary of the documented incidents. 

 

Planned Service Disruption Policy

Maintenance and planned service downtime windows require advance notice to the customers affected.  It is our goal to provide at least a week's notice of systems downtime.  The distribution list should include clients, internal IT department staff, and the staff of other affected IS organizations.
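The one-week goal is easy to verify when a maintenance window is scheduled.  The check below is a sketch only; the names are illustrative, and the minimum is a parameter so it can be adjusted to local policy.

```python
from datetime import datetime, timedelta

# One week, per the goal stated above.
MINIMUM_NOTICE = timedelta(weeks=1)


def has_sufficient_notice(announced_at, window_start, minimum=MINIMUM_NOTICE):
    """Return True if the announcement precedes the window by at least `minimum`."""
    return window_start - announced_at >= minimum
```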

 

The Form:

                Now, this is just an example - add and remove items as needed.

 

 

 

Incident Summary

This section contains the basic incident details.

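Incident Name ________________________________________________

Incident Reference # ________________________________________________

Date/Time Service Went Down ________________________________________________

Date/Time Incident Recorded ________________________________________________

Discovered/Reported By ________________________________________________

Incident Manager ________________________________________________

Communications Manager ________________________________________________

Systems Affected ________________________________________________

System Owner ________________________________________________

Populations Affected ________________________________________________

Symptoms/Problem Description ________________________________________________

Staff Involved ________________________________________________

Determined Cause ________________________________________________

Resolution ________________________________________________

Date/Time Service Restored ________________________________________________

Outage Duration ________________________________________________

Notes ________________________________________________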


 

Incident Log

Using this form, list all significant activities, communications, updates, staff involvements, escalation steps, complications, and other events relevant to understanding the sequence of events and how they affected the outcome of the incident.

Person Filling Out Form ________________________________________________

 

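Date/Time                 Event or People Contacted

________________          ________________________________________________

________________          ________________________________________________

________________          ________________________________________________

(Add rows as needed.)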


Copyright © 2004, The Binary Freedom Project, LLC.