Policies and Procedures, Part 8 - Incident Management Process
In part seven of this series
we completed a three-article cycle looking at the lifecycle of a server and at
the associated processes that accompany each phase. Here in part eight we
begin a look at production processes, starting with the Incident
Management process.
Why an Incident Management Process?
When an emergency occurs, it is hard to
know what to do first. On the other hand, it is impossible to write a process
to account for every emergency. There are many process standards such as ITIL
and Six Sigma, but they can be difficult to adopt and, particularly in small
and medium organizations, difficult to adapt to the environment. What we will
review here is a basic ITIL-based adaptation designed for moderately sized
organizations.
Without an Incident Management Process:
Without a standardized process, the system
administration team goes ahead and attempts to deal with what appears to be a
service disruption. Services aren't running on a critical application server,
so the team is notified and responds. Restarting the service doesn't work, so
the team tries a reboot. Meanwhile, the application administrator, who got a
call about the application being down, has already connected to the server and
is trying to fix the issue with a vendor patch. The reboot corrupts the patch
install and leaves the application in limbo between the patched version and
the un-patched version. This results in the server having to be reinstalled.
The need for redundancy becomes apparent, but ultimately the system
administrator gets in trouble for not realizing what the application
administrator failed to communicate: that he was in there working on the
system. Sure, the application administrator should have said something, but he
has the ever-wily response, "Well, they should have asked..." An Incident
Management process would have avoided this confusion.
Process:
The process depends on your organization.
I'd recommend steps similar to the following.
First, set a date with all involved to
review a standard set of response procedures, along with the incident review
meetings and reports accompanying them (examples below). In order for these
processes to work, they must become standard policy, followed by all relevant
groups within the organization.
This policy should define the
remediation steps and decision point actions for the following components
constituting the service disruption and incident response life cycle:
- Incident Discovery
- Incident Evaluation and Restoration Planning
- Internal Support Staff Notification
- Client Notification
- Incident Documentation
- Incident Escalation
- Incident Resolution
- General Communications Steps
- Service Disruption Report
- Monthly Incident Review Meeting and Process Modification
Service Disruption Policy at a Glance
These procedures and processes should
evolve as capabilities are added to the process and modifications are made.
For the moment, however, the following steps are required when an incident
develops. Because all events are different and there is a natural conflict
between taking time to document and communicate an issue vs. solving it,
individual judgment needs to be applied to each incident on what happens when.
1. Begin filling out the Incident Log with the date and time of events up to
   that point.
2. Notify the user support group of the situation.
3. Assess the situation and begin information gathering and troubleshooting.
4. Bring on other internal resources if necessary, including the system
   technical contacts. Hand off the issue to the application support group if
   the issue appears to be application based.
5. Update the Application Status Web Page or Phone Line (if you have one)
   with information on the systems affected.
6. Notify the next level of management.
7. If outage persists, hold status review conference with system owner,
   management, and technicians.
8. Attend monthly meetings to review Incidents and make application process
   changes.
9. Monthly Report issued for head of organization.
Defining a Service Disruption
For the purposes of these procedures, a
service disruption is any planned or unplanned systems downtime that lasts
longer than 15 minutes or has an immediate or serious impact on an
organization-wide system or IT-based service during usage hours.
Typical incidents could include:
- An organization-wide service such as email, domain or the network becoming
  unavailable.
- Losing a departmental service such as PeopleSoft.
- Server or communications issues that result in severely degraded application
  services.
- The detection of a large scale network attack, virus infestation, or other
  disrupting event.
Although departmental application
downtime may have less effect than organization-wide systems disruption, the
procedures of this guide will still be followed.
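As a rough illustration, the definition above can be written as a simple check. This is only a sketch: the `Outage` record and its field names are hypothetical, and it assumes one reading of the definition, namely that an outage qualifies either by lasting longer than 15 minutes or by having an immediate or serious impact during usage hours.

```python
from dataclasses import dataclass

# Threshold taken from the definition above; adjust to local policy.
DOWNTIME_THRESHOLD_MINUTES = 15


@dataclass
class Outage:
    """Hypothetical outage record; the field names are illustrative only."""
    downtime_minutes: float
    serious_impact: bool       # immediate or serious impact, as judged by staff
    during_usage_hours: bool   # the impact falls within usage hours


def is_service_disruption(outage: Outage) -> bool:
    """Return True if the outage meets the service disruption definition."""
    if outage.downtime_minutes > DOWNTIME_THRESHOLD_MINUTES:
        return True
    return outage.serious_impact and outage.during_usage_hours
```

Under this reading, a five-minute email outage in the middle of the business day would still count as a service disruption, while a five-minute outage overnight would not.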
Roles and Responsibilities
Each service disruption is assigned to
an Incident Manager for the duration of the event. This role can be re-assigned
as circumstances warrant (e.g., from the Server team to the Application Admin).
In addition to the Incident Manager, a
Communications Manager will also be assigned if escalation warrants. These
parallel efforts reflect the need of the Incident Manager to stay focused on
resolving the incident.
Incident Manager - This role is responsible for
maintaining ownership of all aspects of the event. This includes assessment,
troubleshooting, escalation, logging, resolving, and closing out the event. By
default, these duties will be performed by the discovering on-call technician,
but can be reassigned by management.
Communications Manager - This individual is
responsible for communicating all aspects of the incident to individuals inside
and outside of the IT department. This includes other support staff, IT
organizations and the customer community. In some cases this role is handled by
the Incident Manager if it is a relatively simple and short-lived disruption.
This individual works with the Incident
Manager, obtaining and posting status updates and estimated time of service
restoration. Communications include but are not limited to:
- Individual phone calls
- Coordinating conference calls
- Posting information on the web
- Email broadcasts
- Voice-mail broadcasts
Unplanned Service Disruption Lifecycle
Incident Discovery
The initial detection of incidents can originate from a variety of sources: IT
department staff might be paged, the user support group might notice the
problem, or a user of the system might initiate the complaint.
Ideally, notification will first come as a predictive
failure via an event management system such as Nagios or Altiris. Regardless
of the source, all incidents must start a standard process that will focus on
resolving and documenting the problem throughout the life of the event.
It is the responsibility of the IT department employee
who discovers or is notified of a service disruption to assume the role of
Incident Manager and initiate the following procedures:
- Note the date and time of the initial discovery.
- Verify the nature and extent of the service disruption.
  For example, is it limited to a single user or workstation or a global
  disruption? Is the system up but the network down? Who is affected?
- Once an assessment has been made, directly notify the user
  support group supervisors, and the Management of the Server team.
- Begin filling out the Service Disruption Log.
*Note: A voice mail or e-mail
message is not considered direct communication. The discovering employee should
speak directly with the next person in the chain or escalate it until an
individual is notified in person.
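The direct-communication rule in the note above can be sketched as a walk along an escalation chain. This is a minimal sketch, not a prescribed implementation: the `try_contact` callback and the outcome strings ("spoke", "in_person", "voicemail", "email") are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence

# Outcomes that count as direct communication in this sketch.
# Voice mail and email are deliberately excluded, per the note above.
DIRECT_OUTCOMES = {"spoke", "in_person"}


def notify_directly(
    chain: Sequence[str],
    try_contact: Callable[[str], str],
) -> Optional[str]:
    """Walk the escalation chain until someone is reached directly.

    `try_contact` is a hypothetical callback that attempts to reach a person
    and returns how the attempt ended, e.g. "spoke", "voicemail", or "email".
    Returns the first person reached directly, or None if nobody was.
    """
    for person in chain:
        if try_contact(person) in DIRECT_OUTCOMES:
            return person
    return None
```

A `None` result means the whole chain was exhausted without a direct contact, which is itself worth recording in the Service Disruption Log.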
Incident Evaluation and Restoration Planning
The aim of the incident evaluation and restoration planning process is to
identify how the incident can be understood and resolved as quickly as
possible. The Evaluation and Restoration form contains the following:
- Statement of the "problem" as known at this time.
- Contacting other appropriate staff, communicating the
  problem and soliciting input.
- Assign and contact the Communications Manager.
- Breakdown of the incident detailing components,
  interfaces, and likely causes. Verify or rule out causes.
- Break out and assign subtasks. Correlate probable options
  and begin applying remediation.
- All activities and communications are logged throughout
  the event.
Internal Support Staff Notification
Using a contact
list of individuals by functional area of responsibility, initiate contacts via
pager, cell phone, or other interactive means. Set up and schedule conference
calls if necessary. Define requirements for either on-site or off-site
presence.
Client Notification
Either the Incident Manager or the
Communications Manager will contact the system owner defined in the service
SLA. Together they will define and craft the communications strategy for
alerting the users of the system.
Information will be posted on the System
Availability website and either voice mail or email communications will be
discussed with management. In the case of significant systems downtime, a
predefined calling tree dialing system should be activated.
Incident Documentation
Using the forms
found in the Appendix, the Incident Manager will keep documentation and logs up
to date.
Incident Escalation
If management escalation is required
during the disruption, the Incident Manager will either contact the next
management level or request that the Communications Manager do this. Situations
where upper management learns of a serious disruption from sources outside their
organization are to be avoided.
Incident Resolution
The Incident Manager and response team
should hold a review meeting at regular intervals to discuss progress, update
the restoration plan, and document the process. The interval between review
meetings should vary depending on the amount of activity occurring. For example,
during an intense phase of investigation, the meetings should be more frequent
than during a later period when the staff is simply waiting for confirmation
that the resolution actions have been successful.
Regular management updates should be provided to keep
all parties informed, discuss progress, and decide if escalation is required.
The Incident Manager should facilitate all management team progress meetings.
When service has been restored the
Communications Manager is responsible for posting this information on Pulse and
following up with anyone contacted during the event.
General Communication Steps
The
objective of the communications plan in Appendix A is to provide direction to
the Communications Manager. It should cover:
- Who needs what level and frequency of information?
- Contact details for all parties affected and in the management chain.
- The different types of updates that will be required for the different
  groups affected by and working on the incident.
- How often each type of update is required.
- Who is authorized to sign-off on content?
- The number and membership of the conference calls.
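One way to make a plan like this easy to act on during an incident is to keep it as data. The sketch below is purely illustrative: the audiences, detail levels, intervals, and approvers are placeholder values rather than recommendations, and `UpdateRule` and `updates_due` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class UpdateRule:
    audience: str         # who needs the information
    detail_level: str     # what level of information they receive
    interval: timedelta   # how often this type of update is required
    approver: str         # who is authorized to sign off on content


# Example entries only; replace with the contents of your own plan.
PLAN = [
    UpdateRule("management", "summary", timedelta(hours=1), "Incident Manager"),
    UpdateRule("response team", "technical", timedelta(minutes=30), "Incident Manager"),
    UpdateRule("affected users", "status and ETA", timedelta(hours=2), "System Owner"),
]


def updates_due(
    plan: List[UpdateRule],
    last_sent: Dict[str, datetime],
    now: datetime,
) -> List[UpdateRule]:
    """Return the rules whose interval has elapsed since the last update.

    `last_sent` maps an audience to the time of its most recent update;
    audiences with no entry are treated as due immediately.
    """
    due = []
    for rule in plan:
        previous = last_sent.get(rule.audience)
        if previous is None or now - previous >= rule.interval:
            due.append(rule)
    return due
```

The Communications Manager could run a check like this at each status review to see which groups are owed an update and who must approve it.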
Service Disruption Report
The incident owner will complete the
Service Disruption Report form within two business days of the restoration of
services. The report will include: affected service, nature of the disruption,
reason for the disruption, length of time between the onset and resolution, and
any recommendations for future improvement.
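The two-business-day deadline can be computed mechanically. The following is a minimal sketch that assumes business days are Monday through Friday and ignores public holidays; the function names are illustrative.

```python
from datetime import date, timedelta

REPORT_DEADLINE_BUSINESS_DAYS = 2  # from the policy above


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Only Monday through Friday count; holidays are not considered here.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def report_due_date(restored_on: date) -> date:
    """Date by which the Service Disruption Report should be completed."""
    return add_business_days(restored_on, REPORT_DEADLINE_BUSINESS_DAYS)
```

For example, if service is restored on a Friday, the report falls due on the following Tuesday.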
Incident Review and Process Modification
Monthly meetings will be held to review
service disruptions and how they were responded to and resolved. All the
documented incidents of the previous month will be discussed and the following
agenda items will be addressed:
- If possible, the Incident Manager reviews the incident and the major events
  logged.
- Discuss the success of the overall response, the issues affecting system
  recovery, the quality of communications, and a general summary of what was
  learned.
- Discuss why the incident occurred in the first place. What were the root
  causes? Is a recurrence likely and, if so, what can be done to prevent it?
  Document any actions identified for follow-up.
- Discuss whether the issue was detected soon enough. Could it have been
  recognized sooner?
- Underlying issues and corrective actions identified during the major
  incident review should be recorded and tracked by problem management. It is
  important that problem management have the authority and management backing
  to ensure that the corrective actions are progressed by all parties.
- A management report will be developed to provide a summary of the documented
  incidents.
Planned Service Disruption Policy
Maintenance and planned service downtime windows require
advance notice to the customers affected. It is our goal to provide at least a
week's notice of systems downtime. The distribution list should include
clients, internal IT department staff, and the staff of other affected IS
organizations.
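The notice goal is easy to check when a maintenance window is scheduled. A minimal sketch, assuming "a week" means seven full days between the announcement and the start of the window; the function name is hypothetical.

```python
from datetime import datetime, timedelta

MINIMUM_NOTICE = timedelta(days=7)  # "at least a week's notice"


def has_sufficient_notice(announced_at: datetime, window_start: datetime) -> bool:
    """Return True if the announcement precedes the window by at least a week."""
    return window_start - announced_at >= MINIMUM_NOTICE
```

A change calendar or scheduling script could call this before accepting a new window and flag any request that falls short of the goal.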
The Form:
Now, this is just an example - add and remove items as
needed.
This section contains the basic incident details.
| Incident Name | |
| Incident Reference # | |
| Date/Time Service Went Down | |
| Date/Time Incident Recorded | |
| Discovered/Reported By | |
| Incident Manager | |
| Communications Manager | |
| Systems Affected | |
| System Owner | |
| Populations Affected | |
| Symptoms/Problem Description | |
| Staff Involved | |
| Determined Cause | |
| Resolution | |
| Date/Time Service Restored | |
| Outage Duration | |
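If the form is kept electronically rather than on paper, a few of its fields can be modeled as a small record so that the outage duration is derived from the two timestamps instead of being typed in. This is only a sketch covering a subset of the fields; the class and attribute names are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class IncidentRecord:
    """Hypothetical electronic version of part of the incident form."""
    name: str
    reference: str
    service_down_at: datetime           # Date/Time Service Went Down
    recorded_at: datetime               # Date/Time Incident Recorded
    incident_manager: str
    systems_affected: List[str] = field(default_factory=list)
    service_restored_at: Optional[datetime] = None  # unset while ongoing

    def outage_duration(self) -> Optional[timedelta]:
        """Return the outage duration, or None if service is not yet restored."""
        if self.service_restored_at is None:
            return None
        return self.service_restored_at - self.service_down_at
```

Deriving the duration this way keeps the Outage Duration field consistent with the down and restored times if either is corrected later.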
Notes
Using this form, list all significant activities, communications, updates,
staff involvements, escalation steps, complications, and other events relevant
to understanding the sequence of events and how they affected the outcome of the
incident.
Person Filling Out Form ________________________________________________
| Date/Time | Event or People Contacted |
| | |
| | |
| | |
| | |
| | |
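The notes table can also be kept as a simple timestamped log. The sketch below is one possible shape for it, with hypothetical names; it keeps entries in time order so that notes written up after the fact still read as a single sequence of events.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class IncidentLog:
    """Hypothetical in-memory version of the notes table."""
    filled_out_by: str
    entries: List[Tuple[datetime, str]] = field(default_factory=list)

    def add(self, when: datetime, event: str) -> None:
        """Record an event or contact and keep the log in time order."""
        self.entries.append((when, event))
        self.entries.sort(key=lambda entry: entry[0])

    def as_rows(self) -> List[str]:
        """Format entries as 'Date/Time | Event or People Contacted' rows."""
        return [f"{when:%Y-%m-%d %H:%M} | {event}" for when, event in self.entries]
```

The rows returned by `as_rows` can be pasted straight into the Service Disruption Report or reviewed at the monthly incident meeting.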