The Urgent Incident Process

Updated by Justin Lentz on Jul 25, 2017

Description (goal/purpose)

What qualifies as an urgent incident? At CenturyLink Cloud, the accepted definition is any incident that impacts two or more customers, or an entire product or service. These incidents qualify because they can affect a large amount of functionality for customers. They take first priority with the Customer Care team. We will respond throughout the incident, as per the service agreement, to keep customers informed of the current status.
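
The definition above reduces to a simple check. The following is a minimal sketch for illustration only, not part of any CenturyLink Cloud tooling; the `Incident` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields, chosen for illustration only.
    affected_customers: int            # number of customers impacted
    entire_service_down: bool = False  # True if a whole product or service is impacted


def is_urgent(incident: Incident) -> bool:
    """Return True when the incident meets the urgent definition:
    two or more customers impacted, or an entire product or service impacted."""
    return incident.affected_customers >= 2 or incident.entire_service_down
```

Under these assumptions, an incident reported by a single customer is not urgent unless it also takes down an entire product or service.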

Roles defined as needed during Urgent Incidents

Incident Commander – The Incident Commander's role is to assemble the team and work with the CAPCOM, executives, and engineers to investigate, mitigate, monitor, and ultimately resolve an incident. The Incident Commander makes sure restoration of service is always the top priority.

CAPCOM – This role's title was adapted from NASA's "capsule communicator", the person dedicated to communicating with the crew of a spaceflight. The CAPCOM is the heart of communication with customers. While engineers work to solve the problem and the Incident Commander coordinates resolution, the CAPCOM makes sure customers stay informed. The CAPCOM also communicates workarounds for current customer issues as they are found by Customer Care, ensuring that the Customer Care team is doing everything it can to mitigate the side effects of urgent incidents.

Subject Matter Expert (SME) – An engineer whose main role is the management and improvement of the impacted service.

Executive – An executive leader familiar with the CLC environment.

The stages of an Urgent Incident

Investigating – The first stage of the process. As soon as it is determined that an incident meets the urgent classification, the Incident Commander and the CAPCOM are paged for immediate engagement. The Incident Commander evaluates the situation and pages the SME if they are not already involved. The SME has 15 minutes to identify the issue and formulate a plan for restoration. If they are unable to do so within 15 minutes, another SME is brought in to assist. This 15-minute cycle continues until the cause of the incident is identified and a restoration path is agreed upon.
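
The escalation cycle in the Investigating stage can be sketched as a loop. This is a minimal illustration under stated assumptions, not the actual paging system: `run_investigation`, `sme_roster`, and `plan_ready` are hypothetical names, the 15-minute wait is represented by one call to `plan_ready` per cycle rather than a real timer, and raising an error when the roster runs out is an assumption, since the process does not say what happens in that case.

```python
from typing import Callable, List

CYCLE_MINUTES = 15  # length of one investigation cycle, from the process description


def run_investigation(
    sme_roster: List[str],
    plan_ready: Callable[[List[str]], bool],
) -> List[str]:
    """Engage SMEs one at a time until a restoration plan is ready.

    plan_ready is a hypothetical callback, called once at the end of each
    15-minute cycle with the SMEs engaged so far. It returns True when the
    cause is identified and a restoration path is agreed upon.
    Returns the list of SMEs who were engaged.
    """
    if not sme_roster:
        raise ValueError("at least one SME is required")

    engaged = [sme_roster[0]]  # the first SME is paged by the Incident Commander
    next_index = 1

    while not plan_ready(engaged):
        # The cycle ended without a plan: bring in another SME to assist.
        if next_index >= len(sme_roster):
            # Assumption: the source does not define this case.
            raise RuntimeError("no further SMEs available to engage")
        engaged.append(sme_roster[next_index])
        next_index += 1

    return engaged
```

For example, if a plan is only reached once three SMEs are engaged, the loop runs three cycles (45 minutes in the real process) and returns the first three names from the roster.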

Identified – The cause of the incident is known, but the restoration plan is still being formulated.

Restoring – We are in the process of restoring services.

Monitoring – We have completed the restoration plan and are now doing an initial check to make sure the platform is working as expected. During this time, we also ask all customers who reported an issue to verify their services are fully functional.

Resolved – All testing is complete and services are online. After the resolution is confirmed, a retrospective meeting is scheduled to review the incident and determine what steps are needed to prevent a recurrence.
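
The five stages above can be modeled as an ordered enumeration. This is a minimal sketch that assumes a strictly linear progression; the process description lists the stages in order but does not say whether an incident can move backward (for example, from Monitoring back to Restoring), so that case is not modeled. The names `Stage` and `next_stage` are hypothetical.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    INVESTIGATING = 1
    IDENTIFIED = 2
    RESTORING = 3
    MONITORING = 4
    RESOLVED = 5


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows the given one, or None once Resolved.

    Assumes stages only move forward, one step at a time.
    """
    if stage is Stage.RESOLVED:
        return None
    return Stage(stage.value + 1)
```

A status page or notification template could use such an enumeration to make sure customer updates always name one of the five defined stages.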