Reading Time: About 6 minutes
Our Customer Care team at CenturyLink Cloud® is unconventional in many ways; one is their use of internally built automation that makes their tasks simpler and more productive. They have a whole list of scripts, tools, apps, and bots that they developed to improve their efficiency. The team created these tools so that they can spend more time helping customers instead of performing repetitive tasks; their continuous improvement process helps them increase their pace and decrease their customers' time to resolution. One of these many tools is Skynet, a Slack bot developed to orchestrate and handle the urgent incident requests received by the team.
So, what qualifies as an urgent incident? The accepted definition at CenturyLink Cloud is any incident that affects two or more customers or an entire product or service. These incidents are considered urgent because they can affect a large amount of functionality for customers. They take top priority, with the Customer Care team responding throughout the incident to keep customers informed of the working status in a timely manner.
The Skynet bot was designed to officially announce and prepare these incidents by automating all of the tasks needed to assemble the team, channels, tickets, history, and incident details. This helps not only the team but also the customers they serve, by substantially decreasing the time it takes to engage engineers to begin working on a solution. The engineers on the Customer Care team use PagerDuty, Slack, Zendesk, and Trello to orchestrate all urgent incidents.
Before the creation of Skynet, it would take several minutes just to get through the initiation process for an incident: engineers would have to collect the details, log in to Zendesk and create a ticket, create a Slack channel for the incident, and log in to PagerDuty to alert all the appropriate engineers, the incident commander, and the executive oversight to work the incident. The Skynet bot performs all of these activities, ensuring the engineers can get to work on the incident right away.
When an urgent incident is detected by an engineer, a "master ticket" must be created in Zendesk, which contains the incident timeline as written by the Incident Commander. Customer tickets are linked to the master for tracking. The Slack channel is used by various engineers and leaders to communicate during service restoration. PagerDuty is used to add contributors to the incident. Finally, Trello is used to log continuous improvement items (e.g. requests, concerns, ideas) for the Incident Retrospective meeting. Skynet not only prepares these tools, it allows for their continued use during the incident.
Anyone in the organization who has detected an issue can simply go into Slack and type "@skynet: fire" into any channel to signify an urgent incident. Skynet immediately gathers information on the incident by asking the initiator a series of questions, and then creates the previously mentioned ticket, channel, and Trello board. Then it pages the appropriate on-call engineers with the current information and any links. Those same details are automatically communicated in Customer Care's dedicated channel so all engineers are made aware. Shortly after the page goes out, the on-call "Incident Commander" joins the channel and begins coordinating the incident. The commander alerts the Skynet bot that they are orchestrating by typing "@skynet: I am now IC" into the Slack channel. This places the ticket in their name, adds them to the timeline, and makes them visible to others in the channel. Their role is to assemble the team and work with the CapCom, executives, and engineers to investigate, mitigate, monitor, and ultimately resolve an incident. Here is an example of the Skynet bot flow (alpha-skynet is a test bot used to evaluate the system):
There is also a "CapCom" assigned to the incident. This role's title was stolen from NASA's "capsule communicator", a person who is dedicated to communications with the crew of a spaceflight. They work with Skynet by typing "@skynet: I am CapCom", which assigns all linked customer tickets to them. This person is responsible for all messages to customers and outside stakeholders. They send an update on the issue every 30 minutes and monitor customer tickets in Zendesk to ensure they all get linked to the master ticket. They also respond to questions on those tickets as they come in.
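As a rough illustration of how the chat commands quoted above ("@skynet: fire", "@skynet: I am now IC", "@skynet: I am CapCom") might be recognized, here is a minimal Python sketch. It is not Skynet's actual code: the function name, the action names, and the matching rules are all assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical mapping from chat phrases to internal action names.
# The phrases come from the article; the action names are invented here.
COMMANDS = {
    "fire": "declare_incident",
    "i am now ic": "assign_incident_commander",
    "i am capcom": "assign_capcom",
}

# Matches "@skynet: <phrase>" or "@skynet <phrase>", ignoring case.
MENTION = re.compile(r"^\s*@skynet(?::\s*|\s+)(.*?)\s*$", re.IGNORECASE)


def parse_command(message: str) -> Optional[str]:
    """Return the action name for a bot command, or None if there is none."""
    match = MENTION.match(message)
    if match is None:
        return None  # the message is not addressed to the bot
    # Lowercase and collapse repeated whitespace before the lookup.
    phrase = " ".join(match.group(1).lower().split())
    return COMMANDS.get(phrase)
```

A real bot would then dispatch on the returned action, for example by starting the question prompts for a new incident or by reassigning tickets to the person who typed the command.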
The CapCom is the heart of the communication with the customers. As engineers work to solve the problem, the Incident Commander coordinates resolution, while the CapCom makes sure the customers stay informed. The CapCom also communicates workarounds to current customer issues as they are found by Customer Care, ensuring that the Customer Care team does everything it can to mitigate the side effects of urgent incidents.
During the incident, Skynet is listening intently. At any point the Incident Commander can page a member of any team, that team’s on-call, or the entire team. Skynet is also listening for predefined keywords during the incident. When these keywords (e.g. Action Item, Follow up, Retro) are heard, Skynet logs them and the discussion around them to the Trello board for later review. These entries prove useful during the retrospective meetings, so improvement suggestions brought up during the incident are not missed or forgotten.
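The keyword listening described above could be sketched as follows. This is only an illustration under stated assumptions: the keyword list is taken from the examples in the text, while the function name, the matching rules, and the choice to capture one message of context on either side are hypothetical. Posting the results to Trello is left out.

```python
import re
from typing import Dict, List

# Keywords from the article's examples; the real list is not published.
KEYWORDS = ["Action Item", "Follow up", "Retro"]

# One case-insensitive pattern per keyword. Words may be separated by
# spaces or hyphens, so "follow-up" also counts as "Follow up".
_PATTERNS = {
    kw: re.compile(
        r"\b" + r"[\s-]+".join(re.escape(w) for w in kw.split()) + r"\b",
        re.IGNORECASE,
    )
    for kw in KEYWORDS
}


def collect_retro_items(messages: List[str], context: int = 1) -> List[Dict]:
    """Find messages containing a keyword and keep the nearby discussion."""
    items = []
    for i, text in enumerate(messages):
        hits = [kw for kw, pattern in _PATTERNS.items() if pattern.search(text)]
        if hits:
            start = max(0, i - context)
            items.append({
                "keywords": hits,
                # The matching message plus `context` messages on each side.
                "discussion": messages[start:i + context + 1],
            })
    return items
```

Each returned item could then become a card on the incident's board. The whole-word matching means that "retrospective" alone does not trigger the "Retro" keyword, which is a design choice of this sketch rather than documented behavior.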
Once the incident is resolved, Skynet steps in again to orchestrate the closing of the ticket. It creates a timeline and incident detail by pulling the incident history from the master ticket, the notes taken, the linked customer tickets, and all of the timestamps. The Incident Commander closing the incident fills in details regarding the issue, impact, and resolution.
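To make the timeline step concrete, here is a small Python sketch of merging timestamped entries from several sources into one chronological list. The event shape, the source labels, and the output format are assumptions for illustration; how Skynet actually pulls and formats this data is not described here.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

# Hypothetical event shape: (timestamp, source, note).
Event = Tuple[datetime, str, str]


def build_timeline(*sources: Iterable[Event]) -> List[str]:
    """Merge events from every source and list them in time order."""
    events = sorted(
        (event for source in sources for event in source),
        key=lambda event: event[0],  # sort by timestamp only
    )
    return [f"{ts:%Y-%m-%d %H:%M} [{source}] {note}" for ts, source, note in events]
```

For example, entries from the master ticket and from the notes taken during the incident can be passed as separate lists, and the result is a single ordered timeline ready to be copied wherever it is needed.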
Skynet copies all of the information and timeline into a Trello board, so that the team can "retro" on the incident and reflect on how they can better handle similar issues in the future. That info is placed into the master ticket for the Incident Response (IR), the Root Cause Analysis (RCA) ticket for the Subject Matter Expert (SME) that resolved the issue, and into the meeting invites so everyone involved during or after is clear on the issue and can review the timeline.
Prior to Skynet, creating the content, tickets, and meetings and copying the information was done manually. It took several hours to officially close out a ticket. Now, it takes about 10 minutes. Besides being a remarkable benefit to the Incident Commander, this also improves the customer experience by ensuring that our retrospectives are handled as close to the incident as possible. Prior to Skynet, it was not unusual for this paperwork to be done the day following an incident; now it is run right away, whether at 2 p.m. or 4 a.m., and the retros are scheduled for the same or the following day.
The Value of Skynet
The Customer Care team developed Skynet to help decrease response time and resolve customer issues faster, but it also has value in keeping data about incidents correct and authentic. Automation such as this is one of the many noteworthy things the team has built as part of its "continuous improvement" initiative, giving the team a unique insight into the world of customer help and care. This continuous improvement, while benefiting the team in their efficiency, also benefits the customer, as more efficiency means more time for engineers to spend on customers. That is the true value of Skynet and the other tools the Customer Care team has developed.
Check out our CloudTalk podcast for a more in-depth discussion of Skynet.
We’re a different kind of cloud provider – let us show you why.