A previous post discussed our approach to real-time anomaly detection. Recall that for every unique stream of data, we predicted what the actual value should be. Then, if an actual value differed from our expectations, we flagged it as anomalous. Combining this anomaly detection approach with our real-time, faceted dashboard allows analysts to easily pinpoint areas of the cloud infrastructure that are behaving peculiarly.
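To make that idea concrete, here is a minimal sketch of the prediction-versus-actual check. The forecast value, standard deviation, and tolerance below are hypothetical numbers chosen for illustration, not the thresholds our production models use.

```python
# Minimal sketch of the prediction-versus-actual check from the previous post.
# The forecast, standard deviation, and tolerance are hypothetical values.
def is_anomalous(predicted: float, actual: float,
                 stddev: float, tolerance: float = 3.0) -> bool:
    """Flag a point when it deviates from the forecast by more than
    `tolerance` standard deviations."""
    return abs(actual - predicted) > tolerance * stddev

# A stream forecast at 100 with a standard deviation of 5:
print(is_anomalous(predicted=100.0, actual=118.0, stddev=5.0))  # True
print(is_anomalous(predicted=100.0, actual=104.0, stddev=5.0))  # False
```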

This article expands on anomaly detection and dives into further detail on how we use data to detect customer-impacting events and prevent them from taking place. Additionally, we’ll discuss how we further automate the analysts’ search process, determine the presence of customer impact, and integrate our predictions with the rest of the business to facilitate resolution. This post will cover:

  • The similarities and differences between the lifecycle of data science and that of system architects
  • How we detect the presence of unknown events
  • How to determine whether an event is potentially impacting
  • How to integrate this information into the business’ operations

The Lifecycle of Data Science

There are at least as many definitions of data science as there are data scientists. At its core, data science requires both data and science. This means that these professionals are intimately familiar with extracting, cleaning, processing, and analyzing data. It also means that they are familiar with the research methods necessary to analyze the complex technical and social settings characteristic of modern business. Rather than splitting hairs over skill sets, we can instead focus on the lifecycle of data science as it relates to the business lifecycle.

The goal of an applied scientist is to move from theory towards application. Data scientists take theories from external sources, such as research papers or domain experts, and test them. They build prototypes suited to the business scenario at hand and create models useful for explaining future business problems. These models should facilitate and guide domain experts. Sometimes these models turn into recipes that the business can use to automate some, or all, of this research. If successful, these automated solutions can help reduce ambiguity in future, unknown settings.

Creating a Search Engine

Within the CenturyLink Cloud, one such model we have built strives to help prevent or mitigate incidents by automating our analysts’ ability to search for problems. Many customer-impacting incidents create patterns that our analysts can clearly detect upon visual inspection. This led us to believe that we could successfully build an incident-detection technique. However, across incidents there are often only slight similarities in the historical data for the involved devices. In addition, it has been difficult to determine all of the devices involved in any given incident. For these reasons, we knew that pattern-detection algorithms would not be successful in future, unknown scenarios.

Figure 1: This plot shows the 12 hours of data preceding an incident for data streams representing involved devices. Each data stream is normalized. Note the concomitant change in behavior during the onset of the incident around 4pm. The yellow marks indicate the presence of anomalies in these time series data sets.

example_concommitant_variance.png

Fortunately, we could still leverage the success we had in directing analyst attention to areas of the cloud fabric undergoing concomitant or peculiar behavior. We discovered that by reasoning both vertically across devices and horizontally across time, we could begin to automate some of the analysts’ search processes and detect periods of instability. In effect, this algorithm uses unsupervised learning to create a search engine.

Figure 2: A histogram plot of the presence of anomalies across involved actors during an incident. Each stream of data is plotted along the x-axis, with time along the y-axis. Anomalies are noted via red markings. Note the alignment of anomalies across devices during the incident, as well as the leading indications of instability.

example_discretizerd_anomaly_plot.png
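To illustrate the vertical-and-horizontal reasoning behind Figure 2, the sketch below bins each stream’s anomalies into time windows and flags windows in which several distinct devices misbehave at once. The window size, device threshold, and toy data are assumptions made for the example, not our production settings.

```python
# Illustrative sketch of reasoning "vertically" across devices and
# "horizontally" across time: flag windows where many streams are anomalous.
from collections import defaultdict
from typing import Dict, List, Tuple

def unstable_windows(anomalies: Dict[str, List[Tuple[int, bool]]],
                     window_seconds: int = 300,
                     min_devices: int = 3) -> List[int]:
    """Return window start times (epoch seconds) where at least `min_devices`
    distinct streams reported an anomaly inside the same window."""
    devices_per_window = defaultdict(set)
    for device, series in anomalies.items():
        for timestamp, is_anomaly in series:
            if is_anomaly:
                window_start = timestamp - (timestamp % window_seconds)
                devices_per_window[window_start].add(device)
    return sorted(start for start, devices in devices_per_window.items()
                  if len(devices) >= min_devices)

# Toy data: three devices flag anomalies within the same five-minute window.
toy = {
    "host-a": [(1000, True)],
    "host-b": [(1100, True)],
    "host-c": [(1150, True), (2000, False)],
}
print(unstable_windows(toy))  # [900]
```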

However, whereas a typical search engine finds similarities in text, this approach finds similarities in sensor signals. We can find patterns that look similar to known bad patterns, or we can group data streams based upon their similarities. The algorithm keeps track of the overall state of each data center in real time and, in streaming fashion, updates its beliefs regarding the stability of that data center. Because it updates this state incrementally rather than recomputing it each time new data arrives, the algorithm is also highly scalable. Finally, we can draw inferences across different groups of algorithms by using a graphical inference technique for community detection. This results in multiple sets, each of which could qualify as an incident. By performing significance tests on each set, we are left with a smaller set of candidate incidents.

Figure 3: This image shows an example graph containing nodes {A,B,C,D,E,F,G,H}. Here, node C, the search node, is linked to other nodes that recently exhibited similar behaviors. The set of nodes {A,B,C}, demarcated via a dashed line, represents the resulting set of significant nodes.

example_incident_graph.png
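As a rough sketch of this grouping step, the snippet below builds a similarity graph like the one in Figure 3 and runs an off-the-shelf community detection routine from networkx to propose candidate incidents. The similarity scores, threshold, choice of greedy modularity, and the minimum-size filter (a crude stand-in for the significance test) are all illustrative assumptions rather than a description of the production pipeline.

```python
# Rough sketch: link data streams with similar recent behavior, then treat
# each sufficiently large community as a candidate incident.
import networkx as nx
from networkx.algorithms import community

def candidate_incidents(similarities, threshold=0.8, min_size=3):
    """`similarities` is an iterable of (stream_a, stream_b, score) tuples."""
    graph = nx.Graph()
    for a, b, score in similarities:
        if score >= threshold:
            graph.add_edge(a, b, weight=score)
    groups = community.greedy_modularity_communities(graph, weight="weight")
    return [set(group) for group in groups if len(group) >= min_size]

# Toy example mirroring Figure 3: A, B, and C behaved alike recently.
pairs = [("A", "B", 0.90), ("B", "C", 0.85), ("A", "C", 0.88),
         ("D", "E", 0.40), ("F", "G", 0.30)]
print(candidate_incidents(pairs))  # e.g. [{'A', 'B', 'C'}]
```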

Canary in the Cloud Mine

Just because we can identify groups of devices behaving peculiarly does not necessarily mean that we have identified a customer-impacting incident. Several normal types of events can cause devices to appear to behave peculiarly. For example, the deployment of a new, large-scale application could draw heavily upon cloud resources. For this reason, it is also necessary to connect a candidate incident to operational data that could indicate the presence of customer impact.

To accomplish this, we had to build a ghost in the machine: the canary. These canaries are rudimentary virtual machines that perform simple, simulated customer actions. Each canary sits inside a much larger computer. The idea is that if the canary is having problems on that computer, then some of the other tenants may also be having problems. The collective success or failure of these actions indicates whether the cloud fabric is stable for customer operations.
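Conceptually, each canary boils down to something like the following sketch: run a handful of simulated customer actions and report whether too many of them fail. The toy actions and the failure threshold here are hypothetical placeholders for the real operations.

```python
# Conceptual sketch of a canary check; the actions and threshold are stand-ins.
from typing import Callable, List

def canary_check(actions: List[Callable[[], bool]],
                 failure_threshold: float = 0.5) -> bool:
    """Return True when this host still looks healthy to the canary."""
    results = [action() for action in actions]
    failure_rate = 1.0 - sum(results) / len(results)
    return failure_rate < failure_threshold

# Stand-ins for simple customer-like operations.
def fake_login() -> bool:
    return True

def fake_disk_write() -> bool:
    return True

print(canary_check([fake_login, fake_disk_write]))  # True: the canary is healthy
```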

At this point, readers may be wondering: why bother grouping data together into candidate alerts if you have canaries set up to identify the same concept? This is a good question, and it gets at the notions of false positives and false negatives. While the presence of either a candidate incident or a canary incident may indicate a period of instability, it is the conjoint set of evidence that gives us the strong support necessary to involve humans in the investigative efforts.

Pulling the Fire Alarm

There is a reason why some fire alarms squirt permanent ink at the user: we wish to discourage false alarms. If the user is a prankster, then their fate is sealed. But if the user is a good Samaritan, then no harm, no foul. When involving a committee of highly skilled domain experts in a business process that will take them away from their otherwise packed schedules, it is important that we do so in a way that builds confidence and facilitates resolution.

We build confidence by presenting evidence that combines information from a variety of sources into a single, comprehensive report. This report supports resolution when it fits well into the existing operational life cycle. First, we build support by sharing candidate alerts in a timely fashion. We do this by placing the alert into the instant messenger application these experts use to converse with each other. Next, if elevated by those users, we place the candidate alert within the business’ ticketing system. This involves an additional set of users who determine whether we need to take action. If we do, the final lever is pulled and we page our on-call superheroes.
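The escalation ladder can be sketched roughly as follows. The chat, ticketing, and paging helpers below are hypothetical stand-ins, as are the alert text and URL; the real system plugs into whatever messenger, ticketing system, and paging tool the business already runs.

```python
# Rough sketch of the escalation ladder described above.
from dataclasses import dataclass

@dataclass
class CandidateAlert:
    summary: str
    report_url: str  # link to the comprehensive evidence report

def post_to_chat(alert: CandidateAlert) -> None:
    print(f"[chat] {alert.summary} ({alert.report_url})")

def open_ticket(alert: CandidateAlert) -> str:
    print(f"[ticket] opened for: {alert.summary}")
    return "TICKET-1"

def page_on_call(ticket_id: str) -> None:
    print(f"[page] on-call engineer paged for {ticket_id}")

def escalate(alert: CandidateAlert, elevated_by_users: bool,
             action_required: bool) -> None:
    post_to_chat(alert)                  # step 1: share in the analysts' channel
    if elevated_by_users:
        ticket_id = open_ticket(alert)   # step 2: track in the ticketing system
        if action_required:
            page_on_call(ticket_id)      # step 3: pull the final lever

# Example run with a made-up alert.
escalate(CandidateAlert("Concomitant anomalies in data center DC1",
                        "http://example.com/report/42"),
         elevated_by_users=True, action_required=True)
```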

Our work is not done. By involving the analyst in a feedback loop, we let them help us tune the algorithm: they indicate whether a candidate incident was a false positive and, if it was not, label it by root cause. This form of active learning has a second advantage: it also allows us to clearly record the devices involved in each type of incident and the duration of their involvement. This brings us closer to creating data that we can use for training future pattern-detection algorithms.
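A simple way to picture this feedback loop is a label store that records each analyst verdict so it can later feed supervised training. The data structure, incident IDs, root cause, and device names below are illustrative assumptions.

```python
# Sketch of the analyst feedback loop, assuming a hypothetical in-memory label store.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class IncidentLabel:
    incident_id: str
    false_positive: bool
    root_cause: Optional[str] = None          # set only for real incidents
    devices: List[str] = field(default_factory=list)

training_examples: List[IncidentLabel] = []

def record_feedback(incident_id: str, false_positive: bool,
                    root_cause: Optional[str] = None,
                    devices: Optional[List[str]] = None) -> None:
    """Store the analyst's verdict so future supervised models can learn from it."""
    training_examples.append(
        IncidentLabel(incident_id, false_positive, root_cause, devices or []))

# A confirmed incident, labeled with a made-up root cause and scope...
record_feedback("inc-042", false_positive=False,
                root_cause="storage saturation",
                devices=["host-a", "host-b", "host-c"])
# ...and a false positive the algorithm should learn to suppress.
record_feedback("inc-043", false_positive=True)
print(len(training_examples))  # 2
```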

Our goal is to drive increasingly towards an automated incident resolution system. Doing this requires building an intelligent system that can make its own informed decisions without explicit guidance from users. Through learning more about which types of incidents relate to which types of actions, we can begin to recommend actions for analysts to take. Over time, we can even begin to take automated actions. As humans and machines work together, we can build a future that is more scalable and more secure.
