CenturyLink Cloud facilitates data-driven, situational analysis through real-time anomaly detection at global scale.

Many practitioners and researchers are familiar with cloud computing. However, a seldom-discussed aspect of this industry is the way we use data to operate the business. From operations to marketing, our decision-makers rely on timely, accurate data about our ecosystem to make both short- and long-term decisions.

This blog post discusses the analytical side of cloud computing. It covers the global landscape of thousands of interconnected devices and how we at CenturyLink use data from those devices to help us manage a healthy cloud ecosystem. Finally, it describes one of the analytical techniques that CenturyLink Cloud uses to achieve real-time, situational awareness about the health of every device. In particular, we will cover the following topics:

  • What data do we collect?
  • How do we make use of this data?
  • How do we create smart alerts?
  • How does all of this drive value and impact our business?

Dragnet the Infrastructure, not the Consumer

Collecting signals from every smart device results in a data lake that contains many types of signals. This digital data exhaust represents the only tangible artifact we have to describe the interactions of customers on our platform. Customer privacy and security are of utmost importance. For this reason, we do not directly poll customer VMs. Instead, we poll the underlying infrastructure of the cloud fabric itself. Because every device in our cloud is connected to the network, the cloud itself is a hyperactive Internet of Things (IoT) landscape.

As you can imagine, the data we collect is both vast and highly varied. The number and disparity of metrics leads to complexity. Different message protocols require unique polling strategies. Our collectors also have to adjust the trigger-based logic used for capturing event-based messages based upon the type of device and its role in the ecosystem. However, it is not just the number of metrics that leads to complexity. The same metric on two different devices is also a source of complexity. For example, the polling rate or scaling factor for each device may differ. Finally, the deployment strategy and orchestration of the cloud fabric itself may differ slightly from place to place. For this reason, the next step to facilitating decision-making is representing all this data in a human-readable form.
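
To make that concrete, here is a minimal sketch (in Python) of the kind of normalization step this involves. The device types, field names, and scaling factors below are hypothetical, not CenturyLink Cloud's actual schema; the point is simply that samples arriving at different polling rates and scales get converted into one common, rate-based record.

    # Illustrative sketch only: device types, field names, and scaling factors
    # are hypothetical, not CenturyLink Cloud's actual schema.
    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class MetricSample:
        """A raw sample as it arrives from a collector."""
        device_id: str
        device_type: str      # e.g. "switch", "hypervisor"
        metric_name: str      # e.g. "bytes_out"
        raw_value: float
        poll_interval_s: int  # polling rate differs by device type
        timestamp: datetime

    # Per-device-type scaling factors (assumed values for illustration).
    SCALE = {
        ("switch", "bytes_out"): 1.0,         # already reported in bytes per interval
        ("hypervisor", "bytes_out"): 1024.0,  # reported in KiB per interval
    }

    def normalize(sample: MetricSample) -> dict:
        """Convert a raw sample into a common, rate-based record."""
        factor = SCALE.get((sample.device_type, sample.metric_name), 1.0)
        per_second = (sample.raw_value * factor) / sample.poll_interval_s
        return {
            "device_id": sample.device_id,
            "metric": sample.metric_name,
            "value_per_s": per_second,
            "ts": sample.timestamp.astimezone(timezone.utc).isoformat(),
        }

    if __name__ == "__main__":
        sample = MetricSample("hv-042", "hypervisor", "bytes_out", 2048.0, 300,
                              datetime.now(timezone.utc))
        print(normalize(sample))  # -> value_per_s of roughly 6990.5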

Panning for Gold: How we Sort, Filter, and Search Through our Data

All of these metrics result in an environment so busy that humans cannot easily keep track of what is taking place. One way we try to alleviate this is through the creation of a real-time dashboard intended to facilitate decision-making.

Effective decision support systems should support users’ existing search processes, should operate at human time scales, and should connect to external data when helpful. Doing this requires that the system allow users to see trends and patterns across a variety of factors, and the dashboard should highlight areas that need analyst attention. To start, presenting information from across the platform requires that it be well-organized: data has to conform to a set of shared schemas, and data from different devices, vendors, or polling formats needs to be standardized. Creating a dashboard that allows for exploration across various factors often requires connecting one data point to other metadata that relates to network topology, device hierarchies, or application deployments. Finally, the data needs to update in real time so that analysts can incorporate it within their existing search and discovery processes.
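
As a sketch of that metadata-joining step, the snippet below enriches a standardized metric record with network-topology metadata so a dashboard can group and filter by data center, rack, or device role. The lookup table and field names are invented for this example rather than taken from our actual data model.

    # Hypothetical topology metadata keyed by device_id; in practice this would
    # come from an inventory system rather than a hard-coded dict.
    TOPOLOGY = {
        "hv-042": {"data_center": "DC1", "rack": "R12", "role": "hypervisor"},
        "sw-007": {"data_center": "DC1", "rack": "R12", "role": "top-of-rack"},
    }

    def enrich(record: dict, topology: dict = TOPOLOGY) -> dict:
        """Join a normalized metric record with topology metadata so the
        dashboard can facet by data center, rack, or device role."""
        meta = topology.get(record["device_id"], {})
        return {**record, **meta}

    if __name__ == "__main__":
        normalized = {"device_id": "hv-042", "metric": "bytes_out",
                      "value_per_s": 6990.5, "ts": "2016-05-01T12:00:00+00:00"}
        print(enrich(normalized))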

Streaming Analytics and Whitewater Rafting

Facilitating analysts’ search process is helpful. However, analyst attention is valuable and the data is vast. For this reason, we also need a way to help analysts home in on areas that could indicate the presence of suspicious activity.

A popular technique for discovering patterns in data is machine learning. One of the weaknesses of the classic machine-learning paradigm is the fragility imposed by the requirement that the user have clearly labeled training data. In practice this is problematic for several reasons. First, the events you are trying to detect may never have taken place before, and they may differ from similar events that have taken place. Second, even in the presence of a clearly defined class hierarchy, it can still be difficult to decide how to assign a class label to a given set of data.

Certain techniques can help reduce the impact of these concerns. First, we can limit the scope of investigation to focus only on classes of clearly defined problems. Second, we can use business cases or external industry guidelines to decide how to identify the presence of a class label. For example, we can use existing guidance to define customer-impacting network events. However, these approaches only go so far, because certain events are not just unexpected; they may never have been seen before. This is what makes classic machine learning fragile here.

anomalies_count_histogram.png: The total number of anomalies for a given time window, alongside a histogram showing how those anomalies group together over that same period. This is useful as a first step of analysis to help analysts determine whether something is peculiar and, if so, where to focus their search.

The solution is to use streaming, unsupervised machine learning. This approach uses a technique called online learning, which is both kinetic and dynamic. Streaming machine learning is like whitewater rafting: you never know exactly where the current will take you, and to succeed you have to be both flexible and strong. We create a small algorithm for each unique stream of data. This personalized algorithm captures the historic state of the stream and uses micro-batches to update on the fly. We use it to predict what the next value should be and then check whether the actual value violates our expectations. If it does, we flag it as anomalous. We still add business and domain logic into the monitoring process, but this approach avoids the need for a single, global threshold. Instead, it gives us a useful, universal baseline for all metrics.
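
The post does not name the exact algorithm, so the sketch below stands in with a simple per-stream model based on an exponentially weighted moving average and variance: each micro-batch updates the model on the fly, the model predicts a band for the next value, and anything outside that band is flagged as anomalous. The decay rate and band width are assumed values chosen for illustration.

    import math

    class StreamModel:
        """Keeps the historic state of one metric stream and flags values
        that fall outside the predicted range. Illustrative stand-in, not
        the production algorithm."""

        def __init__(self, decay: float = 0.05, width: float = 3.0):
            self.decay = decay  # how quickly older history is forgotten
            self.width = width  # band width in standard deviations
            self.mean = None
            self.var = 0.0

        def update(self, batch):
            """Consume one micro-batch, returning (value, is_anomaly) pairs."""
            results = []
            for x in batch:
                if self.mean is None:  # first observation seeds the model
                    self.mean, anomalous = x, False
                else:
                    std = math.sqrt(self.var)
                    lower = self.mean - self.width * std
                    upper = self.mean + self.width * std
                    anomalous = not (lower <= x <= upper)
                    # Online (EWMA-style) update of the mean and variance.
                    diff = x - self.mean
                    self.mean += self.decay * diff
                    self.var = (1 - self.decay) * (self.var + self.decay * diff * diff)
                results.append((x, anomalous))
            return results

    if __name__ == "__main__":
        model = StreamModel()
        model.update([100 + i % 3 for i in range(50)])  # steady baseline traffic
        for value, flagged in model.update([100, 101, 250, 99]):
            print(value, "ANOMALY" if flagged else "ok")

A production system would also need to decide whether anomalous points should be allowed to update the model or be held out; for simplicity, the sketch updates on every point.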

anomalies_devices_DC1_generic.png: An example of the anomaly detection framework in use. The blue line represents the observed behavior of a device, the red line represents the predicted behavior, and the grey lines represent our prediction boundaries. The yellow and red dots are anomalies.

Human-machine Interaction

No discussion is complete without some comments on the business value. As mentioned above, this algorithmic technique helps narrow the bandwidth of analyst attention needed to examine peculiar events. The anomaly detection framework acts as a giant sieve that increases the signal-to-noise ratio of our alerts. Because it operates in real time, it helps reduce the time between occurrence and discovery. Because the real-time, faceted dashboard allows analysts to explore the data, it can also reduce the time it takes for analysts to mitigate and resolve any potential issues. Even better: since most customer-impacting events can take 30-40 minutes to manifest, this early detection technique can help prevent such events from occurring at all. This, we believe, will help us continue to delight our customers.
