The IoT landscape is complex and is becoming more involved and broad by the day. To put it simply, the Internet of Things (IoT) describes a technological landscape where many interconnected devices simultaneously create signals, track and record data, and produce reports and logs. These devices can be consumer-driven instruments such as cell phones, fitness trackers, smart home appliances, etc. They can also be industrial devices; for example, wind turbines, energy grid devices, freight GPS applications, or cloud computing devices. Both consumer and industrial devices operating within the IoT produce data reports that are used to drive behavior and business, especially in the cloud.
As such, cloud computing is both a curator and a creator of these large data sets. As one of the world’s largest cloud providers, CenturyLink Cloud actively monitors millions of incoming signals from the devices that constitute this cloud ecosystem.
As with other marketplaces, cloud computing benefits from the insights that big data solutions can provide. There are several classes of internal business problems cloud retailers face. At a high level these are:
- Capacity shortfall: Are we about to run out of capacity on a particular platform in the near future? Do we have enough capacity for 6 months or one year from now?
- Attacks: Are there any signs that machines on our network are compromised? Are there external attacks on localized regions of the network?
- Software configuration: Did the new deployment accidentally cause certain hardware to become unreachable? Is there a bug in the firmware on one of the devices?
- Hardware failure: Is a device or a part of a device about to fail? Were any of the optical cables recently bent too far or placed too close to a fluorescent light?
Combating these issues using data is both a technical and an analytical challenge. As the number of devices and their interconnections continues to grow, it becomes increasingly important to build a business fabric that is capable of monitoring these devices and ensuring that the business can adapt in real time. Developing a strategy for periodically polling network-connected devices, and combining that strategy with the ability to react to the results of the polls, will become crucial to successful deployments of physical infrastructure. A full solution will enable a business to predict, detect, and act. To get to this point, businesses will need to solve the following problems in stages:
- Build a monitoring system that offers real-time, situational awareness.
- Build a detection platform that performs real-time classification of signals that enable root-cause analysis and decision-making.
- Build a prediction platform that offers guidance for real-time actions.
The remainder of this post will focus on the first item: how to build a monitoring system.
Before a business can create a monitoring solution for its IoT landscape, it should consider the possible real-time and quality assurance pitfalls that exist in its environment. When it comes to building a real-time monitoring system targeted at offering situational awareness, these pitfalls relate to data paradoxes, bandwidth limitations, and the need for resiliency.
The biggest pitfall relates to the paradoxes that come from analyzing and aggregating many disparate streams of data. Common examples include the following:
- Spotlight paradox: We can only manage what we measure. If we are not recording data from a critical device, we will not be informed of cases where it becomes unstable.
Cyclical variation in daily temperature readings illustrates the importance of sampling temperature throughout the day. Image Source: Deming, W.E. (1994). The New Economics for Industry, Government, Education - 2nd Edition. (p. 35)
- Sample rate paradox: The more we measure, the more we measure. If you observe a device once per day, it may never seem unstable. If you observe it every millisecond, it will seem incredibly variable. Choosing the correct rate matters.
- Simpson’s paradox: The trend of the aggregate is not the trend of the parts. A majority of interfaces on a device may be unstable at a given point in time, whereas the overall device appears to be operating normally.
At an individual level, increased consumption of alcohol reduces IQ; however, if plotted in aggregate, it appears as though alcohol intake increases IQ. This takes place because in this example the higher IQ individuals consume more alcohol overall than those individuals with lower IQs. Image Source: Kievit, R.A., Frankenhuis, et al. (2013). Simpson’s paradox in psychological science. Frontiers in Psychology
- Bandwidth uncertainty: We cannot observe bandwidth without affecting it. The more simultaneous questions we ask the platform, the more these questions start to consume a tangible portion of the overall resources.
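The sample rate paradox is easy to reproduce with a toy signal. The sketch below assumes a hypothetical device temperature that follows a 24-hour sine cycle; the function names and the numbers are illustrative only and do not come from a real platform. It compares the spread of readings taken once per day with readings taken once per hour.

```python
import math


def temperature(hour: float) -> float:
    """Hypothetical device temperature in Celsius with a 24-hour cycle."""
    return 40.0 + 10.0 * math.sin(2 * math.pi * hour / 24.0)


def sample(period_hours: float, days: int = 7) -> list:
    """Read the signal every `period_hours` hours for `days` days."""
    count = int(days * 24 / period_hours)
    return [temperature(i * period_hours) for i in range(count)]


def spread(readings: list) -> float:
    """Difference between the largest and smallest reading."""
    return max(readings) - min(readings)


daily = sample(24.0)   # one reading per day, always at the same hour
hourly = sample(1.0)   # one reading per hour

print("daily spread: ", spread(daily))
print("hourly spread:", spread(hourly))
```

Because the daily reading always lands on the same point of the cycle, the device looks perfectly stable, while the hourly series exposes the full swing between the daily low and high.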
Resiliency, Auto-Discovery, and the Bat Signal
One of the primary goals of monitoring systems is to provide situational awareness during critical periods of instability. However, if your monitoring system does not have resiliency and fault tolerance built in, then it risks failing right when you need it the most. For this reason, these deployments are often multi-tiered. A singleton deployment works well for a user interface from which others can access the results of the monitoring system. However, the collectors need to be distributed so that any localized issues within the network do not impact the ability to receive notice of network issues.
Auto-discovery is critical for ensuring the comprehensiveness of the monitoring solution. Successful auto-discovery takes three steps. First, make sure all of the devices conform to some external set of standards. An example of this is ensuring that devices follow the Request for Comments (RFC) standards published by the Internet Engineering Task Force (IETF).
Second, make certain that the collector processes or the agents poll using a standard schema type. Schema standardization comes from industry standards and from best practices of the polling software that you use. For example, plan how to handle messages from sources such as the Simple Network Management Protocol (SNMP), NetFlow, and the Link Layer Discovery Protocol (LLDP). Third, examine the network traffic to see if there are any unknown devices present. If there are, bring them into the fold and start polling them using the correct schema for the standards on the device.
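The step of finding unknown devices can be sketched as a comparison between the addresses seen in traffic and the devices already in inventory. Everything here is an assumption made for illustration: the inventory layout, the `SCHEMA_PREFERENCE` order, and both function names are hypothetical, and a real deployment would obtain the observed addresses and supported protocols from its own tooling.

```python
from typing import Dict, List, Optional, Set

# Hypothetical preference order for choosing a polling schema.
SCHEMA_PREFERENCE = ["snmp", "lldp", "netflow"]


def find_unknown_devices(observed: Set[str], inventory: Dict[str, str]) -> List[str]:
    """Return addresses seen in traffic that are missing from the inventory."""
    return sorted(observed - set(inventory))


def choose_schema(supported: List[str]) -> Optional[str]:
    """Pick the first preferred schema that the device supports, if any."""
    for protocol in SCHEMA_PREFERENCE:
        if protocol in supported:
            return protocol
    return None


inventory = {"10.0.0.1": "snmp", "10.0.0.2": "lldp"}
observed = {"10.0.0.1", "10.0.0.2", "10.0.0.7"}

for address in find_unknown_devices(observed, inventory):
    # Assumed probe result; a real system would query the device.
    schema = choose_schema(["netflow", "lldp"])
    print(address, "->", schema)
```

A device that supports none of the known schemas returns `None`, which is a useful signal that it needs manual review before it can be polled.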
The polling strategy is one of the most critical parts of an IoT data strategy. It is also a common pitfall for practitioners, partly because it seems so straightforward. Polling needs to take place at the appropriate rate for each type of signal, and finding that rate is harder than it looks. If we poll too frequently, the monitoring itself risks becoming a substantial user of the overall cloud resources. If we poll too infrequently, we can miss critical variations. The solution is to create an adaptive polling rate that learns the correct polling rate for each data stream.
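One simple way to approximate an adaptive polling rate is a multiplicative rule: poll faster when a reading moves, and back off when it holds steady. The sketch below is a heuristic stand-in rather than a learned model, and the tolerance, bounds, and function name are assumptions chosen for illustration.

```python
def next_interval(current: float, previous_value: float, new_value: float,
                  tolerance: float = 0.05,
                  min_interval: float = 1.0,
                  max_interval: float = 300.0) -> float:
    """Return the next polling interval in seconds for one data stream.

    Halve the interval when the reading changed by more than `tolerance`
    (relative to the previous value); otherwise lengthen it by 50 percent.
    The result is clamped to [min_interval, max_interval].
    """
    baseline = max(abs(previous_value), 1e-9)  # avoid division by zero
    change = abs(new_value - previous_value) / baseline
    proposed = current / 2.0 if change > tolerance else current * 1.5
    return min(max_interval, max(min_interval, proposed))


# Example: a stream that is quiet, then jumps, then settles again.
interval = 60.0
readings = [100.0, 100.0, 101.0, 130.0, 131.0]
for previous, new in zip(readings, readings[1:]):
    interval = next_interval(interval, previous, new)
    print(f"{previous} -> {new}: poll again in {interval} s")
```

Each stream keeps its own interval, so a noisy interface is polled often while a stable one drifts toward the maximum interval and consumes very little bandwidth.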
Polling needs to have a centralized management strategy while operating a set of distributed processes across thousands of devices. Polling can also augment the collection and analysis phases. In order to further scale the entire data platform, researchers and practitioners are increasingly pushing the intelligence of their platforms from the core to the edges. Distributed monitoring and polling allow selective message passing based on local criteria. This helps reduce the volume of data sent over the wire.
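A deadband filter is one common example of selective message passing at the edge: the collector forwards a reading only when it differs enough from the last value it sent. This is a minimal sketch, and the class name, threshold, and sample readings are all hypothetical.

```python
class DeadbandFilter:
    """Forward a reading only if it moved at least `threshold` since the last send."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.last_sent = None  # nothing has been forwarded yet

    def should_send(self, value: float) -> bool:
        if self.last_sent is None or abs(value - self.last_sent) >= self.threshold:
            self.last_sent = value
            return True
        return False


def filter_stream(readings, threshold):
    """Return the subset of readings an edge collector would forward."""
    deadband = DeadbandFilter(threshold)
    return [reading for reading in readings if deadband.should_send(reading)]


readings = [10.0, 10.1, 10.2, 12.5, 12.6, 9.0]
forwarded = filter_stream(readings, threshold=1.0)
print(f"forwarded {len(forwarded)} of {len(readings)} readings: {forwarded}")
```

The threshold is the local criterion. Setting it per metric lets the central system decide how much detail each stream deserves without handling every raw reading.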
Harvest the Roll-up
To prevent aggregation paradoxes, it's important to identify the unique streams of data. For example, many network devices have thousands of individual sub-interfaces. Track each sub-interface separately and then construct a high-level view of the device from the sum of these streams. This approach also allows us to focus on the precision and validity of each measure.
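As a rough illustration, the roll-up below keeps the per-interface results alongside the device-level aggregate so that one cannot mask the other. The interface names, counters, and the 1 percent instability threshold are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DeviceSummary:
    total_interfaces: int
    unstable_interfaces: List[str]
    aggregate_error_rate: float


def roll_up(interfaces: Dict[str, Tuple[int, int]],
            unstable_above: float = 0.01) -> DeviceSummary:
    """Summarize a device from its sub-interfaces.

    `interfaces` maps an interface name to (error_count, packet_count).
    """
    unstable = sorted(
        name for name, (errors, packets) in interfaces.items()
        if packets > 0 and errors / packets > unstable_above
    )
    total_errors = sum(errors for errors, _ in interfaces.values())
    total_packets = sum(packets for _, packets in interfaces.values())
    rate = total_errors / total_packets if total_packets else 0.0
    return DeviceSummary(len(interfaces), unstable, rate)


# Two low-traffic interfaces are failing; one busy interface is healthy.
summary = roll_up({
    "eth0.1": (5, 100),
    "eth0.2": (4, 100),
    "eth0.3": (10, 1_000_000),
})
print(summary)
```

In this example two of the three interfaces are unstable, yet the aggregate error rate stays far below the threshold because the healthy interface carries almost all of the traffic. Reporting both views makes the discrepancy visible.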
In addition to intelligent aggregation, harvest the intelligence of domain experts to define smart alerting policies for each combination of device and metric. This enables explicitly defined heuristics built on certain key performance indicators (KPIs). Increasing the precision of each measure will make it easier for domain experts to recommend these heuristics.
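One lightweight way to capture such heuristics is a lookup table keyed by device type and metric. The table below is purely hypothetical: the device types, metric names, and limits are placeholders for whatever your domain experts recommend.

```python
# Hypothetical expert-defined upper limits, keyed by (device type, metric).
ALERT_POLICIES = {
    ("router", "cpu_percent"): 90.0,
    ("router", "packet_loss_percent"): 1.0,
    ("switch", "temperature_celsius"): 70.0,
}


def should_alert(device_type: str, metric: str, value: float) -> bool:
    """Return True when a reading exceeds the limit defined for its pair.

    Pairs without a policy never alert, which keeps unknown metrics quiet
    until an expert defines a limit for them.
    """
    limit = ALERT_POLICIES.get((device_type, metric))
    return limit is not None and value > limit


print(should_alert("router", "cpu_percent", 95.0))
print(should_alert("switch", "temperature_celsius", 55.0))
```

Keeping the policies as data rather than code means an expert can adjust a limit without touching the collectors.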
Increasingly, bandwidth, not storage space, is becoming the largest limiting factor for deployment of IoT monitoring solutions. While a distributed architecture offers the monitoring system resiliency, it comes at the cost of bandwidth consumption. This is a challenge because of the number of messages and because of the number of unique streams of data. Combat this from both angles. Reduce the number of messages by optimizing the polling rate and by performing analysis at the edge. Reduce the number of simultaneous connections by batching the messages from a given interval together and sending them contiguously. This avoids creating unnecessary connections while also benefiting from protocols that enable the transfer of larger packets.
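Batching by interval can be sketched in a few lines. The example assumes each message is a `(timestamp, payload)` pair and that batches are serialized as JSON; the field names and the 10-second window are illustrative choices, not a prescribed wire format.

```python
import json


def batch_by_interval(messages, interval_seconds: int):
    """Group (timestamp, payload) pairs into fixed windows.

    Returns one JSON string per window, ordered by window start, so each
    window can be sent over a single connection.
    """
    buckets = {}
    for timestamp, payload in messages:
        window_start = int(timestamp // interval_seconds) * interval_seconds
        buckets.setdefault(window_start, []).append(payload)
    return [
        json.dumps({"window_start": start, "readings": buckets[start]})
        for start in sorted(buckets)
    ]


messages = [
    (0.5, {"interface": "eth0.1", "errors": 2}),
    (3.0, {"interface": "eth0.2", "errors": 0}),
    (12.0, {"interface": "eth0.1", "errors": 1}),
]
batches = batch_by_interval(messages, interval_seconds=10)
print(f"{len(messages)} messages sent as {len(batches)} batches")
```

The trade-off is latency: a longer window means fewer connections but a longer wait before the central system sees a reading, so the window should stay shorter than the reaction time you need.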
The situational awareness and the business intelligence that this approach can provide are only the first layer. As mentioned, prediction in the IoT for the cloud landscape is a three-stage process. In the near future, we will discuss how we put real-time, intelligent alerts on top of monitoring to help root-cause analysis and empower decision-making.
For those of you working with the IoT, the CenturyLink Platform is the perfect place to build your applications. Our global network provides superior performance for your IoT sensors and cloud applications. Our managed services handle operating system and application operations – allowing you to focus on your more important objectives. Learn more about CenturyLink Cloud.