An increasing proportion of data today is generated by automated processes, sensors and devices—collectively called the Internet of Things (IoT).
Inexpensive hardware, widespread access to communication networks, and decreased storage costs have led to billions of dollars in recent commercial investment and a projected 40% year-over-year growth in overall data volume.
This increase in machine-generated data leads to a critical challenge: while data volumes rise, the capacity of human attention remains fixed. It is becoming impossible to understand IoT application behavior by manually inspecting the data. There is simply too much of it, and it will keep arriving faster, from a growing number of devices and device types. As a result of this “attention gap,” many important behaviors escape notice, even today.
MacroBase is a new kind of analytic monitoring engine designed to prioritize human attention in large-scale IoT data streams. MacroBase automatically highlights interesting behaviors in IoT data and produces explanations for these behaviors that can be used for tasks including diagnostics, alerting, and root cause analysis. MacroBase performs statistically-informed analytic monitoring of IoT data streams by identifying deviations within streams and generating potential explanations.
Over the past several years, “Big Data” engines like Spark have provided programmers with a low-level substrate for authoring scale-out data processing pipelines. MacroBase is designed to operate at the next level of abstraction, providing an end-to-end platform and a toolkit of core analytic monitoring operators that allows domain experts to quickly become productive with their data.
As such, MacroBase is the first analytics engine to combine streaming outlier detection and streaming explanation operators, allowing cross-layer optimizations that deliver order-of-magnitude speedups over existing, primarily non-streaming alternatives. The engine is extensible and fast, allowing easy adaptation to new application domains and sensor types.
Specifically, MacroBase can deliver accurate results at speeds of up to 2M events per second per query on a single core. The system has already found interesting, previously unknown behaviors and trends in production data in domains including mobile telematics, datacenter operations, electrical utilities, and satellite imaging.
For example, the Android mobile operating system ecosystem currently includes over 24,000 distinct device types. Given platform-specific differences in sensors, batteries, and processing capabilities, is a given mobile application operating correctly on each? In our experience with production IoT deployments, pernicious behaviors lurk in the combinatorial explosion of hardware-firmware-software configurations. Analytic monitoring can illuminate the combinations that matter.
Overcoming IoT Data Management Challenges
IoT data streams (1) are immense in volume, meaning that simply reporting all potentially interesting behaviors may overwhelm users; (2) contain time-sensitive data, meaning analyses must often be performed in real-time and must adapt to changes in the underlying time data; and (3) contain heterogeneous data types from a variety of sensors, meaning a tool must be flexible and permit users to efficiently express the types of behaviors that matter for them, ideally without requiring them to become experts in statistics, machine learning, and data processing.
Today, many analysis solutions address one or two of these challenges. In developing MacroBase, we interviewed systems operators and analysts in several IoT domains (including data center operations, mobile applications, and industrial manufacturing).
Broadly, at scale, the state of the art relies on static alerts and primarily manual root-cause analysis, which either fails to identify many kinds of behaviors (e.g., systemic inefficiency) or fails to identify them in a timely manner. Moreover, there is very little systems infrastructure that allows users to express monitoring queries at a high level; instead, they must write their own streaming analysis operators, which is a tall order.
MacroBase is designed to allow the domain expert to do more with less while executing at scale over streaming data.
To enable this functionality, MacroBase is guided by two architectural principles:
- Allow “pay as you go” deployment. By default, MacroBase executes an operator pipeline that provides results out of the box, without the need for labels or domain knowledge. Subsequently, users can tailor queries by encoding information about their domain, both by providing supervised feedback (“show me more/less like this”) and by authoring domain-specific feature extraction operators (e.g., image convolution). This functionality makes MacroBase immediately easy to use while still providing advanced users “power tools” that allow more sophisticated analyses and automation of manual tasks.
- Optimize for laziness, or perform as little work as possible on each data point. In many applications, the most valuable information is contained in a small set of data; MacroBase aggressively prunes incoming streams to find this small set of data that matters. Following this principle leads to a variety of new, algorithmic improvements in MacroBase’s core operators.
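To make the pruning idea concrete, here is a minimal sketch of an unsupervised filter that keeps only the points that deviate strongly from the bulk of a metric. It assumes a robust z-score built from the median and the median absolute deviation (MAD); the function name, the threshold of 3.0, and the sample readings are illustrative assumptions, not MacroBase's actual operators or defaults.

```python
from statistics import median


def mad_outliers(values, threshold=3.0):
    """Return (index, value, score) for points with a robust z-score above threshold.

    Hypothetical helper for illustration only; not part of MacroBase.
    """
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All points (or at least half of them) are identical; nothing to score against.
        return []
    # 1.4826 rescales the MAD so it is comparable to a standard deviation
    # when the data are roughly normally distributed.
    scale = 1.4826 * mad
    flagged = []
    for i, v in enumerate(values):
        score = abs(v - med) / scale
        if score > threshold:
            flagged.append((i, v, score))
    return flagged


# Made-up sensor readings: one value is far from the rest.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0, 10.1]
flagged = mad_outliers(readings)
print([(i, v) for i, v, _ in flagged])
```

Only the flagged points would need to be passed to later, more expensive stages, which is the sense in which such a filter does little work per data point. A streaming variant would have to maintain the median and MAD over a sample or window rather than over the full list.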
A Flexible, Extensible Platform
In MacroBase, our goal is to provide a flexible platform for executing analytic monitoring queries in a wide variety of current and future IoT domains while leveraging specialized analytic monitoring operators for improved efficiency and accuracy.
As noted above, MacroBase executes analytic monitoring queries using a pipeline of domain-independent, unsupervised streaming detection and explanation operators that delivers results with limited end-user intervention. Users can further tailor their queries, using a set of “expert” interfaces; for example, a user could choose to employ specialized time-series explanation operators to render an alerting dashboard that illustrates important time-varying behavior leading to an equipment failure.
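As a rough illustration of what an explanation step might compute once points have been labeled as outliers or inliers, the sketch below ranks attribute values by relative risk: how much more likely a point is to be an outlier when it carries a given attribute value than when it does not. The choice of relative risk, the function name, and the device labels are assumptions made for this example and should not be read as a description of MacroBase's explanation operators.

```python
from collections import Counter


def risk_ratios(outlier_attrs, inlier_attrs):
    """Rank attribute values by relative risk of being an outlier.

    Hypothetical helper for illustration only; not part of MacroBase.
    """
    out_counts = Counter(outlier_attrs)
    in_counts = Counter(inlier_attrs)
    total_out, total_in = len(outlier_attrs), len(inlier_attrs)

    ratios = {}
    for attr, out_with in out_counts.items():
        in_with = in_counts.get(attr, 0)
        out_without = total_out - out_with
        in_without = total_in - in_with

        # P(outlier | has attribute)
        p_with = out_with / (out_with + in_with)
        # P(outlier | lacks attribute); zero if no such points are outliers.
        denom = out_without + in_without
        p_without = out_without / denom if denom else 0.0

        ratios[attr] = float("inf") if p_without == 0 else p_with / p_without

    # Highest relative risk first.
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


# Made-up labels: most outlying readings come from one device model.
outliers = ["deviceA"] * 8 + ["deviceB"] * 2
inliers = ["deviceA"] * 2 + ["deviceB"] * 88
for attr, ratio in risk_ratios(outliers, inliers):
    print(attr, round(ratio, 3))
```

A ratio well above 1 suggests that the attribute value is over-represented among outliers and is worth a human's attention; a ratio below 1 suggests the opposite. A production system would also want a minimum-support cutoff so that very rare attribute values do not dominate the ranking.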
Ongoing research efforts with MacroBase include feature extraction and fusion over heterogeneous data sources such as images, video, and sensor data; fast techniques for automatic dimensionality reduction and multi-modal density estimation; and new data summarization techniques for highlighting behaviors in time-series and visual datasets. This research is driven by real problems we’ve encountered in real-world datasets and production IoT applications.
In summary, MacroBase:
- Prioritizes human attention by quickly finding important behavior in IoT data streams
- Combines domain-specific features with high-performance outlier detection and data summarization for improved result quality and performance
- Offers a new kind of platform for building high-performance real-time data products, providing higher-level, more specialized interfaces than traditional stream processors and a more flexible architecture than existing ad-hoc detection engines
MacroBase is available as open source here. We hope you’ll try it.
MacroBase is one of several data analytics engines that are integrated as part of BigDAWG, the ISTC for Big Data’s polystore database architecture. The goal of BigDAWG is to provide a repeatable, efficient, and scalable system for data integration, consolidation, and querying across multiple datasets. Within BigDAWG, MacroBase looks for outlying behaviors in a given dataset stored in PostgreSQL, based on key application metrics. In a recent hackathon, MIT graduate student Arsen Mamikonyan used MacroBase with the BigDAWG middleware to analyze oceanographic data.