Rethinking Streaming: Correct State Matters!

by Nesime Tatbul, Intel Labs and MIT CSAIL; Kristin Tufte, Portland State University; and Stan Zdonik, Brown University


Stream processing has largely been thought of as real-time analytics. Data enters the system as streams, and analytic functions (e.g., aggregates) are computed on the fly to reduce result latency, before the data ever reaches the warehouse. Data warehouses, on the other hand, are largely read-mostly/write-seldom. Once the warehouse is populated, it is used in read-only mode (OLAP) to answer questions until the next refresh cycle is initiated. The refresh typically happens at a very coarse grain, as large batches of new tuples are ingested on schedule by the warehouse. With this model, there is no need for shared, mutable state in the warehouse, since all updates arrive from a single place (e.g., an ETL tool or a bulk loader).

It turns out, however, that there are OLTP workloads that can benefit from streaming as well. This requires that streaming systems be able to manage shared, mutable state, since there may be multiple streaming applications at work simultaneously. Thus, in order to service these applications, we need a way to guarantee correctness in the face of concurrent updates. The S-Store system addresses this by introducing transactions to a streaming environment [4, 5]. It also guarantees ordered processing of streams and dataflow graphs, as well as exactly-once processing. The rest of this blog post will elaborate on these ideas.

“…there are OLTP workloads that can benefit from streaming as well. This requires that streaming systems be able to manage shared, mutable state, since there may be multiple streaming applications at work simultaneously. … To service these applications, we need a way to guarantee correctness in the face of concurrent updates.”

We have recently worked with a medical dataset called MIMIC II [2, 3], which contains sensor waveforms from patient bedside devices as well as metadata about patients and text notes from doctors and nurses (including prescriptions), for about 26,000 Intensive Care Unit (ICU) admissions at Boston’s Beth Israel Deaconess Hospital. We begin by describing an example use case scenario inspired by this dataset.

In an Intensive Care Unit, tracking patient medications is critical. Studies estimate that patients are injured by preventable “adverse drug events” in hospitals between 380,000 and 450,000 times per year [1]. In our dataset, we have observed that medication administration may be the result of multiple different types of events. As shown in Figure 1, an automated system processing streams of waveform data from patient monitoring devices such as ECG sensors may trigger an emergency alert, which may indicate a need for medication. Furthermore, a doctor may periodically visit a patient and decide to administer medicine. In the MIMIC II data set, all patient medication data is stored in the medication events (MedEvents) table. To keep track of a patient’s medications, multiple separate dataflow graphs ─ streaming alert computation (consisting of three stored procedures SP1, SP2, and SP3) and scheduled doctor visits (represented by stored procedure SP4) ─ will need to read and update this shared MedEvents table. These table accesses must therefore be protected with transactions in order to avoid injuring patients. Interestingly, we observe that a similar pattern also occurs in the transportation domain, where multiple dataflow graphs within an application, such as Variable Message Signs (VMS) and Adaptive Traffic Signal Control based on real-time traffic and road conditions, share and update common tables while also processing streaming inputs.
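To make the hazard concrete, here is a minimal Python sketch (using SQLite purely as a stand-in store; the table name MedEvents comes from the dataset, but the schema, column names, and `record_medication` helper are all hypothetical simplifications, not S-Store code). It shows why the two dataflows must serialize through transactions: the check-then-insert on the shared table must be atomic, or the alert path and the doctor-visit path could each conclude the drug has not yet been given.

```python
import sqlite3

# Hypothetical, simplified schema; the real MIMIC II MedEvents table is far richer.
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit; we manage txns
conn.execute(
    "CREATE TABLE MedEvents (patient_id INTEGER, drug TEXT, dose_mg REAL, source TEXT)"
)

def record_medication(conn, patient_id, drug, dose_mg, source):
    """Atomically check the shared MedEvents table and insert one medication event.
    Both the alert dataflow (SP3) and the doctor-visit procedure (SP4) would
    funnel through transactional logic like this."""
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")  # take the write lock so concurrent writers serialize
    try:
        cur.execute(
            "SELECT COUNT(*) FROM MedEvents WHERE patient_id=? AND drug=?",
            (patient_id, drug),
        )
        already_given = cur.fetchone()[0] > 0
        if not already_given:
            cur.execute(
                "INSERT INTO MedEvents VALUES (?,?,?,?)",
                (patient_id, drug, dose_mg, source),
            )
        cur.execute("COMMIT")
        return not already_given  # True iff this call actually administered the drug
    except Exception:
        cur.execute("ROLLBACK")
        raise

# One dataflow medicates on a waveform alert, another on a doctor's visit;
# only the first insert for the same (patient, drug) succeeds.
record_medication(conn, 42, "heparin", 5000, "alert")   # administers the dose
record_medication(conn, 42, "heparin", 5000, "doctor")  # sees the prior event, declines
```

Without the `BEGIN IMMEDIATE`/`COMMIT` bracket, the two writers could interleave between the SELECT and the INSERT and double-dose the patient, which is exactly the interference ACID transactions rule out.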

Figure 1: Multiple Streaming Dataflow Graphs Sharing State: A MIMIC-based Example [2, 3]


Correct operation of the underlying data processing system is key in applications like the one above, where multiple computations manipulate and share decision-critical data. The problem is well studied in traditional database systems, and is commonly addressed via transactions with ACID guarantees, which essentially protect the database state against interference from concurrent transactions as well as transaction failures. In applications that involve both shared, mutable state and streaming dataflow graphs, ACID is still fundamental (e.g., to protect the MedEvents table in our ICU example). However, there are additional correctness guarantees that are unique to streaming and are not captured by ACID:

  • Ordered execution guarantees: Ordering is intrinsic to streaming workloads. For example, in the alert dataflow, for each incoming patient waveform, the three transactions in the dataflow (SP1, SP2, SP3) must be executed in the given order. Likewise, waveforms of a patient must be processed in the order of their occurrence to track the changes in the patient’s status properly.
  • Exactly-once processing guarantees: To guard against failures, streaming systems typically replicate streaming state and replay it for recovery in case of a failure. It is important that the replay does not result in data loss or duplication. In our example, if a patient waveform is processed more than once, it may trigger the alert multiple times, leading to medication overdose. On the other hand, if a waveform is lost, the alert may never be detected, causing the patient to miss her medication.
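These two guarantees can be illustrated with a small Python sketch. This is not S-Store's mechanism, just a toy model under stated assumptions: `sp1`/`sp2`/`sp3` are placeholder stored procedures for the alert dataflow, each tuple carries a unique id, and the set of processed ids stands in for durable recovery metadata.

```python
# Placeholder stored procedures for the alert dataflow (hypothetical logic).
def sp1(t): t["filtered"] = True; return t          # e.g., noise filtering
def sp2(t): t["score"] = 0.9; return t              # e.g., anomaly scoring
def sp3(t): t["alert"] = t["score"] > 0.8; return t # e.g., alert decision

PIPELINE = [sp1, sp2, sp3]  # ordered execution: SP1 -> SP2 -> SP3, never reordered

processed_ids = set()  # durable in a real system; a plain set here
results = []

def process(stream):
    for tup in stream:                 # tuples consumed in arrival order
        if tup["id"] in processed_ids:
            continue                   # replayed duplicate after a failure: skip
        for sp in PIPELINE:            # run the whole dataflow, in order, per tuple
            tup = sp(tup)
        processed_ids.add(tup["id"])   # mark done only after the full dataflow ran
        results.append(tup)

waveforms = [{"id": 1, "score": 0.0}, {"id": 2, "score": 0.0}]
process(waveforms)
process(waveforms)  # replay after a simulated crash: no duplicate alerts emitted
```

The per-tuple inner loop enforces the dataflow order, and the id check on replay gives exactly-once output; in a real engine both the marking and the dataflow's state changes would have to commit atomically, which is precisely where the transactional guarantees above come in.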

The S-Store system, developed at the ISTC for Big Data, provides all three of these correctness guarantees (i.e., ACID, ordering, exactly-once) with high performance [4, 5]. Built on the H-Store main-memory OLTP system [6], S-Store tightly integrates streaming facilities into a transactional storage and querying engine, thereby supporting hybrid big-velocity workloads that consist of streams and transactions ─ not only with stronger correctness guarantees but also with better performance compared to the state of the art in streaming and OLTP.

“S-Store tightly integrates streaming facilities into a transactional storage and querying engine, thereby supporting hybrid big-velocity workloads that consist of streams and transactions ─ not only with stronger correctness guarantees but also with better performance compared to the state of the art in streaming and OLTP.”

More details about this novel system and its use as an integral part of the BigDAWG polystore system can be found in our recent publications [3, 4, 5].

References

[1] Institute of Medicine of the National Academies, “Preventing Medication Errors”

[2] PhysioNet, MIMIC II Data Set

[3] A. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, U. Cetintemel, V. Gadepally, J. Heer, B. Howe, J. Kepner, T. Kraska, S. Madden, D. Maier, T. Mattson, S. Papadopoulos, J. Parkhurst, N. Tatbul, M. Vartak, S. Zdonik, “A Demonstration of the BigDAWG Polystore System”, PVLDB 8(12), August 2015.

[4] J. Meehan, N. Tatbul, S. Zdonik, C. Aslantas, U. Cetintemel, J. Du, T. Kraska, S. Madden, D. Maier, A. Pavlo, M. Stonebraker, K. Tufte, H. Wang, “S-Store: Streaming Meets Transaction Processing,” PVLDB 8(13), September 2015.

[5] N. Tatbul, S. Zdonik, J. Meehan, C. Aslantas, M. Stonebraker, K. Tufte, C. Giossi, H. Quach, “Handling Shared, Mutable State in Stream Processing with Correctness Guarantees,” IEEE Data Engineering Bulletin, Special Issue on Next-Generation Stream Processing, to appear.

[6] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi, “H-Store: A High-Performance, Distributed Main Memory Transaction Processing System,” PVLDB 1(2), August 2008.
