Unfolding Physiological State and the Big Data Variety Challenge

By Tristan Naumann, MIT CSAIL*

In exploring better ways to handle the challenge of Big Data Variety, the ISTC for Big Data has been working with the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) II database, which is composed of de-identified medical records for more than 30,000 ICU patients.

The ISTC for Big Data works with many types of data in creating advanced computer and database technologies for handling Big Data. If you are reading this blog, then you are probably already familiar with satellite imagery, web and social media data, and genomics data. (Read the article.) Each of the corresponding data sets is an exemplar of the first two V’s of Big Data, Volume and Velocity, but perhaps leaves something to be desired with respect to the third V, Variety. MIMIC II, by contrast, brings together a great variety of dissimilar data, which makes it an excellent data set for exploring the challenges of Big Data Variety.

Modern electronic health records (EHRs) facilitate the flow of information among doctors, caregivers, and specialists. These records contain an increasingly large amount of data spanning a variety of formats and timescales.

Structured, relational formats are used to store data across many timescales. They are used for information that doesn’t generally change during the course of a stay in the intensive care unit (ICU), such as age and gender. Further, they are used for information that is typically recorded only once during the course of a stay, such as billing codes. Even for periodic events, such as intubation, the familiar relational format is employed. However, some data arrive too quickly for this format.

Structured, signal formats are used for most data that are recorded faster than one hertz. Electrocardiograms, for example, are typically stored at a sampling rate of at least 125 Hz. Continuous mean arterial blood pressure and fingertip plethysmograph data are recorded at a similar rate, but signal dropouts and other sources of noise (sometimes fingertip sensors just fall off!) can further complicate analysis. Even so, some data are too disparate to corral into either structured format.
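To make the dropout problem concrete, here is a minimal Python sketch that scans a sampled signal for long runs of missing values. It is an illustration only, not how MIMIC II itself is processed: the function name find_dropouts, the one-second threshold, and the convention that a missing sample is stored as None or NaN are all assumptions made for this example.

```python
import math

# Sampling rate taken from the 125 Hz figure in the text; real signals vary.
SAMPLE_RATE_HZ = 125


def find_dropouts(samples, min_gap_seconds=1.0, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (start, end) index pairs, end exclusive, for each run of
    missing samples lasting at least min_gap_seconds.

    Assumption: a missing sample is encoded as None or NaN. Actual
    waveform files may mark dropouts differently.
    """
    min_len = int(min_gap_seconds * sample_rate_hz)
    gaps = []
    run_start = None
    for i, value in enumerate(samples):
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        if missing and run_start is None:
            run_start = i  # a new run of missing samples begins here
        elif not missing and run_start is not None:
            if i - run_start >= min_len:
                gaps.append((run_start, i))
            run_start = None
    # Handle a run of missing samples that reaches the end of the signal.
    if run_start is not None and len(samples) - run_start >= min_len:
        gaps.append((run_start, len(samples)))
    return gaps
```

Called on one second of readings, followed by two seconds of NaN values and then more readings, the function reports a single gap covering the two missing seconds. Shorter gaps are ignored, on the assumption that they could be filled by interpolation.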

Unstructured, free-text formats contain the most descriptive and perhaps the most important information available to care staff. Patient histories, chief complaints, and other annotations are recorded in clinicians’ notes. Likewise, nursing notes document each patient’s current condition with periodic check-ins. Meanwhile, lab reports and radiology reports contain written accounts of tests ordered for diagnostic purposes.
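The three kinds of data described above can be pictured side by side in a short Python sketch. This is a hypothetical container invented for this post, not the MIMIC II schema: the class names IcuStay, Waveform, and Note, and all of their fields, are assumptions chosen only to show how one ICU stay mixes relational fields, sampled signals, and free text.

```python
from dataclasses import dataclass, field


@dataclass
class Waveform:
    """A sampled signal, such as an electrocardiogram (hypothetical layout)."""
    name: str
    sample_rate_hz: float
    samples: list

    def duration_seconds(self):
        return len(self.samples) / self.sample_rate_hz


@dataclass
class Note:
    """A free-text document, such as a nursing note or radiology report."""
    category: str
    text: str


@dataclass
class IcuStay:
    """One ICU stay, holding all three kinds of data together."""
    age: int                                            # relational: fixed for the stay
    gender: str                                         # relational: fixed for the stay
    billing_codes: list = field(default_factory=list)   # relational: recorded once
    waveforms: list = field(default_factory=list)       # signal data
    notes: list = field(default_factory=list)           # free text

    def summary(self):
        """Count how much of each kind of data the stay contains."""
        return {
            "structured_fields": 2 + len(self.billing_codes),
            "signal_seconds": sum(w.duration_seconds() for w in self.waveforms),
            "note_words": sum(len(n.text.split()) for n in self.notes),
        }
```

Even in this toy form, each part of the record calls for a different kind of analysis: queries for the relational fields, signal processing for the waveforms, and text processing for the notes.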

MIMIC II includes data from hospital ICU information systems, hospital archives and other external data sources. It was created as part of a Bioengineering Research Partnership involving an interdisciplinary team from academia (MIT), industry (Philips Medical Systems) and clinical medicine (Beth Israel Deaconess Medical Center), with the goal of developing and evaluating advanced ICU patient monitoring systems that will substantially improve the efficiency, accuracy and timeliness of clinical decision-making in intensive care.

Because MIMIC II contains only de-identified patient records, it is a data set that is open for general use. You can learn more about the MIMIC II data project here.

*Tristan Naumann is a Ph.D. candidate at MIT CSAIL, working with Dr. Peter Szolovits of the CSAIL Clinical Decision-Making Group.  Naumann earned his master’s and bachelor’s degrees in computer science from the Columbia University–Fu Foundation School of Engineering and Applied Science.  He has also held research, program management and product management positions at Microsoft, Google and Intel.
