Medical Data and the Learning Healthcare System

By Tristan Naumann, PhD Candidate, MIT

At the recent Intel Science and Technology Center for Big Data annual Research Retreat, Professor Peter Szolovits of MIT provided steps toward realizing the “learning healthcare system” described by the Institute of Medicine (IOM) of the National Academies.

Following Dr. Leo Celi’s keynote, which called attention to some of the challenges facing healthcare today, Professor Szolovits described the pronounced shift from knowledge to data that he has observed in medical informatics. Specifically, while tools were once created by asking clinicians to divulge their knowledge, modern tools leverage diverse sources of data: genomic, geographic, consumer-grade sensor, and so on.

To transform such data into a usable substrate for tasks such as building models, features must be extracted. This process is not without its challenges. Much of human knowledge is not like physics: you don’t simply learn an equation and generalize from it. Instead, it comes from myriad disparate facts. That is why “invariably, simple models and a lot of data trump more elaborate models based on less data,” as Peter Norvig and his co-authors observed in “The Unreasonable Effectiveness of Data.”

Unfortunately, this mechanism of learning presents a challenge. Traditional statistics is grounded in aggregate analysis, making it tempting to discard rarities and outliers in the data. Yet these are often where the most important insights can be found. Further complicating matters, such features live in extremely high-dimensional spaces, where human intuition often breaks down, as Brian Hayes describes in “An Adventure in the Nth Dimension.”
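One way this breakdown of intuition shows up can be seen in a small experiment (a toy sketch, not from the talk): as dimensionality grows, pairwise distances between random points concentrate around a common value, so “near” and “far” neighbors become almost indistinguishable.

```python
import math
import random

random.seed(0)

def mean_relative_spread(dim, n_points=60):
    """Sample random points in the unit cube and measure how much
    pairwise distances vary relative to their mean distance."""
    pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    dists = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
            dists.append(d)
    mean = sum(dists) / len(dists)
    return (max(dists) - min(dists)) / mean

# As dimensionality grows, the relative spread of distances shrinks:
# outlier detection and nearest-neighbor reasoning both get harder.
for dim in (2, 10, 100, 1000):
    print(dim, round(mean_relative_spread(dim), 3))
```

Running this shows the relative spread collapsing by an order of magnitude between 2 and 1000 dimensions, which is one concrete reason low-dimensional intuition about “outliers” fails to transfer.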

Another challenge is the tremendous variety of medical data. Common formats include:

  • tabular or structured (standardized)
  • signals (temporal)
  • narrative (free text)
  • questionnaires (semi-structured text responses)
  • imaging (simple vector-based to complex MRI)
  • environmental (other incidental data collected in great quantity)

Therefore, the best technical solutions will need to work seamlessly across multiple types of data.

Former MIT student Caleb Hug demonstrated in 2009 that, using MIMIC data, one could predict mortality surprisingly well, both in static comparison with existing acuity scores (e.g., SAPS II) and as daily acuity scores. Hug also obtained good results for other meaningful clinical events, such as pressor weaning, intra-aortic balloon pump weaning, onset of septic shock, and acute kidney injury. However, he used only the tabular data; much more could be done.
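The tabular-data approach can be illustrated with a minimal sketch. This is not Hug’s actual model and uses no real MIMIC data: the features (heart rate, lactate, age), the risk function, and the cohort are all synthetic, and a plain logistic regression is trained to predict mortality and scored by AUC.

```python
import math
import random

random.seed(42)

def sigmoid(z):
    return 1 / (1 + math.exp(-max(-30.0, min(30.0, z))))  # clamped for stability

# Synthetic "tabular" features standing in for ICU acuity inputs.
def make_patient():
    hr = random.gauss(90, 20)         # heart rate (synthetic)
    lactate = random.gauss(2.0, 1.0)  # serum lactate (synthetic)
    age = random.gauss(65, 12)
    # Assumed ground-truth risk: mortality odds rise with all three.
    died = 1 if random.random() < sigmoid(-6.0 + 0.02 * hr + 0.8 * lactate + 0.04 * age) else 0
    return [hr, lactate, age], died

data = [make_patient() for _ in range(2000)]
train, test = data[:1500], data[500:]

# Stochastic gradient descent on the logistic loss -- the "simple but
# tractable" end of the modeling spectrum.
w, b, lr = [0.0, 0.0, 0.0], 0.0, 0.0005
for _ in range(200):
    for x, y in train:
        g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

def auc(scored):
    """AUC via the rank-sum (Mann-Whitney) formulation: the probability
    that a random positive case is scored above a random negative case."""
    pos = [s for s, y in scored if y == 1]
    neg = [s for s, y in scored if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [(sum(wi * xi for wi, xi in zip(w, x)) + b, y) for x, y in data[1500:]]
print("held-out AUC:", round(auc(scores), 3))
```

Even this simple baseline discriminates well on the synthetic cohort; the point of the talk is that the remaining gains lie in adding signals and narrative text on top of such tabular baselines.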

There is an opportunity to couple this data variety with new tools to create better predictive models. Such models would incorporate the knowledge that patient state depends on pathophysiology (genetic complement, environmental exposures, and so on). Some models, such as POMDPs, capture this quite naturally but are computationally intractable; others, such as Cox proportional hazards, naive Bayes, and linear/logistic regression, perform slightly worse but are substantially easier to compute.

Likewise, abstractions can be introduced to help guide such models. Abstractions began with separating disease clusters, conditions, and symptomatic expression into different planes, and evolved into more complex mechanisms such as Radial Domain Folding, developed with postdoctoral researcher Rohit Joshi. In this abstraction, ranking domains by their relative severity corresponded nearly linearly with actual mortality, and the emphasis on clustering by organ system provided a natural fit for modeling severity.
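The near-linear relationship between domain rank and mortality can be illustrated with a toy sketch. To be clear, this is not the Radial Domain Folding algorithm; it only shows the general pattern of binning patients into coarse “domains” by a hypothetical severity feature and comparing domain rank with observed mortality.

```python
import random

random.seed(1)

# Toy illustration only -- NOT Radial Domain Folding. Each synthetic patient
# gets one hypothetical severity feature; mortality risk rises with severity.
def make_patient():
    severity = random.uniform(0.0, 10.0)
    died = random.random() < severity / 12.0
    return severity, died

patients = [make_patient() for _ in range(3000)]

# Bin patients into 5 coarse "domains" (a stand-in for learned clusters).
domains = {k: [] for k in range(5)}
for severity, died in patients:
    domains[min(4, int(severity // 2))].append(died)

# Ranking domains by severity tracks observed mortality almost linearly,
# echoing the pattern noted in the talk.
rates = [sum(d) / len(d) for _, d in sorted(domains.items())]
print([round(r, 2) for r in rates])
```

The printed mortality rates rise steadily with domain rank, which is the kind of sanity check that makes a severity-based abstraction credible.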

All this work suggests that, from an AUC standpoint, we are getting closer to 1. But the more important question is: where do we go from here?

The most important directions involve better leveraging the unstructured data found in signals and narratives. Tools will therefore need to support this kind of analysis and make it seamless to integrate with existing analyses. This means both finding the right abstractions to let clinicians reason at a higher level and providing the tools for them to do so.

Additional Resources:

Video: “How to Learn in the ‘Learning Health Care System’.” Peter Szolovits, MIT CSAIL, National Library of Medicine Lecture Series, November 5, 2014

Paper: “Unfolding Physiological State: Mortality Modelling in Intensive Care Units.” Marzyeh Ghassemi, Tristan Naumann, Finale Doshi-Velez, Nicole Brimmer, Rohit Joshi, Anna Rumshisky, Peter Szolovits. KDD 2014, August 2014

“Using Big Data to Predict Mortality in ICU Patients,” ISTC for Big Data Blog, August 25, 2014

“Unfolding Physiological State and the Big Data Variety Challenge,” ISTC for Big Data Blog, April 2, 2014
