Here is a continually updated list of data sets we have assembled or are using in our research.
A World of Geo-coded Tweets (Web/Social Media Data Analysis)
A data set that includes links to PostgreSQL dump files containing nearly all geo-tagged Tweets and associated metadata for the whole world, along with detailed instructions for restoring this data into a working database. The data is currently being used as input into MapD (Massively Parallel Database), which uses multiple GPUs to run SQL queries as well as render point and heat maps on the data in real time.
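To give a rough sense of what a heat map over geo-tagged points involves, here is a minimal CPU-only sketch in Python that counts points per latitude/longitude grid cell. It is not MapD's implementation and does not reflect the schema of the dump files; the function name `bin_points`, the `cell_deg` parameter, and the sample coordinates are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def bin_points(points, cell_deg=1.0):
    """Count (lat, lon) points per grid cell of cell_deg degrees on a side."""
    counts = Counter()
    for lat, lon in points:
        # Shift to non-negative ranges, then floor to get the cell indices.
        row = math.floor((lat + 90.0) / cell_deg)
        col = math.floor((lon + 180.0) / cell_deg)
        counts[(row, col)] += 1
    return counts


# Hypothetical sample: two points that share a cell and one far away.
sample = [(42.36, -71.09), (42.37, -71.11), (51.50, -0.12)]
heat = bin_points(sample, cell_deg=1.0)
```

Each cell count could then be mapped to a color intensity. A GPU database would perform the same kind of aggregation in parallel over far more rows.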
MIMIC II (Health Care)
Data from hospital ICU information systems, hospital archives and other external data sources. Created as part of a Bioengineering Research Partnership involving an interdisciplinary team from academia (MIT), industry (Philips Medical Systems) and clinical medicine (Beth Israel Deaconess Medical Center), with the goal of developing and evaluating advanced ICU patient monitoring systems that will substantially improve the efficiency, accuracy and timeliness of clinical decision-making in intensive care.
MIMIC-III (Health Care)
MIMIC-III (Medical Information Mart for Intensive Care III) is an openly available data set developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more. See: “MIMIC-III, A Freely Accessible Critical Care Database” by Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35.
MODIS (Telescope/Satellite Imagery)
For our EarthDB project, we’re assembling a year’s worth of satellite imagery (from NASA’s MODIS instrument) for the whole globe. Our goal is to develop new tools for creating derived data products from the raw data, rather than requiring scientists to use existing data exports that NASA provides. We are using the Level 1B NASA data (the lowest level of raw data that is geo-referenced), which lives here. We use raw data at three spatial resolutions, available in sub-directories of the above link: MOD021KM is 1km resolution, MOD02HKM is 500m resolution, and MOD02QKM is 250m resolution. We also use the MOD03 metadata (also available in a sub-directory), and metadata from here to discriminate between data acquired in daytime and nighttime.
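The three Level 1B product names and their resolutions listed above can be captured in a small lookup table. The sketch below is only a convenience for scripts that need to pick a product by resolution; the names `MODIS_L1B_RESOLUTION_M`, `product_for_resolution`, and `finest_product` are hypothetical helpers, not part of any NASA tool or of the EarthDB code.

```python
# Nominal spatial resolution in meters for each Level 1B product named above.
MODIS_L1B_RESOLUTION_M = {
    "MOD021KM": 1000,
    "MOD02HKM": 500,
    "MOD02QKM": 250,
}


def product_for_resolution(resolution_m):
    """Return the Level 1B product name for a resolution given in meters."""
    for name, res in MODIS_L1B_RESOLUTION_M.items():
        if res == resolution_m:
            return name
    raise ValueError(f"no Level 1B product at {resolution_m} m")


def finest_product():
    """Return the product name with the smallest pixel size."""
    return min(MODIS_L1B_RESOLUTION_M, key=MODIS_L1B_RESOLUTION_M.get)
```

For example, `product_for_resolution(500)` returns `"MOD02HKM"`, which corresponds to the sub-directory name to look under.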
University of Washington CoAddition Testing Use-Case (Astronomy/Telescope Imagery)
The Large Synoptic Survey Telescope (LSST) is a large-scale, multi-organization initiative to build a new telescope and use it to continuously survey the entire visible sky. The LSST will generate tens of TB of telescope images every night. The planned survey will cover more sky with more visits than any survey before. The novelty of the project means that no current dataset can exercise the full complexity of the data expected from the LSST. For this reason, before the telescope produces its first images in a few years, astronomers are testing their data analysis pipelines, storage techniques, and data exploration using realistic but simulated images. More information on the simulation process can be found in this paper. This use-case provides a set of such simulated LSST images (approximately 1TB in binary format) and presents a simple but fundamental type of processing that needs to be performed on these images.
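The text above does not spell out the processing step, so the following is only a minimal sketch of co-addition as it is commonly understood: combining several exposures of the same patch of sky into one image by averaging them pixel by pixel. It assumes the images are already aligned and of equal size, and represents each image as a plain list of rows. A real pipeline would also need registration, weighting, and masking of bad pixels. The function name `coadd` and the sample values are hypothetical.

```python
def coadd(images):
    """Return the pixel-wise mean of equally sized 2-D images (lists of rows)."""
    if not images:
        raise ValueError("need at least one image")
    n_rows, n_cols = len(images[0]), len(images[0][0])
    for img in images:
        if len(img) != n_rows or any(len(row) != n_cols for row in img):
            raise ValueError("all images must have the same shape")
    n = len(images)
    # Average the value at each (row, column) position across all exposures.
    return [
        [sum(img[r][c] for img in images) / n for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Two hypothetical 2x2 exposures of the same sky region.
exposures = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 2.0], [1.0, 0.0]],
]
stacked = coadd(exposures)
```

Averaging many exposures reduces per-pixel noise, which is why a stacked image can reveal fainter sources than any single exposure.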