If you’re responsible for data analysis, data mining, business intelligence, statistical applications, predictive analytics, or related matters for your organization, you’ve probably felt one or more of these frustrations:
- You have an interesting large data set (inside or outside your organization) but you can’t work with it efficiently and effectively. For example, it consists of spreadsheets, text, or other unstructured data; or it produces too many outliers.
- You can’t efficiently and effectively correlate data sets.
- You know the kind of data you want but don’t know where to get it in a usable form.
ISTC co-director Sam Madden, who’s also Professor of Electrical Engineering and Computer Science in MIT’s Computer Science and Artificial Intelligence Laboratory (MIT CSAIL), presented a keynote address on this very topic at VLDB 2013 last August.
Specifically, he introduced DataHub, a hosted interactive data processing, sharing, and visualization platform for large-scale data analytics that is now being built at MIT CSAIL. DataHub makes it easier to:
- Find, combine and clean data sets.
- Browse, visualize and query data sets in situ.
- Selectively share access and control.
- Store and protect data.
DataHub enables efficient, effective access. It includes flexible tools for ingesting and cleaning the data (for example, eliminating outliers). It gives you access to the data via a scalable, parallel, SQL-based analytic data processing engine optimized for extremely low-latency operation on large data sets.
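To make the ingest, clean, and query workflow concrete, here is a minimal sketch. It does not use DataHub's actual interface, which is not described here; Python's built-in `sqlite3` module stands in for the SQL engine, a simple interquartile-range rule stands in for outlier removal, and the sample readings and the names `READINGS`, `drop_outliers`, and `load_and_query` are made up for illustration.

```python
import sqlite3
import statistics

# Made-up sample data: (building, hourly_kwh). The 9999.0 value mimics a sensor glitch.
READINGS = [
    ("A", 10.0), ("A", 11.0), ("A", 9.5), ("A", 10.5), ("A", 9999.0),
    ("B", 20.0), ("B", 21.0), ("B", 19.0), ("B", 20.5), ("B", 19.5),
]


def drop_outliers(rows, k=1.5):
    """Keep rows whose value lies within k * IQR of the first and third quartiles."""
    values = [value for _, value in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(name, value) for name, value in rows if low <= value <= high]


def load_and_query(rows):
    """Load rows into an in-memory SQL table and return the average per building."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (building TEXT, kwh REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT building, AVG(kwh) FROM readings "
        "GROUP BY building ORDER BY building"
    ).fetchall()
    conn.close()
    return result


cleaned = drop_outliers(READINGS)      # cleaning step: the glitch reading is removed
averages = load_and_query(cleaned)     # SQL step: aggregate the cleaned data
print(averages)
```

Without the cleaning step, the single glitch reading would dominate the average for building A, which is the kind of problem the ingestion and cleaning tools are meant to address.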
DataHub also includes an interactive visualization system. And because DataHub is a hosted data platform, it can eliminate the need for you to manage your own database. DataHub is built on existing technologies such as:
- Monomi. Keeps data private while allowing queries on it. Paper
- Scorpion. Explains outliers in the output of an aggregate database query by identifying the input records responsible for them, so they can be examined or removed. Paper
- Massively Parallel Database (MapD). Enables you to run advanced analytics and visualization using inexpensive, off-the-shelf computer hardware. Paper
Current Status of DataHub
Researchers are now deploying DataHub at MIT, where it will be tested on data about MIT itself. Under a project called “MIT Living Lab,” researchers will allow the MIT community to access, selectively share, and use data about itself, using DataHub.
They are correlating disparate data sets; for example:
- Organization Data such as ID card swipes, network packets, expense reports, medical statistics, payroll, parking garages, HVAC, and academic publications
- Public Data such as crime statistics, local transportation, nearby restaurants, and nearby lodging
- Personal Data such as location/GPS, calendar, meetings, videos, photos, and exercise programs
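Correlating data sets like these usually comes down to two steps: joining records on a shared key, then measuring how the joined values move together. The sketch below illustrates that idea only and is not DataHub's method; the student IDs, attendance rates, and grade values are invented, and the helper names `join_on_key` and `pearson` are hypothetical.

```python
import math

# Invented example data keyed by a hypothetical anonymized student ID.
attendance_rate = {"s1": 0.9, "s2": 0.5, "s3": 0.7, "s4": 0.3}
grade_points = {"s1": 3.8, "s2": 3.0, "s3": 3.5, "s5": 2.0}


def join_on_key(left, right):
    """Inner join: keep only keys present in both data sets."""
    keys = sorted(set(left) & set(right))
    return keys, [left[k] for k in keys], [right[k] for k in keys]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


keys, xs, ys = join_on_key(attendance_rate, grade_points)
r = pearson(xs, ys)
print(keys, round(r, 3))
```

Note that the join silently drops `s4` and `s5`, which appear in only one data set. With real data, how unmatched records are handled can change the result, and a correlation on its own says nothing about cause.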
As a result, DataHub could help answer questions and support applications such as:
- Is going to class correlated with better grades?
- Which dining facilities are most popular with which groups?
- Bus utilization and on-demand routing
- Parking lot utilization
- Car-pool finding
Health and Medical
- Campus-wide public health; e.g., flu tracking
- Observing who is missing class or depressed
- Health signals: exercise and eating habits; partners
- Outpatient care
- Expert finding
- Data sharing between groups
DataHub builds on work by many individuals and teams at MIT CSAIL, and it is a promising platform. You can download Sam’s VLDB 2013 keynote address slides here: Vldb2013-Keynote-DataHub-SamMadden.