Spreadsheets and Big Data

Five Questions with Database Expert Mike Cafarella

Database expert Michael Cafarella, professor at the University of Michigan, keynoted February 1 at the annual New England Database Summit, hosted at MIT CSAIL in Cambridge, Massachusetts. Professor Cafarella’s presentation mesmerized the audience with the reality that the lowly spreadsheet is, in the aggregate, a huge repository of Big Data just waiting to be mined – data that scientists and others have slaved over for years. But how to bring it all into the Big Data universe and use it in a Big Data way? The ISTC for Big Data Blog caught up with Professor Cafarella at the Summit.

ISTC Blog: Mike, why are spreadsheets important to Big Data?

Mike Cafarella: Spreadsheets are a missed opportunity. Millions of people, including non-DBAs, use them to perform database-style tasks. The spreadsheet is the “Swiss Army Knife” for data management. Collectively, spreadsheets hold a huge amount of interesting data that well-paid people have created.

But because spreadsheets are less structured than relational databases, they exist mostly outside of mainstream data management work. It would be all to the good if we could find a way to exploit all that data.

Ideally, this capability will become a commercial tool that everyone can use without the help of a DBA. With unstructured data, you tend to have a lot of variety and you can’t expect any one person to know what data is out there. It’s a lot like web search: you just assume the data is somewhere.

Imagine a tool that works roughly the same way. You type in a query that describes the data you want. The tool returns relevant data sources. You pick the one you want and manipulate it from there.
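
To make that query-then-pick workflow concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `Sheet` record, the `search` function, and the simple word-overlap scoring are all hypothetical, and a real tool would need far richer ranking and extraction.

```python
from dataclasses import dataclass


@dataclass
class Sheet:
    """Hypothetical record describing one spreadsheet in a collection."""
    name: str
    headers: list[str]


def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def search(query, sheets, limit=5):
    """Return names of sheets sharing words with the query, best match first.

    Scoring is an assumption made for illustration: the score is the number
    of query words found in the sheet name or its column headers.
    """
    terms = tokenize(query)
    scored = []
    for sheet in sheets:
        words = tokenize(sheet.name + " " + " ".join(sheet.headers))
        score = len(terms & words)
        if score:
            scored.append((score, sheet.name))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]
```

A user would call something like `search("flu cases by county", sheets)`, look over the returned names, and open the one that fits.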

ISTC Blog: Where does machine learning come in?

Mike Cafarella: If you buy into the promise of big data, you’re also buying into the whole promise of machine learning and statistical inference and manipulation. For example, Google is using its logs to figure out how to target ads more accurately.

There’s been a ton of work in that area, but one key component – feature engineering – is not getting enough attention. You can go to a machine-learning conference and learn all the permutations on these algorithms that you like. But at some point someone has to write a feature, a piece of code that takes some kind of raw data object and distills a few statistics about it that are relevant to a machine-learning algorithm.

For example, in a search engine, you might observe that if a text query occurs in the title of the web page, it may be especially relevant. That observation is a feature.

But it’s difficult to code. In the example above, your code would go to the title field of the HTML document. And maybe you think that’s all the code you’re going to write.

But later you discover that there’s a very important website on which the title sits inside the body of the web page itself, not in the title field. So now you write a piece of code that extracts the text from the title field of the HTML, except when the page comes from this one website.

You run that for a while, and then you learn that the 13th most important web site in the world presents another anomaly that you have to write code to accommodate. And so on.
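
The pattern described above, a simple feature that slowly accumulates per-site exceptions, might look roughly like the following Python sketch. Everything named here is an assumption for illustration: the `query_in_title` function, the `SITE_TITLE_TAG` table, and the `bigsite.example` host (imagined to keep its real title in an `<h1>`) are hypothetical and do not describe any actual search engine's code.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _FirstTagText(HTMLParser):
    """Collects the text inside the first occurrence of a given tag."""

    def __init__(self, tag):
        super().__init__()
        self._tag = tag
        self._inside = False
        self._done = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == self._tag and not self._done:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == self._tag and self._inside:
            self._inside = False
            self._done = True

    def handle_data(self, data):
        if self._inside:
            self.text += data


def first_tag_text(html, tag):
    """Return the stripped text of the first <tag> element, or ''."""
    parser = _FirstTagText(tag)
    parser.feed(html)
    parser.close()
    return parser.text.strip()


# Hypothetical exception table: host -> tag that really holds the title.
# Each newly discovered anomaly adds another entry (or another special case).
SITE_TITLE_TAG = {"bigsite.example": "h1"}


def query_in_title(query, url, html):
    """Feature: 1.0 if the query text occurs in the page's title, else 0.0."""
    query = query.strip().lower()
    if not query:
        return 0.0
    host = urlparse(url).hostname or ""
    tag = SITE_TITLE_TAG.get(host, "title")  # default: the HTML title field
    return 1.0 if query in first_tag_text(html, tag).lower() else 0.0
```

The exception table is the part that grows through trial and error: the feature itself stays one line, while the special cases keep piling up around it.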

This trial-and-error process is typical of feature engineering on big data. The data’s so diverse that whatever mental model you have is inevitably wrong. It’s an infuriating, burdensome, awful process. Therefore, one way to unlock the promise of big data is to provide better support for feature engineers.

One type of support is the proposed tool I mentioned earlier. As the human repairs errors, the tool silently applies those repairs everywhere they are relevant.

Another form of support is organizational. If you look at organizations that have built very successful large systems – for example, Google’s core ranking, IBM Watson, or the Netflix prize – you see a sociological pattern that’s unusual for software engineering: two groups that don’t communicate very well slap their work together pretty casually and all of a sudden it works better. In other words, Brooks’ Law (“adding manpower to a late software project makes it later”) doesn’t seem to apply to some of the trained systems.

So engineering features is difficult, but if you can make it easier for an individual then you can make it easier for the group.

ISTC Blog: How can the crowd help create smarter machines?

Mike Cafarella: The crowd is very important. Although it may be possible to synthesize a system to generate the key observations that underlie features, people are still the best way to do it.

It’s a long way from observation to code. We should focus on reducing that distance, rather than trying to engineer people out of the mix. They have a key role to play, but right now they’re getting burdened with a lot of boring work.

ISTC Blog: How will machine learning help harness social media for analysis?

Mike Cafarella: Consider the social media prediction tasks that are well known, like using Twitter searches to predict flu. One key challenge is that there are a great many signals to look at. So, on the one hand, you have a massive amount of information and a huge number of candidate sources of information.

But on the other hand, these social media tasks are useful exactly in those cases when you have very little conventional data. So the flu is an unusual case where we actually have pretty good conventional data. Many traditional statistical approaches to choosing relevant social media signals rely on conventional data, but in most cases you don’t have it. You need a mechanism that doesn’t rely on conventional data sources in order to figure out the relevant needles in the social media haystack.

ISTC Blog: Which vertical industries will benefit from machine learning?

Mike Cafarella: Any industry with lots of good datasets available, which usually means any industry where the cost of data collection is low. Finance was one of the first industries to have good data, was one of the first to benefit from statistical techniques, and will likely continue to benefit. Of course, on the web, everyone’s Apache servers are throwing off interesting data that they can mine for various reasons.

It’s interesting to ask which industries will have brand-new large datasets in the near future. One example is healthcare: devices are throwing off more data at the same time that legislation is giving people more incentives to find savings. Transportation is another one because it is deploying so many inexpensive sensors now.

A video of Mike Cafarella’s NEDB Summit presentation is available here.
