Crowdsourcing Big Data

By Barzan Mozafari, Ph.D., MIT CSAIL

Crowdsourcing has become a popular means of performing tasks that are difficult for computers, including entity resolution, audio transcription, image annotation, sentiment analysis, and document summarization and editing.

Although humans are often more accurate than machines at these tasks, using humans to annotate large datasets quickly becomes infeasible at Web scale. For instance, tens of millions of documents, images and micro-blogs are uploaded every day, while having a human label a single item can cost several cents and take several minutes.

Here in the database group, we recently launched a project called “Crowdsourcing Big-Data.” In this project, we have worked on integrating machine learning into crowdsourcing workflows to label very large datasets faster, more cheaply and even more accurately. Specifically, given a database of unlabeled items (e.g., Tweets or images) and a classifier that, once sufficiently trained, can attach a label to each item, we need to determine (1) which items, if labeled, would be most beneficial for training and (2) which items are inherently hard for the classifier, no matter how much training data is provided.

We have developed an algorithm based on the non-parametric bootstrap to estimate these quantities accurately and efficiently, and then to use those estimates to optimally allocate a budget (time or money) for acquiring labels from the crowd, so as to achieve the best overall cost or quality.
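To make the bootstrap idea concrete, here is a minimal sketch of how resampling can expose the items a classifier is unsure about. It is an illustration under stated assumptions, not the algorithm from our paper: the one-dimensional threshold classifier, the function names `train_threshold` and `bootstrap_uncertainty`, and the use of the Bernoulli variance p(1 - p) as the uncertainty score are all hypothetical choices made for brevity.

```python
import random
from statistics import mean


def train_threshold(points):
    """Toy classifier (an assumption for this sketch): put a threshold
    halfway between the two class means. `points` is a list of
    (x, label) pairs with label in {0, 1}; class 1 is assumed to have
    the larger x values."""
    pos = [x for x, y in points if y == 1]
    neg = [x for x, y in points if y == 0]
    if not pos or not neg:
        return None  # degenerate resample: one class is missing
    return (mean(pos) + mean(neg)) / 2.0


def bootstrap_uncertainty(labeled, unlabeled, rounds=200, seed=0):
    """Score each unlabeled x by how much its predicted label varies
    across classifiers trained on bootstrap resamples of `labeled`."""
    rng = random.Random(seed)
    votes = [[] for _ in unlabeled]
    for _ in range(rounds):
        # Non-parametric bootstrap: resample the labeled set with replacement.
        sample = [rng.choice(labeled) for _ in labeled]
        threshold = train_threshold(sample)
        if threshold is None:
            continue
        for i, x in enumerate(unlabeled):
            votes[i].append(1 if x > threshold else 0)
    scores = []
    for v in votes:
        p = mean(v) if v else 0.5   # fraction of resamples predicting label 1
        scores.append(p * (1 - p))  # variance of a Bernoulli(p) prediction
    return scores


labeled = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
unlabeled = [0.5, 5.0, 9.5]
scores = bootstrap_uncertainty(labeled, unlabeled)
```

In this toy data, the items at 0.5 and 9.5 get the same prediction from every resampled classifier and score zero, while the item at 5.0 sits near the decision boundary and receives a positive score, which makes it the natural candidate to send to the crowd.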

In the two diagrams above, a series of queries (here, images to be tagged) is first posed to the crowd-enabled database. Then, our algorithm chooses which queries to ask the machine learning algorithm and which ones to ask the crowd.
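The routing step can be sketched as a simple greedy split. This is only an illustration: the function name `route_queries`, the per-item uncertainty scores, and the rule "send the most uncertain items to the crowd until the budget runs out" are assumptions for the example, and the allocation in our paper is more involved.

```python
def route_queries(uncertainty, budget):
    """Split item ids into (crowd, machine) lists.

    `uncertainty` maps an item id to a score where higher means the
    classifier is less sure; `budget` is the number of crowd questions
    we can afford. Both are hypothetical inputs for this sketch."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Rank items from most to least uncertain.
    ranked = sorted(uncertainty, key=lambda item: uncertainty[item], reverse=True)
    crowd = ranked[:budget]    # ask humans about the hardest items
    machine = ranked[budget:]  # let the classifier label the rest
    return crowd, machine


image_scores = {"img1": 0.02, "img2": 0.24, "img3": 0.10, "img4": 0.21}
crowd, machine = route_queries(image_scores, budget=2)
# crowd   -> ["img2", "img4"]
# machine -> ["img3", "img1"]
```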

While the problem is similar to classical active learning, our algorithm addresses a number of crowd-related challenges that the traditional active learning literature generally did not face, such as:

  • a higher degree of noise (as labels are provided by the crowd instead of domain experts)
  • generality requirements (instead of a specific and well-understood domain, a crowd-sourced database must be able to support arbitrary, user-supplied classification algorithms)
  • scalability to handle massive datasets (limiting the number of tasks posed to the crowd, as well as limiting the training overhead of the classifier), and
  • ease of use (minimal supervision from the user).
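As an illustration of the noise challenge in the list above, one common way to cope with unreliable crowd answers is to ask several workers the same question and take a majority vote. This is a generic technique shown for context, not necessarily the one our algorithm uses; the function name `majority_label` and the agreement ratio it returns are assumptions for the sketch.

```python
from collections import Counter


def majority_label(answers):
    """Return (label, agreement) for one item, where `answers` is the
    list of labels given by different crowd workers and `agreement` is
    the fraction of workers who chose the winning label."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    label, top = counts.most_common(1)[0]
    return label, top / len(answers)


label, agreement = majority_label(["cat", "cat", "dog"])
```

A low agreement value signals an item on which workers disagree, which may be worth asking about again if the budget allows.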

For the same accuracy requirement, our algorithms can reduce the number of questions to the crowd by one to two orders of magnitude compared to the baseline, and by two to eight times compared to state-of-the-art active learning schemes.

To learn more, please download our paper, Active Learning for Crowd-sourced Databases.

Barzan Mozafari is currently a Postdoctoral Associate at the Massachusetts Institute of Technology. He earned his Ph.D. in Computer Science from the University of California at Los Angeles. He is passionate about building large-scale data-intensive systems, with a particular interest in database-as-a-service clouds, distributed systems, and crowdsourcing. In his research, he draws on advanced mathematical models to deliver practical database solutions. He has won several awards and fellowships, including SIGMOD 2012’s best paper award.

