DBTruck: Humane Data Import

Replacing the Data Tweezers with a Data Dumptruck

By Eugene Wu, MIT CSAIL

Big Data doesn’t have to be petabytes to be big. It just needs to be difficult for the user to analyze or manage.

While databases are designed to make analyzing and managing data easier ― a single SQL statement can replace a hundred lines of scripting code ― simply getting the data into the database is a significant hurdle. It can make a gigabyte, or even a few hundred megabytes, feel like “big data” to a normal analyst.

I recently spent two hours importing a few datasets from the Federal Election Commission into PostgreSQL. Here is the process I had to repeat for each dataset:

  1. Download a raw text file from the FEC
  2. Figure out how to delimit each column (by staring really hard)
  3. Figure out each column’s type
  4. Name each of the 20+ columns (I’ve seen 100+ column datasets!)
  5. Write and run a CREATE TABLE statement
  6. Clean the data by removing invalid values, corrupted rows, etc.
  7. Reformat the data into the proper CSV-like format that your database expects (e.g., escaping commas)
  8. Use the bulk loading COPY command
  9. Watch the load fail because it encountered a misaligned row, or one containing corrupted or invalid data. Go back to step 6.
  10. Run SQL query
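To make the pain of steps 5–9 concrete, here is a minimal sketch of the manual workflow. It uses Python's standard-library `csv` and `sqlite3` modules in place of PostgreSQL so it is self-contained, and the column names and sample data are invented for illustration. Note how a single bad row aborts the entire load, which is exactly what sends you back to cleaning in step 6:

```python
# Hypothetical sketch of steps 5-8, using sqlite3 in place of
# PostgreSQL so the example is self-contained. One bad row aborts
# the whole load -- which is why step 9 sends you back to step 6.
import csv
import io
import sqlite3

raw = io.StringIO(
    "C00001,Smith for Congress,5000\n"
    "C00002,Jones PAC,not_a_number\n"   # corrupted row
    "C00003,Doe Committee,1200\n"
)

conn = sqlite3.connect(":memory:")
# Step 5: write the CREATE TABLE statement by hand.
conn.execute("CREATE TABLE filings (cmte_id TEXT, name TEXT, amount INTEGER)")

# Steps 7-8: parse and bulk-load the rows.
try:
    for row in csv.reader(raw):
        cmte_id, name, amount = row
        conn.execute("INSERT INTO filings VALUES (?, ?, ?)",
                     (cmte_id, name, int(amount)))  # raises on bad data
except ValueError as err:
    print("load failed, back to step 6:", err)
```

The failure is all-or-nothing: everything after the corrupted row is lost, even though most of the file was fine.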

Each step adds to the amount of time and frustration before I can perform even a cursory analysis. This makes sense for businesses that are willing to spend days or weeks figuring out the best way to load, store, and curate their data. But it does NOT make sense for a data analyst who wants to perform some quick analyses to decide if the data is even useful. She may only care about a handful of columns, and be perfectly happy to initially ignore the small amount of bad data.

There needs to be a solution for the data analyst.

DBTruck acts as a dumptruck for data analysts, so they can load data into a database more quickly and start working with it.

DBTruck is a project to simplify the job of importing datasets “from the wild.” Our initial tool takes a file containing tabular data and does everything it can to get it into a database. It figures out how to split your data, tries to infer the data types, and throws out rows that cause the load to fail before retrying. We currently support delimited and fixed-width text files, HTML files, URLs, and Excel files.
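This is not DBTruck's actual implementation, but a minimal sketch of two of the ideas just described: guessing a column's type from a sample value, and loading rows while discarding the ones that fail to parse instead of aborting. It uses stdlib `sqlite3` for self-containment, and the table, column names, and sample data are invented:

```python
# Sketch (not DBTruck's real code) of type inference plus
# drop-bad-rows-and-keep-going loading.
import csv
import io
import sqlite3

def infer_type(value):
    """Guess the narrowest SQL type that fits a sample value."""
    for cast, sql_type in ((int, "INTEGER"), (float, "REAL")):
        try:
            cast(value)
            return cast, sql_type
        except ValueError:
            pass
    return str, "TEXT"

raw = io.StringIO(
    "C00001,5000\n"
    "C00002,oops\n"      # bad row: discarded, not fatal
    "C00003,1200\n"
)
rows = list(csv.reader(raw))

# Infer each column's type from the first row (a real tool
# would sample many rows and vote).
schema = [infer_type(v) for v in rows[0]]

conn = sqlite3.connect(":memory:")
cols = ", ".join(f"col{i} {sql_type}" for i, (_, sql_type) in enumerate(schema))
conn.execute(f"CREATE TABLE data ({cols})")

loaded, dropped = 0, 0
for row in rows:
    try:
        typed = [cast(v) for (cast, _), v in zip(schema, row)]
    except ValueError:
        dropped += 1     # row doesn't match the schema: skip it
        continue
    conn.execute(f"INSERT INTO data VALUES ({','.join('?' * len(typed))})",
                 typed)
    loaded += 1
```

The key difference from the manual workflow is that a bad row costs you one row, not the whole load.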

There are several cool projects that are similar:

  • Mike Cafarella and Cloudera’s Record Breaker is a great tool for taking a structured text file and inferring column names and column types. I plan to integrate something like this in the future.
  • Google Refine and Data Wrangler are Web-based tools that help you clean up messy data, and transform it into something ready to be loaded into a database.
  • Azza Abouzied et al. can invisibly load MapReduce data into a database by piggybacking on the parsing step of modified MapReduce jobs.

You can download the DBTruck code here.  I hope you will, and please let me know what you think about it.

Eugene Wu is a database graduate student at MIT CSAIL, working towards his Ph.D. with the help of his advisor, Sam Madden. Previously, he graduated from the University of California at Berkeley in Fall 2006. He has published 18 scholarly pieces (the first in 2004, when he was only 18), and has balanced his academic study with internships at Google, Yahoo, Microsoft, and IBM. 

This entry was posted in Databases and Analytics, DBMS, ISTC for Big Data Blog.
