We define Big Data as data that is too big, too fast, or too hard for existing tools to process.
- “Too big” means that organizations increasingly have to deal with petabyte-scale collections of data, which come from click streams, transaction records, sensors, and many other places.
- “Too fast” means that there is a lot of data that needs to be processed quickly – for example, to perform fraud detection at a point of sale, or determine what ad to show to a user on a web page, or to re-route traffic in a congested city.
- “Too hard” is a catchall for data that doesn’t fit neatly into existing processing tools – for example, data that needs more complex analysis than existing tools can readily provide.
Here are some of the problems that our research may help solve. Similar challenges arise today across most industry sectors, including insurance, retail, telecommunications, and energy.
Web
Many Web sites now register millions of unique visitors per day. Each of these visitors may access and create a range of content. This can easily amount to tens to hundreds of gigabytes per day (tens of terabytes per year) of accumulating user and log data, even for medium-sized websites. Increasingly, companies want to be able to mine this data to understand limitations of their sites, improve response time, offer more targeted ads, and so on. Doing this requires tools that can perform complicated analytics on data that far exceeds the memory of a single machine or even a cluster of machines.
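As an illustration, simple analytics over logs that exceed memory can often be run in a single streaming pass, keeping only aggregates in memory. This is a minimal sketch; the log format and field positions are hypothetical, not taken from any particular system:

```python
from collections import Counter

def top_pages(log_path, k=10):
    """Count page hits in one pass over the log, so only the
    per-page counters (not the log itself) must fit in memory."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            # Hypothetical log format: "timestamp user_id url status"
            parts = line.split()
            if len(parts) >= 3:
                counts[parts[2]] += 1
    return counts.most_common(k)
```

A single pass like this covers the "too big" case for simple aggregates; the more complicated analytics mentioned above (joins across visits, session reconstruction, ad targeting) are what push even medium-sized sites toward cluster-scale tools.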
Finance

Banks and other financial organizations have vast quantities of data about consumer spending habits, credit card transactions, financial markets, and so on. This data is massive: for example, Visa processes more than 35B transactions per year; if they record 1 KB of data per transaction, this represents roughly 35 terabytes of data per year. Visa, and large banks that issue Visa cards, would like to use this data in a number of ways: to predict customers at risk of default, to detect fraud, to offer promotions, and so on. This requires complex analytics. Additionally, this processing needs to be done quickly and efficiently, and needs to be easy to tune as new models are developed and refined.
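The volume estimate above is simple arithmetic; a back-of-envelope check, using the figures from the text and assuming decimal units:

```python
# Back-of-envelope: annual data volume at 35B transactions of ~1 KB each.
transactions_per_year = 35e9
bytes_per_transaction = 1e3   # assuming 1 KB = 1,000 bytes

total_bytes = transactions_per_year * bytes_per_transaction
print(f"{total_bytes / 1e12:.0f} TB per year")  # prints "35 TB per year"
```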
Healthcare

Sensors offer the potential to continuously monitor a patient’s health. Recent advances in wireless networking, miniaturization of sensors via MEMS processes, and incredible advances in digital imaging technology have made it possible to cheaply deploy wearable sensors that monitor a number of biological signals on patients, even outside of the doctor’s office. These signals measure functioning of the heart, brain, circulatory system, and so on. Additionally, accelerometers and touch screens can be used to assess mobility and cognitive function. This creates an unprecedented opportunity for doctors to provide outpatient care, by understanding how patients are progressing outside of the doctor’s office and when they need to be seen urgently. And by correlating signals from thousands of different patients, doctors and clinical researchers may be able to develop a new understanding of what is normal or abnormal, or what kinds of signal features indicate potentially serious problems.
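One elementary form of such signal analysis is flagging readings that deviate sharply from a patient's recent baseline. The window size and z-score threshold below are illustrative choices for a sketch, not clinical guidance:

```python
import statistics

def flag_anomalies(samples, window=5, z=3.0):
    """Flag indices whose value lies more than z standard deviations
    from the mean of the preceding window of samples."""
    alerts = []
    for i in range(window, len(samples)):
        recent = samples[i - window:i]
        mean = statistics.mean(recent)
        sd = statistics.pstdev(recent) or 1.0  # guard against a flat window
        if abs(samples[i] - mean) / sd > z:
            alerts.append(i)
    return alerts
```

This handles a single stream per patient; the harder opportunity described above is correlating many signals across thousands of patients to learn what "abnormal" actually looks like.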
Biotechnology and Drug Discovery
Businesses that use novel instruments – such as gene-sequencing systems – have some of the biggest challenges with Big Data. Today, gene sequencing is used extensively by biotechnology companies and pharmaceutical companies to develop more-effective drugs or novel therapies for diseases. Gene sequencing has both Big Data and Big Velocity problems. Today, biologists have “pipelines” to clean the sequencing data and then “cook” the raw data into usable form. This cooking involves putting all of the “short reads” together into a single human genome and then looking for interesting strings of base pairs in the result. Biologists envision storing sequence data for millions of humans, so they can perform data mining looking for genomic patterns that identify particular diseases. Using current technologies, running a single analysis can take days, slowing the scientific process and putting a practical limit on the number of analyses. With new computational technologies for Big Data, we could dramatically reduce current processing times, cutting biologists’ “time-to-insight” by an order of magnitude.
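The pattern-search stage can be pictured with a toy example: scanning reads (strings over A/C/G/T) for a motif of interest. Real pipelines use indexed, approximate matching at vastly larger scale; the reads and motif here are invented for illustration:

```python
def find_motif(reads, motif):
    """Return (read_index, offset) pairs for every occurrence of a
    base-pair motif, allowing overlapping matches."""
    hits = []
    for i, read in enumerate(reads):
        start = read.find(motif)
        while start != -1:
            hits.append((i, start))
            start = read.find(motif, start + 1)
    return hits
```

Running this kind of scan over sequence data for millions of genomes, rather than a handful, is exactly the jump that turns a days-long analysis into a Big Data problem.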
Science

The sciences have always been big consumers of Big Data, with the added challenge of analyzing disparate data from multiple instruments and sources. For example, astronomy involves continuously collecting and analyzing huge amounts of image data, from increasingly sophisticated telescopes.
The next “big science” astronomy project is the Large Synoptic Survey Telescope (LSST). The telescope is being built in Chile, and will ultimately collect about 55 petabytes of raw data. There is a streaming pipeline where software looks for patterns in the images (e.g., stars and other celestial objects) and then looks for the same object in different telescope images to obtain trajectory information. All of the data will be stored, and astronomers want to reprocess the raw imagery with different algorithms, as there is no universal image-processing algorithm that pleases all astronomers. For example, different astronomers use different thresholds for qualifying observations as either data or noise. New Big Data technologies can provide the capacity to store, process, analyze, visualize and share large amounts of image data, as well as remote sensing data from satellites. Astronomy looks out to the sky from telescopes; remote sensing looks in towards the earth from satellites. The two are roughly mirror images of each other, presenting an opportunity for new approaches to analyzing both kinds of data simultaneously.
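That threshold sensitivity can be sketched with a toy detector: whether a pixel counts as a source or as noise depends on the sigma cutoff chosen, so reprocessing the same raw frame with a different cutoff yields a different catalog. The pixel values and cutoffs below are invented for illustration:

```python
import statistics

def detect(pixels, n_sigma):
    """Return indices of pixels brighter than the frame mean by more
    than n_sigma standard deviations; n_sigma is the tunable cutoff."""
    mean = statistics.mean(pixels)
    sd = statistics.pstdev(pixels)
    return [i for i, p in enumerate(pixels) if p > mean + n_sigma * sd]
```

A strict cutoff keeps only the brightest object, while a looser one admits fainter candidates, which is precisely why astronomers want the raw imagery retained for reprocessing.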
Transportation and Government
As cities experience greater rates of population growth, municipal governments are struggling to keep up with demands for public safety, infrastructure, environmental management, transportation, and energy. Cities generate a wealth of Big Data that can help manage that growth.
Imagine an “urban-scale” sensing system for a large city, such as Boston or Portland. Such a system gathers input from a large number of sensors: for example, traffic/security cameras, GPS-enabled vehicles and devices, cellphones, electricity and water monitors, and weather and air-quality sensors. In addition, the system pulls information from a number of static data sources, including information about various events (road blockages, NBA games), maps of streets, parks, buildings, restaurants and shops, and so on.
New computational, array, and spatial computing technologies could enable a city to store, combine, and analyze this disparate data along multiple dimensions, exploring how it can inform individual users and promote sustainability, safety, and health. For example, current air-quality alerts typically cover a whole day over an entire metropolitan region. However, gas levels and particulate concentrations vary considerably at the scale of minutes and meters, especially near roadways. With the right Big Data framework, the city could issue much more specific alerts to pedestrians and cyclists about harmful emissions levels and, conversely, manage traffic to minimize emissions where people are concentrated.