Data Science, Data Analysis
The October 2012 issue of Harvard Business Review prominently features the words “Getting Control of Big Data” on the cover, and the magazine includes these three related articles:
1. “Big Data: The Management Revolution,” by Andrew McAfee and Erik Brynjolfsson, pages 61–68;
2. “Data Scientist: The Sexiest Job of the 21st Century,” by Thomas H. Davenport and D.J. Patil, pages 70–76;
3. “Making Advanced Analytics Work for You,” by Dominic Barton and David Court, pages 79–83.
All three provide food for thought;
this post presents a brief summary of some of those thoughts.
One point made in the first article is that the “size” of a dataset – i.e., what constitutes “Big Data” – can be measured in at least three very different ways: volume, velocity, and variety. All of these aspects affect the Big Data characterization problem, but in different ways:
1. For very large data volumes, one fundamental issue is
the incomprehensibility of the raw data itself.
Even if you could display a data table with several million, billion, or
trillion rows and hundreds or thousands of columns, making any sense of this display
would be a hopeless task.
2. For high-velocity datasets – e.g., real-time, Internet-based data sources – the data volume is determined by the observation time: at a fixed rate, the longer you observe, the more you collect. If you are attempting to generate a real-time characterization that keeps up with this input data rate, you face a fundamental trade-off: richer datasets acquired over longer observation periods require longer computation times to process, making you less likely to keep up with the input data rate.
3. For high-variety datasets, a key challenge lies in finding useful ways to combine very different data sources into something amenable to a common analysis (e.g., combining images, text, and numerical data into a single joint analysis framework).
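As one illustration of what such a combination can look like in practice, here is a minimal, hypothetical R sketch (all column names and values are invented) that reduces a free-text field to simple numeric features so it can sit alongside numerical variables in a single analysis-ready table:

## Hypothetical example: mixing text and numeric data in one structured table
reviews <- data.frame(
  product_id = c(101, 102, 103),
  price      = c(19.99, 24.50, 9.95),
  comment    = c("great value", "arrived late and damaged", "works fine"),
  stringsAsFactors = FALSE
)

## Crude text-derived features: word count and a flag for negative terms
reviews$comment_words <- sapply(strsplit(reviews$comment, " "), length)
reviews$negative_flag <- grepl("late|damaged|broken", reviews$comment)

str(reviews)  # numeric, character, integer, and logical columns, ready for a joint analysis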
One practical corollary of these observations is the need for a computer-based data reduction process or “data funnel” that matches the volume, velocity, and/or variety of the original data sources to the ultimate needs of the organization. In large organizations, this data funnel generally involves a mix of different technologies and people. While it is not a complete characterization, some of these differences are evident from the primary software platforms used in the different stages of this data funnel: languages like HTML for dealing with web-based data sources; typically, some variant of SQL for dealing with large databases; a package like R for complex quantitative analysis; and, often, something like Microsoft Word, Excel, or PowerPoint for delivering the final results. In addition, to help coordinate some of these tasks, there are likely to be scripts, written either in an operating system environment like UNIX or in a platform-independent scripting language like Perl or Python.
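To make this funnel a bit more concrete, here is a minimal sketch, assuming a hypothetical SQLite database file (flights.db) containing a flights table with carrier and arr_delay columns, of the hand-off between the SQL and R stages: the database performs the first round of data reduction, and R takes over for the more detailed analysis.

## Minimal sketch of one "data funnel" stage: SQL for reduction, R for analysis
## (the database file, table, and column names here are hypothetical)
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "flights.db")

## Let the database reduce millions of rows to a small summary table ...
delays <- dbGetQuery(con, "
  SELECT carrier,
         AVG(arr_delay) AS mean_delay,
         COUNT(*)       AS n_flights
  FROM   flights
  GROUP  BY carrier")

## ... and continue the more detailed quantitative analysis in R
delays <- delays[order(delays$mean_delay, decreasing = TRUE), ]
head(delays)

dbDisconnect(con)

The same pattern carries over to the other database back-ends supported by the DBI package; only the dbConnect() call changes.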
An important point omitted from all
three articles is that there are at least two distinct application areas for
Big Data:
1. The class of “production applications,” which was discussed in these articles and illustrated with examples like the unnamed U.S. airline described by McAfee and Brynjolfsson that adopted a vendor-supplied procedure to obtain better estimates of flight arrival times, improving its ability to schedule ground crews and saving several million dollars per year at each airport. Similarly, the article by Barton and Court described a shipping company (again, unnamed) that used real-time weather forecast data and shipping port status data, developing an automated system to improve the on-time performance of its fleet.
Examples like these describe automated systems put in place to
continuously exploit a large but fixed data source.
2. The exploitation of Big Data for “one-off” analyses: a question is posed, and the data science team scrambles to find an answer. This use is not represented by any of the examples described in these articles. In fact, this second type of application
overlaps a lot with the development process required to create a production
application, although the end results are very different. In particular, the end result of a one-off analysis is a single set of results ultimately summarized to address the
question originally posed. In contrast,
a production application requires continuing support and often has to meet
challenging interface requirements between the IT systems that collect and
preprocess the Big Data sources and those that are already in use by the
end-users of the tool (e.g., a Hadoop cluster running in a UNIX environment
versus periodic reports generated either automatically or on-demand from a
Microsoft Access database of summary information).
A key point of Davenport and
Patil’s article is that data science
involves more than just the analysis of data: it is also necessary to identify
data sources, acquire what is needed from them, re-structure the results into a
form amenable to analysis, clean them up, and in the end, present the
analytical results in a useable form. In
fact, the subtitle of their article is “Meet the people who can coax treasure
out of messy, unstructured data,” and this statement forms the core of the article’s working definition for the term “data scientist.” (The authors
indicate that the term was coined in 2008 by D.J. Patil, who holds a position
with that title at Greylock Partners.) Also, two particularly interesting tidbits from this article were the authors’ suggestion that a good place to find data scientists is at R User Groups, and their description of R as “an open-source statistical tool favored by data scientists.”
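The “coax treasure out of messy, unstructured data” part of that job description is easy to illustrate; here is a tiny, hypothetical R sketch (invented column names and values) of the restructuring and cleaning steps that typically precede any analysis:

## Hypothetical raw extract: stray whitespace and "N/A" strings instead of numbers
raw <- data.frame(
  id     = c(" 001", "002 ", "003"),
  amount = c("12.5", "N/A", "7.25"),
  stringsAsFactors = FALSE
)

clean <- raw
clean$id     <- trimws(clean$id)                              # strip stray whitespace
clean$amount <- suppressWarnings(as.numeric(clean$amount))    # "N/A" becomes NA
clean[!is.na(clean$amount), ]                                 # keep the usable records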
Davenport and Patil emphasize the difference between structured and unstructured data, a distinction that is especially relevant to the R community since most of R’s procedures are designed to work with the structured data types discussed in Chapter 2 of Exploring Data in Engineering, the Sciences, and Medicine: continuous, integer, nominal, ordinal, and binary. More specifically, note that these variable types can all be included in data frames, the data object type that is best supported by R’s vast and expanding collection of add-on packages.
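As a small illustration (not taken from the book; all values here are invented), the five structured variable types can coexist in a single R data frame like this:

## Hypothetical data frame mixing the five structured variable types
patients <- data.frame(
  weight   = c(71.2, 68.5, 80.1),                    # continuous
  visits   = c(2L, 5L, 1L),                          # integer
  clinic   = factor(c("North", "South", "North")),   # nominal
  severity = factor(c("mild", "severe", "moderate"),
                    levels  = c("mild", "moderate", "severe"),
                    ordered = TRUE),                 # ordinal
  smoker   = c(TRUE, FALSE, FALSE)                   # binary
)
str(patients)   # confirms how R stores each variable type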