Data Science, Data Analysis
The October 2012 issue of Harvard Business Review prominently features the words “Getting Control of Big Data” on the cover, and the magazine includes these three related articles:
1. “Big Data: The Management Revolution,” by Andrew McAfee and Erik Brynjolfsson, pages 61–68;
2. “Data Scientist: The Sexiest Job of the 21st Century,” by Thomas H. Davenport and D.J. Patil, pages 70–76;
3. “Making Advanced Analytics Work for You,” by Dominic Barton and David Court, pages 79–83.
All three provide food for thought;
this post presents a brief summary of some of those thoughts.
One point made in the first article is that the “size” of a dataset – i.e., what constitutes “Big Data” – can be measured in at least three very different ways: volume, velocity, and variety. All of these aspects affect the Big Data characterization problem, but in different ways:
1. For very large data volumes, one fundamental issue is
the incomprehensibility of the raw data itself.
Even if you could display a data table with several million, billion, or
trillion rows and hundreds or thousands of columns, making any sense of this display
would be a hopeless task.
2. For high-velocity datasets – e.g., real-time, Internet-based data sources – the data volume is determined by the observation time: at a fixed rate, the longer you observe, the more you collect. If you are attempting to generate a real-time characterization that keeps up with this input data rate, you face a fundamental trade-off: richer datasets acquired over longer observation periods require longer computation times to process, making you less likely to keep up with the input data rate.
3. For high-variety datasets, a key challenge lies in finding useful ways to combine very different data sources into something amenable to a common analysis (e.g., combining images, text, and numerical data into a single joint analysis framework).
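As one illustration of what such a combination can look like in practice, here is a minimal, hypothetical R sketch (all column names and values are invented) that reduces a free-text field to simple numeric features so it can sit alongside numerical variables in a single analysis-ready table:

## Hypothetical example: mixing text and numeric data in one structured table
reviews <- data.frame(
  product_id = c(101, 102, 103),
  price      = c(19.99, 24.50, 9.95),
  comment    = c("great value", "arrived late and damaged", "works fine"),
  stringsAsFactors = FALSE
)

## Crude text-derived features: word count and a flag for negative terms
reviews$comment_words <- sapply(strsplit(reviews$comment, " "), length)
reviews$negative_flag <- grepl("late|damaged|broken", reviews$comment)

str(reviews)  # numeric, character, integer, and logical columns, ready for a joint analysis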
One practical corollary of these observations is the need for a computer-based data reduction process or “data funnel” that matches the volume, velocity, and/or variety of the original data sources to the ultimate needs of the organization. In large organizations, this data funnel generally involves a mix of different technologies and people. While it is not a complete characterization, some of these differences are evident from the primary software platforms used in the different stages of this data funnel: languages like HTML for dealing with web-based data sources; typically, some variant of SQL for dealing with large databases; a package like R for complex quantitative analysis; and, often, something like Microsoft Word, Excel, or PowerPoint for delivering the final results. In addition, to help coordinate some of these tasks, there are likely to be scripts, written either in an operating system environment like UNIX or in a platform-independent scripting language like Perl or Python.
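To make this funnel a bit more concrete, here is a minimal sketch, assuming a hypothetical SQLite database file (flights.db) containing a flights table with carrier and arr_delay columns, of the hand-off between the SQL and R stages: the database performs the first round of data reduction, and R takes over for the more detailed analysis.

## Minimal sketch of one "data funnel" stage: SQL for reduction, R for analysis
## (the database file, table, and column names here are hypothetical)
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "flights.db")

## Let the database reduce millions of rows to a small summary table ...
delays <- dbGetQuery(con, "
  SELECT carrier,
         AVG(arr_delay) AS mean_delay,
         COUNT(*)       AS n_flights
  FROM   flights
  GROUP  BY carrier")

## ... and continue the more detailed quantitative analysis in R
delays <- delays[order(delays$mean_delay, decreasing = TRUE), ]
head(delays)

dbDisconnect(con)

The same pattern carries over to the other database back-ends supported by the DBI package; only the dbConnect() call changes.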
An important point omitted from all
three articles is that there are at least two distinct application areas for
Big Data:
1. The class of “production applications,” which was discussed in these articles and illustrated with examples like the unnamed U.S. airline described by McAfee and Brynjolfsson that adopted a vendor-supplied procedure to obtain better estimates of flight arrival times, improving its ability to schedule ground crews and saving several million dollars per year at each airport. Similarly, the article by Barton and Court described a shipping company (again, unnamed) that used real-time weather forecast data and shipping port status data, developing an automated system to improve the on-time performance of its fleet.
Examples like these describe automated systems put in place to
continuously exploit a large but fixed data source.
2. The exploitation of Big Data for “one-off” analyses: a question is posed, and the data science team scrambles to find an answer. This use is not represented by any of the examples described in these articles. In fact, this second type of application
overlaps a lot with the development process required to create a production
application, although the end results are very different. In particular, the end result of a one-off analysis is a single set of results ultimately summarized to address the
question originally posed. In contrast,
a production application requires continuing support and often has to meet
challenging interface requirements between the IT systems that collect and
preprocess the Big Data sources and those that are already in use by the
end-users of the tool (e.g., a Hadoop cluster running in a UNIX environment
versus periodic reports generated either automatically or on-demand from a
Microsoft Access database of summary information).
A key point of Davenport and
Patil’s article is that data science
involves more than just the analysis of data: it is also necessary to identify
data sources, acquire what is needed from them, re-structure the results into a
form amenable to analysis, clean them up, and in the end, present the
analytical results in a useable form. In
fact, the subtitle of their article is “Meet the people who can coax treasure
out of messy, unstructured data,” and this statement forms the core of the article’s working definition for the term “data scientist.” (The authors
indicate that the term was coined in 2008 by D.J. Patil, who holds a position
with that title at Greylock Partners.) Also, two particularly interesting tidbits from this article were the authors’ suggestion that a good place to find data scientists is at R User Groups, and their description of R as “an open-source statistical tool favored by data scientists.”
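The “coax treasure out of messy, unstructured data” part of that job description is easy to illustrate; here is a tiny, hypothetical R sketch (invented column names and values) of the restructuring and cleaning steps that typically precede any analysis:

## Hypothetical raw extract: stray whitespace and "N/A" strings instead of numbers
raw <- data.frame(
  id     = c(" 001", "002 ", "003"),
  amount = c("12.5", "N/A", "7.25"),
  stringsAsFactors = FALSE
)

clean <- raw
clean$id     <- trimws(clean$id)                              # strip stray whitespace
clean$amount <- suppressWarnings(as.numeric(clean$amount))    # "N/A" becomes NA
clean[!is.na(clean$amount), ]                                 # keep the usable records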
Davenport and Patil emphasize the difference between structured and unstructured data, a distinction that is especially relevant to the R community since most of R’s procedures are designed to work with the structured data types discussed in Chapter 2 of Exploring Data in Engineering, the Sciences, and Medicine: continuous, integer, nominal, ordinal, and binary. More specifically, note that these variable types can all be included in data frames, the data object type that is best supported by R’s vast and expanding collection of add-on packages.
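As a small illustration (not taken from the book; all values here are invented), the five structured variable types can coexist in a single R data frame like this:

## Hypothetical data frame mixing the five structured variable types
patients <- data.frame(
  weight   = c(71.2, 68.5, 80.1),                    # continuous
  visits   = c(2L, 5L, 1L),                          # integer
  clinic   = factor(c("North", "South", "North")),   # nominal
  severity = factor(c("mild", "severe", "moderate"),
                    levels  = c("mild", "moderate", "severe"),
                    ordered = TRUE),                 # ordinal
  smoker   = c(TRUE, FALSE, FALSE)                   # binary
)
str(patients)   # confirms how R stores each variable type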