In the not too distant past, data were small and manageable. They (“data” is a plural word like numbers) were easily manipulated with a calculator or pencil and later by a computer. Small data were structured bits of information like tally marks, sales per quarter and percentages. All that information resided comfortably in a database until it was made visual by graphs and pie charts.
Big data differ from older, more familiar and traditional small data in four significant ways.
Volume: Walmart collects enough data in an hour to fill 50,000,000 filing cabinets. That volume exceeds the processing capacity of conventional databases. Without specialized software, storage, processing and querying capacity, there is just too much big data to make sense of it all.
Velocity: Thanks to the Internet and mobile phones, all that volume is not just picked up and recorded. It streams into business systems on a river of bits and bytes. To be effective, business response must stream out equally fast.
After a terrorist attack in 2004, Madrid, Spain, completely revamped its emergency system. With big data monitoring points throughout the city, it is able to respond to 81 percent of its police, fire and ambulance calls in less than eight minutes. That’s an extremely fast and tight feedback loop for a metropolitan area of over 230 square miles.
Variety: Some big data are tidy and structured like their smaller counterparts. These include a wide variety of test scores, financial data, personal information and the results of our last physical. Most big data are formless, messy, unstructured information like user clicks, GPS readings and social media text messages. If there is meaning here, it is not obvious. If there are trends, they are obscured by so much noise. Those who handle big data are able to extract meaning and importance from what resembles your grandparents’ cluttered attic.
Veracity: Big, fast and formless data are also uncertain data. Uncertainties arise from incomplete data, entry errors, processing problems, sensor inaccuracies, social media, latency of information, modeling approximations and plain old deception. When big data managers speak of ensuring data quality, they are referring to its inherent lack of veracity.
A New Life Example
In the era of small data, scientists claimed in any discussion, “Data wins.” Those with the numbers, observations and statistics trumped those who speculated from their cubicles. Today, the technology of information has expanded the old mantra; it is now, “Big Data wins.”
Here’s an example from the popular retail department store Target, which put it and statistician Andrew Pole in The New York Times Magazine and Forbes. Pole’s unique contribution to big data was a pregnancy predictor algorithm he developed from Target’s purchase tracking card, the Guest ID, demographic data and Target’s baby registry database.
As every parent knows, pregnancy changes everything. What Target knew was that pregnancy changes and solidifies shopping habits. Pregnant women become company-loyal parents.
Pole and his team at Target’s Guest Data and Analytics Services examined all the purchases made by thousands of women who signed up for the company’s baby registry. Probing all the data at their disposal, Pole noticed the products these women purchased early in their pregnancy. The list included unscented lotion, vitamin supplements, hand sanitizers, scent-free soap and washcloths. Further study expanded the list to 25 key items.
Through statistical and other analytic tools, Pole could predict from a woman’s purchases and demographic data if she was pregnant and, within a small window of error, her due date.
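Target has never published Pole’s actual model, but the basic idea of scoring a shopper from her purchases can be sketched in a few lines. Everything below is a hypothetical illustration; the item weights, the cutoff and the function name pregnancy_score are invented for this sketch, not drawn from Target’s algorithm:

```python
# Hypothetical sketch of a purchase-based pregnancy score.
# The weights and threshold are illustrative inventions; Target's
# real model, built from 25 key items, has never been published.

KEY_ITEM_WEIGHTS = {
    "unscented lotion": 0.30,
    "vitamin supplements": 0.25,
    "hand sanitizer": 0.15,
    "scent-free soap": 0.20,
    "washcloths": 0.10,
}

def pregnancy_score(purchases):
    """Sum the weights of key items found in a shopper's purchase history."""
    return sum(KEY_ITEM_WEIGHTS.get(item, 0.0) for item in set(purchases))

shopper = ["unscented lotion", "washcloths", "bread", "vitamin supplements"]
score = pregnancy_score(shopper)          # 0.30 + 0.10 + 0.25 = 0.65
likely_pregnant = score >= 0.5            # illustrative cutoff
```

A real predictor would weight items statistically from registry data and also estimate a due-date window, but the shape of the computation, a score per shopper compared against a threshold, is the same.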
He then applied his pregnancy predictor to every regular female shopper in Target’s national database. The result was a list of tens of thousands of women who were most likely pregnant. From its Guest ID data, Target knows exactly how to use coupons and ads to trigger purchases for each of those shoppers.
Target then sends coupons via mail, email or to the checkout counter for pregnancy- and baby-related items to women with high pregnancy-prediction scores. The coupons are a reminder that Target has what pregnant women need at each stage of their pregnancy and after they become mothers.
To avoid the charge that it was spying on shoppers or invading their privacy, Target mixed the newborn-related coupons with neutral discounts.
As The New York Times Magazine reported, in 2010 Target’s Mom and Baby sales increased dramatically and Pole was promoted.
How Big is Big?
Like maternity sweatpants, digital data are produced in a variety of sizes—small, medium, large, XL and Big. Knowing the digital data prefixes is an integral part of Informatics 101.
• Kilobytes: A byte represents one character, such as a single letter in this sentence. A kilobyte is 1,000 bytes, and this article contains approximately 40 kilobytes of information.
• Megabytes: A good high resolution digital photograph suitable for framing at a large size would ordinarily contain one megabyte or more of data. That’s one million bytes.
• Gigabytes: Seven minutes of high definition television? That’s a gigabyte or one billion bytes.
• Terabytes: The total Internet traffic for the first quarter of 1993 was a terabyte or one trillion bytes. Today, the amount of Internet data used in one second exceeds four terabytes. In a 2011 report on big data, McKinsey Global Institute estimated that by 2009, nearly all sectors in the United States economy with 1,000 or more employees had, on average, 200 terabytes of stored data.
• Petabytes: A petabyte or 1,000 terabytes is what Google all by itself processes each hour.
• Exabytes: This year, one exabyte of digital data, or one million terabytes, is created every 9.6 hours by the vast worldwide array of electronic devices.
• Others: There are further extensions such as zettabytes and yottabytes, all useful and mind-boggling descriptors quantifying the enormity of big data.
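Each prefix in the list above is simply the previous one multiplied by 1,000, so the whole ladder reduces to a one-line conversion. A quick sketch (the function name bytes_in is just for illustration):

```python
# Decimal (SI) byte prefixes: each step up is a factor of 1,000.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

def bytes_in(prefix):
    """Return the number of bytes in one unit of the given SI prefix."""
    return 1000 ** (PREFIXES.index(prefix) + 1)

print(bytes_in("kilo"))   # 1000 -- this article holds about 40 of these
print(bytes_in("tera"))   # 1000000000000 -- 1993's quarterly Internet traffic
print(bytes_in("peta") == 1000 * bytes_in("tera"))  # True
```

The same powers-of-1,000 scheme continues past yottabytes whenever standards bodies add new prefixes.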
Despite the ease of finding everyday examples of terabytes, petabytes and exabytes, these large datasets are beyond the ability of ordinary software to capture, store, manage and analyze.
McKinsey Global Institute states flatly, “We are generating so much data today that it is physically impossible to store it all.” These facts of digital life today make the old maxim, “You can’t manage what you don’t measure,” even more difficult to put into practice. New tools, new software and especially new skills are needed to mine, manage and salvage big data.
Big Data in the Classroom
Charlotte’s leader in big data education is Dr. Yi (pronounced Yee) Deng. He is dean and professor of the College of Computing and Informatics at the University of North Carolina Charlotte. Deng and his 60-plus faculty members oversee the education of 1,400 computing and informatics majors and grad students, the next generation of technologists and big data miners.
“Every day we generate nine times as much information as in all of the libraries in the United States combined,” says Deng. “Ninety percent of the world’s data have been generated in the last two years.” That’s information from cell phones, iPads, wireless devices, emails and computers plus old fashioned information in the form of reports, messages, television, radio and books. To further drive home the point, Deng adds that this massive data avalanche doubles every two years.
In May, Deng and his associates hosted a major big data conference at the Ritz Carlton. Charlotte Informatics 2012: Competing + Winning through Analytics attracted over 300 local and regional business leaders. The conference panelists discussed Informatics as it refers to big data, analytics, visualization and a host of other IT-related terms that describe the collection and analysis of data in new ways to drive strategic business insights.
Deng emphasizes that Informatics is one of the most important areas of study emerging today. Then he adds a wakeup call: “Companies that employ informatics in the strategy and management of their businesses are outgrowing and outperforming their competitors.”
To drive home the point, Deng cites a big data example presented at the conference—one closer to home than Madrid’s emergency response system. “Computers can sift through mountains of bank transactions,” he says, “and detect a few odd or questionable ones.” In the past, officials had to visually examine paper records to uncover fraud or the rare money laundering scheme. Today, for bankers armed with informatics programs and visualization techniques, fraudulent transactions stand out like buying unscented lotion at Target.
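The banks’ actual fraud-detection software is proprietary, but the core idea Deng describes, letting a computer sift a mountain of transactions and surface the few odd ones, can be sketched with simple outlier detection. The function name flag_outliers and the cutoff are assumptions for this sketch:

```python
# Minimal anomaly-detection sketch (not any bank's actual system):
# flag transactions far from the mean, measured in standard deviations.
import statistics

def flag_outliers(amounts, z_cutoff=2.0):
    """Return amounts more than z_cutoff standard deviations from the mean."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    # The cutoff is kept low because in a tiny sample the outlier itself
    # inflates both the mean and the standard deviation.
    return [a for a in amounts if stdev and abs(a - mean) / stdev > z_cutoff]

transactions = [40, 55, 38, 62, 47, 51, 44, 9500]  # one suspicious wire
print(flag_outliers(transactions))  # [9500]
```

Production systems use far richer features than the amount alone (location, timing, merchant, velocity of spending), but the principle is the same: the fraudulent transaction stands out from the statistical background.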
Supply, Demand, Skills
UNC Charlotte teaches informatics and computer science at the undergraduate and graduate level, but it is not the Queen City’s only big data educator. Northeastern University in Charlotte offers an MBA in health informatics as a hybrid program. Six MBA students currently study big data online and on the ground, says Assistant Dean of the Graduate Program in Computer Science Bryan Lackaye. Speaking of the program at Northeastern’s main campus in Boston, Lackaye says, “We can’t graduate students fast enough for the jobs available.”
UNC Charlotte’s interdisciplinary professional science master’s degree in bioinformatics would never be confused with an MBA. It emphasizes biology, chemistry, mathematics, statistics, computing, informatics and engineering.
Given such a rigorous program, it is no wonder these multitalented graduates are in demand. “Eighty companies have come to recruit,” says Deng. “Some are hiring 20 to 30 of our grads. Others need two or three. Demand exceeds supply.” Unfortunately, cuts to UNC Charlotte’s budget are only exacerbating the shortage of what has become a scarce and important human resource.
Note that there’s one key discipline missing from UNC Charlotte’s current informatics curriculum—business. Deng and Steve Ott, dean of UNC Charlotte’s Belk College of Business, are taking steps to combine education in business and informatics. A new North Carolina Initiative for Data Science and Analytics (NC-DSA) is in place linking the Belk College and the College of Computing and Informatics. NC-DSA rests on three pillars: new interdisciplinary academic programs in data science, state-of-the-art training in big data for working executives and managers, and an industry-university partnership that leverages academic research for business and industry innovation.
“We are in the process of creating a professional science master’s in data science and business analytics,” explains Ott. “We hope to offer it in Charlotte in a couple of years.” A new interdisciplinary professional degree in health informatics, a partnership among the College of Computing and Informatics, the College of Health and Human Services and the University Graduate School, is already being offered this year. Together these education and training programs will produce over 200 grads each year in coming years.
Writing in the October issue of Harvard Business Review, Tom Davenport, who keynoted the May conference in Charlotte, said that currently there are no university programs offering degrees in data science. He noted that North Carolina State is “busy adding big data exercises and coursework” to its master of science in analytics, reflecting the surging demand for talent in this area.
Data science is the new discipline linking informatics and practical applications. “Its practitioners are a new breed,” writes Davenport. “They are a hybrid of data hacker, analyst, communicator and trusted advisor.” They focus on the “I” in IT and the “D” in R&D. All are college educated at the bachelor’s level and beyond and conversant with social media.
Some, like Andrew Pole at Target, are oriented toward the retail sector, where they focus on determining what customers need before they know it themselves. Others are car nuts with an eye toward monitoring engine performance, customer satisfaction, social media and shop statistics to reduce repeat repairs.
In short, data scientists will have what Davenport calls “the sexiest job of the 21st century.” He equates this new profession with the “Wall Street quants” (quantitative analysts), physicists and mathematicians who shunned academia for careers with investment firms in the 1980s and ’90s.
When academically trained data scientists reach the marketplace in three to four years, they will join a self-made corps of practical data wranglers. These “grandparents” include D.J. Patil, who co-authored the Harvard Business Review article with Tom Davenport. He is an executive in residence at Greylock Partners in Silicon Valley.
So is Jake Klamka, a physicist who created Insight Data Science Fellowships, a six-week post-doctoral program based in Palo Alto, California. Klamka’s short course bridges the gap between academia and a career in data science. Add Jonathan Goldman, the data scientist who devised the “people you may know” feature on LinkedIn, a virtual space where thousands of data scientists hang out.
For parents needing guidance on post-high school education for their children, Carol Fodell has some advice. She is program director for global university programs at IBM and spoke on the topic at the May informatics conference in Charlotte. “Tell your kids to major in analytics!” she says. “That’s the bedrock of data science—the software and statistical methods organizations use to understand data and manage risk, performance and decisions.”
Even with new programs, more graduates and increased motivation for studying statistics, math, experimental design and visualization, there will continue to be a gap in the United States between high demand for big data talent and low supply. McKinsey Global Institute estimates the gap will be in the range of 50 to 60 percent by 2018.
Whether they come from do-it-yourself post-doctorate programs or UNC Charlotte, data scientists and their big data spinoff occupations are here to stay. A world without computers, smart phones and dozens of yet-to-be-invented devices may be the only aspect of big data that is unimaginable.