Yet, in his speech on ‘Understanding Big Data and Its Impact on Your Laboratory,’ at the AACC this July, he stepped up to the status of a prophet. Less than a week later the Centers for Disease Control and Prevention (CDC) in Atlanta were wrestling with precisely the situation he had described as the agency confronted the rapidly spreading Ebola haemorrhagic fever virus.
Mayer-Schönberger, Professor of Internet Governance and Regulation at Oxford, told a story at AACC about a different, earlier virus, one whose spread public health experts could only hope to slow. To do that, they needed to track occurrences of the virus, but the traditional method of relying on physician reports produced a picture that lagged a week behind the actual incidence of the disease. ‘This is an eternity for an epidemic that’s underway,’ he stressed. Meanwhile, the internet search giant Google developed an alternative method of predicting the spread of the disease by plotting relevant queries among the five billion requests it handles each day. Google servers are not selective, he noted, storing every request, including its geographic origin. ‘They struck gold,’ said Mayer-Schönberger: the resulting map of requests was later validated by CDC data as coinciding with reports of the virus. The Google method was not perfect, however; a subsequent attempt to replicate the success, by predicting flu-related doctor visits, produced an estimate double the actual number reported to the CDC.
A 2013 paper exposing what it called ‘big data hubris’ also cautioned that number-crunching algorithms should not be dismissed, but rather seen as a complement to older data-collection methods.
According to Mayer-Schönberger, Google identified a correlation between flu incidence and search queries that, while imperfect, should not be discounted.
‘Big Data correlations will not always tell you the why of what you are seeing, but the what, and sometimes the what can be good enough,’ he said, explaining that the human brain tries to make sense of the world by creating causal connections between observed events, and scientists especially are trained to find causes. ‘With Big Data we cannot get close to causality, to understanding the causes of things, but with Big Data correlations we can better understand what is going on, and by doing that we may get closer to the causes,’ said Mayer-Schönberger.
As an example of how Big Data correlations can sometimes be ‘good enough’, he cited the observation made by the grocery store giant Walmart that, in the days following a hurricane warning, customers stock up on a breakfast product called Pop Tarts. ‘The committee wanted to know why people did that until someone cried: “Who cares! Move the Pop Tarts closer to the cash registers,”’ he said.
Small Data thinking is predominant in the clinical laboratory today, an intellectual approach the professor defined as seeking to squeeze insights from small amounts of data, to extract a representation of reality from randomly selected samples.
This method has been shaped by the tremendous cost of more complete data collection. ‘What happens when collecting data becomes cheap?’ Mayer-Schönberger queried. The economics of data collection, data storage and data analysis have changed. The Sloan Digital Sky Survey collected more astronomical information in its first week of operation than scientists had gathered in the entire history of astronomy. Closer to home, he said, the cost of DNA sequencing has fallen from billions of dollars in 2003 to a few thousand, while the time for sequencing has dropped from years to a single day.
Small Data requires a researcher to choose a focus for investigation. Big Data invites inquiries on everything with tools enabling a researcher to zoom in or zoom out, ‘to let the data speak,’ said Mayer-Schönberger. ‘Today we are drowning in information, terabytes and petabytes of information.
‘The rate of data accumulation has exploded 100-fold in just 20 years and, during this time, we have moved from an analogue world to a digital world, making information easier to store and to access. Big Data is how we will shift from a quantity of data to a quality of data.’
Clinical laboratories are data mines yielding massive amounts of important information, and Big Data will redefine expertise within the labs.
No one today expects decisions to be made on hunches, on the intuition of experts. Evidence-based medicine has already moved us in this direction, Mayer-Schönberger added, and concluded: ‘We will need statisticians who will become members of research teams with a greater role as quantification rises.’