Two weeks ago I was at the Annual Meeting of the Academy of Management (see this post). As you might have guessed, big data was a huge topic. There is tremendous potential for the use of data science in the business world. Many companies, not just the Googles of the world, sit on a wealth of underexploited data sources with the potential to drastically improve corporate strategy and organizational learning. But somehow people seem to believe that big data and machine learning techniques will also revolutionize the way we do science. I don’t think so. Here are three misconceptions that bother me:
“It will be all about prediction.”
Prediction is concerned with the joint distribution of variables. Take the canonical example of Google Flu Trends. Google is able to forecast an outbreak of the flu very accurately simply by looking at people’s internet search terms. All they essentially do is calculate the conditional distribution P(flu outbreak | search queries). In words: the probability of a flu outbreak given that people type certain search queries into Google.
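To make this concrete, here is a minimal sketch of estimating such a conditional probability from observed frequencies. All numbers are fabricated for illustration; the binary indicators `many_flu_queries` and `flu_outbreak` per week are hypothetical, not Google’s actual data or method.

```python
# Estimate P(flu outbreak | many flu-related search queries) from
# fabricated weekly observations of two binary indicators.
from collections import Counter

# (many_flu_queries, flu_outbreak) per week -- made-up data
weeks = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 0), (1, 1), (0, 0)]

counts = Counter(weeks)
# Conditional probability = outbreak-and-queries weeks / all queries weeks
p_outbreak_given_queries = counts[(1, 1)] / (counts[(1, 1)] + counts[(1, 0)])
print(p_outbreak_given_queries)  # 0.75
```

Nothing in this calculation says anything about *why* the two variables co-occur, which is exactly the point of the next paragraph.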
Although this approach is very effective, nobody in their right mind would believe that searching the internet is a cause of influenza activity. Both variables just happen to occur at the same time: the old story of correlation versus causation. Getting the causality right, however, is crucial for policy making. Would you rather forbid folks from googling hot lemon drink recipes, or try to spread awareness of the importance of hand hygiene, in order to prevent influenza deaths?
Prediction isn’t concerned with causality as long as forecasts are accurate. Who cares that the quality of a website is not actually determined (i.e., caused) by the number of other domains that link to it, as long as Google’s PageRank algorithm shows you the most relevant entries on page 1?
Prediction has always been useful. But if we want to intervene in a system (i.e., make effective policy), then we need to get the causality straight. Rather than the conditional distribution P(Y|X), we are then interested in P(Y|do(X)). In words: the probability of Y given that we force X to attain a certain value. These are completely different animals, and we should not lump the two together.
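The gap between the two can be shown in a small simulation. This is a hypothetical setup, not the actual flu data: a confounder Z (flu season) drives both X (search queries) and Y (outbreak), while X itself has no causal effect on Y. Conditioning on X then looks informative, but intervening on X changes nothing.

```python
# Contrast P(Y | X = 1) with P(Y | do(X = 1)) in a confounded system:
# Z -> X and Z -> Y, but no arrow from X to Y.
import random

random.seed(0)

def draw(intervene_x=None):
    z = random.random() < 0.5                           # flu season, yes/no
    x = z if intervene_x is None else intervene_x       # queries track season
    y = random.random() < (0.8 if z else 0.1)           # outbreak driven by Z only
    return x, y

n = 100_000

# Observational regime: condition on X = 1
obs = [draw() for _ in range(n)]
p_y_given_x1 = sum(y for x, y in obs if x) / sum(1 for x, y in obs if x)

# Interventional regime: force X = 1 regardless of Z
do = [draw(intervene_x=True) for _ in range(n)]
p_y_do_x1 = sum(y for _, y in do) / n

print(round(p_y_given_x1, 2))  # ~0.8: conditioning picks out flu-season weeks
print(round(p_y_do_x1, 2))     # ~0.45: forcing queries does not cause outbreaks
```

The observational estimate reproduces 0.8 because X perfectly tracks Z here, while the interventional one averages over Z (0.5 × 0.8 + 0.5 × 0.1 = 0.45): same system, two very different answers.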
“It will be all about data mining.”
Since prediction isn’t always sufficient, we can’t just feed lots of numbers into a computer and let sophisticated machine learning algorithms do their job. Under certain circumstances, knowledge about the joint distribution of variables allows you to infer an underlying causal structure. But this approach, known as causal search or causal structure learning, has severe limitations, simply because many different causal structures can produce the same observed distribution. In many cases there won’t be much progress without invoking assumptions about causal links and testing them in experiments. Data mining is useful for drawing our attention to interesting phenomena. But learning about causal effects usually requires you to specify a model beforehand.
“Big data is something new.”
Data is only “big” in comparison to our data processing capabilities. The early statisticians worked with sample sizes of 50 or fewer observations. Compared to that, the data sets every grad student collects these days are huge. But of course things become much easier if, rather than computing correlations and regression coefficients yourself, your software does the job for you. Advances in IT have always had a profound influence on the possibilities for analyzing data. And the more data you can handle, the more statistical procedures become feasible (especially the fully non-parametric stuff). But this process has been ongoing for almost 100 years now and is hardly something new.