Big data is frequently discussed in the context of for-profit ventures or organisations, especially those that leverage the Internet or social media platforms, and much less so when public policy or the social sciences are involved. Seth Stephens-Davidowitz’s “Everybody Lies” argues that the large amounts of data and information now available can be used to answer a wider variety of socio-political questions (such as the extent to which racism cost former United States President Barack Obama votes in the 2008 and 2012 presidential elections). The title, in this vein, alludes to the problem that survey or study respondents will not necessarily be truthful in their responses, and yet many social scientists (including aspiring ones, like myself) still rely on methods like surveys for data collection.
Circumventing these self-report and social desirability biases is one of the advantages of big data. Stephens-Davidowitz cites four unique powers of big data: first, offering up new types of data; second, providing honest data; third, allowing us to zoom in on small subsets of people; and fourth, allowing us to do many causal experiments. Few would disagree with these observations, and in fact researchers in the social sciences would benefit from taking advantage of these openings not just for data collection, but also for the aggregation of variables to test hypotheses. With social science “becoming a real science … poised to improve our lives”, such a revolution “will come piecemeal, study by study, finding by finding”. Rather than contradicting the status quo, this seems to be a meaningful extension of existing and cumulative research endeavours, with studies built upon one another.
Be that as it may, the book is but a short primer on the supposedly vast potential of big data. Stephens-Davidowitz, after all, relies heavily on Google data for his studies, and many of the relationships studied barely scratch the surface. The degree to which such findings may be used by policymakers or researchers still depends on the quality of the research questions and the reliability of the data. The book states early on that “Everybody Lies” is “more than a collection of odd facts or one-off studies”, yet the organisation and flow of the chapters can feel odd, and readers may even be sceptical of the research questions asked. (Some of the chapters, moreover, appear to be organised by research method, such as regression discontinuity and natural experiments, rather than by topical theme.) Besides the limitations of statistical dimensionality and of ethics, sampling bias is a concern too: How do researchers reach those who do not use the Internet or social media platforms? Those in closed social media groups? Or users who choose to remain more discreet or private online?
In other words, the fundamentals of good research do not change: asking good research questions and operationalising them as hypotheses, before finding the evidence to test them. Tracking how big data improves this process, however, should be interesting.