The Internet is, in theory, a statistician’s dream. Let’s hope it’s not a nightmare. In our March 10, 2014 post about the irreproducible results of an Ngram search, we warned that nothing prevents Google from changing its definitions or conventions … and not telling us about them. Because Google tells us precious little, it seems wise not to base important conclusions or critical decisions solely on any relatively lengthy history of the counts data. And “relatively lengthy” may be as short as a month or a quarter, because it is easy for Google to change its mind and its software. The issue was brought to our attention in the December 21 New York Times by economist Seth Stephens-Davidowitz, who apparently makes a career of analyzing counts produced by Google searches for certain keywords, or survey data collected by others. Overall, the New York Times article showed mostly upbeat behavior during the holiday season, which is what one would hope for. Whether the annual trends are accurate, likely only Google knows for sure. We are not suggesting that Google is doing anything malicious in making its changes; they may all be made with the goal of improved accuracy and usability. But without more transparency we will never know.
Social networks collect enormous amounts of data about people’s intentions and actions, but they have come into being so quickly that there has not been time to glean much wisdom from this data. The large majority of both the social network companies’ staff and their users have little or no experience with the practical challenges of collecting and interpreting data. A just-published study by Derek Ruths of McGill University and Jürgen Pfeffer of Carnegie Mellon University, reported in Science Daily, warns of some of the pitfalls. Foremost among them is failing to deal with the biases that arise from the composition of the sample. Technology Bloopers’ Statistics and Surveys webpage states at the outset “Be sure your sample is representative.” Different social networks attract different sorts of people, in terms of age, gender, ethnicity, etc. Findings based on data from any one network almost certainly do not represent the U.S. population as a whole. One flagrant example of this kind of error, which occurred decades before most of the people designing or using today’s social networks were born, was the mistaken prediction that Dewey would beat Truman in the 1948 U.S. presidential race; it was caused by a failure to sample voters properly. Similar errors have almost certainly already been made when decisions were based on social network data without an understanding of the underlying samples.
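To make the sampling problem concrete, here is a minimal Python sketch of how an unrepresentative sample can distort a simple percentage, and how reweighting by known population shares (often called post-stratification) changes the answer. Every number below is invented for illustration, and the names `population_share`, `sample`, `raw_rate`, and `poststratified_rate` are our own; none of this comes from the Ruths and Pfeffer study or from any real network.

```python
# All figures are hypothetical, chosen only to illustrate the effect.

# Assumed share of each age group in the overall population (sums to 1).
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

# Assumed sample drawn from one social network that skews young:
# age group -> (number of respondents, number who answered "yes").
sample = {"18-29": (600, 420), "30-49": (300, 150), "50+": (100, 30)}


def raw_rate(sample):
    """Share of 'yes' answers, treating the sample as if it were representative."""
    total = sum(n for n, _ in sample.values())
    yes = sum(y for _, y in sample.values())
    return yes / total


def poststratified_rate(sample, population_share):
    """Weight each group's 'yes' rate by that group's share of the population."""
    return sum(population_share[g] * (y / n) for g, (n, y) in sample.items())


print(f"Raw estimate:        {raw_rate(sample):.2f}")
print(f"Reweighted estimate: {poststratified_rate(sample, population_share):.2f}")
```

With these made-up figures the raw estimate is 0.60 while the reweighted estimate is 0.45, a gap produced entirely by who happens to be in the sample. Reweighting of this kind can only correct for characteristics that were actually measured; it does nothing about differences between users and non-users that nobody recorded.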