Google’s Counts: Worth Every Penny You Pay For Them


The Internet is theoretically a statistician’s dream. Let’s hope it’s not a nightmare. In our March 10, 2014 post about the irreproducible results of an Ngram search, we warned that nothing prevents Google from changing their definitions or conventions … and not telling us about them. Since they tell us precious little, it seems wise not to base important conclusions or critical decisions solely on any relatively lengthy history of the counts data. And “relatively lengthy” may be as short as a month or a quarter, because it is easy for Google to change their mind and their software.

This was brought to our attention in the December 21 New York Times by economist Seth Stephens-Davidowitz, who apparently makes a career of analyzing counts produced by Google searches of certain key words, along with survey data collected by other surveyors. Overall, the New York Times article showed mostly upbeat behavior during the holiday season, which one would hope for. Whether the annual trends are accurate or not, likely only Google knows for sure. We are not opining that Google is doing anything malicious in making their changes; they may all be made with the goal of improved accuracy and usability. But without more transparency we will never know.
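If you do rely on such counts, one defensive habit is to save a dated snapshot of whatever Google reports today, so a later re-run can show whether the “history” has quietly changed underneath you. Below is a minimal sketch of that idea in Python using the third-party pytrends package (an unofficial scraper of Google Trends, not a Google API); the search term and time window are illustrative assumptions on our part.

```python
# Minimal sketch: snapshot Google Trends counts today so a later re-run can be
# compared against them. pytrends is an UNOFFICIAL third-party scraper and can
# break whenever Google changes things -- the very risk discussed above.
import datetime

from pytrends.request import TrendReq  # pip install pytrends

def snapshot_trends(term, timeframe="2014-01-01 2014-12-31"):
    pytrends = TrendReq(hl="en-US")
    pytrends.build_payload([term], timeframe=timeframe)
    df = pytrends.interest_over_time()   # relative search interest, scaled 0-100
    stamp = datetime.date.today().isoformat()
    path = f"trends_{term}_{stamp}.csv"
    df.to_csv(path)                       # dated snapshot for later comparison
    return path

if __name__ == "__main__":
    # "gift" is purely an illustrative holiday-season term, not one from the article.
    print("Saved snapshot to", snapshot_trends("gift"))
```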

Be Careful When Using Quantitative Results from Any Google or YouTube Search

[Figure: “Who’s Bossy” graph composite]

There are scattered comments criticizing the criteria (and changes in them) that Google uses to include, and especially to rank, the results of their searches. It is tempting, but risky, to use these searches to quantify trends. We are not even sure that Google itself understands that this is happening. Given the origins of Google, and some of their early goals, we doubt that Google is intentionally trying to mislead people who use their Ngram search, though their use of vague terms to describe the number of documents is highly suspicious. In any case, ethical scientific practice requires that findings be REPRODUCIBLE. In the case of the Don’t Call Us Bossy article in the Wall Street Journal (a publication that should know better), which listed the search terms, we could not reproduce a set of curves showing the first peak during the 1930s, so the authors had no basis for drawing conclusions about trends in that time frame. The amount of bossiness in the 1930s is not the point here; the point is that we should be very careful to validate any supposed trend beyond what Google searches alone indicate.
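To make that validation concrete, here is a minimal sketch of how one might record an Ngram query so it can be re-run and diffed later. It uses the undocumented JSON endpoint behind the Ngram Viewer (books.google.com/ngrams/json); the corpus name, parameters, and the single search term below are assumptions on our part, and Google could change or remove the endpoint at any time, which is precisely the problem.

```python
# A sketch of recording an Ngram Viewer query for later reproducibility checks.
# NOTE: books.google.com/ngrams/json is an UNOFFICIAL, undocumented endpoint;
# the corpus name "en-2019" and other parameters are assumptions, and Google
# may change or drop any of this without notice.
import json
import datetime

import requests

def fetch_ngram(phrases, year_start=1900, year_end=2000,
                corpus="en-2019", smoothing=0):
    """Query the (unofficial) Ngram Viewer JSON endpoint and return its payload."""
    resp = requests.get(
        "https://books.google.com/ngrams/json",
        params={
            "content": ",".join(phrases),
            "year_start": year_start,
            "year_end": year_end,
            "corpus": corpus,
            "smoothing": smoothing,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # list of {"ngram": ..., "timeseries": [...]} records

if __name__ == "__main__":
    # Illustrative term only; substitute whatever terms the article under scrutiny listed.
    result = fetch_ngram(["bossy"], year_start=1900, year_end=2000)
    stamp = datetime.date.today().isoformat()
    with open(f"ngram_snapshot_{stamp}.json", "w") as f:
        json.dump(result, f, indent=2)  # dated snapshot to diff against a future re-run
```

If a re-run months later produces curves that no longer match the saved snapshot, that by itself tells you the underlying counts (or their definitions) have shifted, and any trend story built on the earlier curves deserves another look.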