So Google recently released the Google Books Ngram Viewer, along with the underlying datasets.

There’s been plenty of press about it, and the Science article based on this data is an interesting read.

I was trying to come up with a simple yet insightful query. My first attempt was modernism,postmodernism, which immediately had me wondering about hyphenation or the lack thereof… In any case, the upshot seems to be that the term postmodernism came into use around 1978. Neat, though I don’t think I’ll need to clear space for my Nobel Prize anytime soon.

I toyed a little with other terms like generation X, which has an odd sort of bump in the graph around 1970. I’m not sure what’s up with that, though perhaps it’s a data collection artifact as discussed in this article. I wasn’t inclined to go off the deep end on this, and was happy enough to have my prior knowledge confirmed: the use of “generation X” took off in the mid-1990s.

My final trial was a bit more on the minimal side: one,two,three,four,five,six,seven,eight,nine,ten. It should be no surprise that “one” is more common than “two”, which is more common than “three”, which is more common than “four”. It probably shouldn’t be a surprise either that each succeeding number is less frequent by roughly a factor of 2.
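To put the "roughly a factor of 2" remark in concrete terms, here is a minimal sketch of how one might check it. The numbers are synthetic placeholders that halve at each step, not values taken from the Ngram data, and the helper name `successive_ratios` is made up for illustration.

```python
def successive_ratios(freqs):
    """Return freqs[i] / freqs[i + 1] for each adjacent pair of frequencies."""
    return [a / b for a, b in zip(freqs, freqs[1:])]


# Synthetic placeholder values, NOT real Ngram data: each is half the previous one.
words = ["one", "two", "three", "four"]
synthetic = [1.0 / 2 ** i for i in range(len(words))]

pairs = zip(words, words[1:])
for (w1, w2), ratio in zip(pairs, successive_ratios(synthetic)):
    print(f"{w1} / {w2} = {ratio:.1f}")
```

With real frequencies pulled from the downloadable datasets, ratios hovering near 2 would back up the eyeball estimate from the graph.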

Google Books Ngram Viewer for numbers

Less intuitive (to me anyway) is that “ten” squeezes in front of “seven” and “eight” (OK, so maybe it’s a round number), and that “seven” and “eight” are basically tied. Odder still, before 1790 or so, the putative occurrences of “six” and “seven” are virtually non-existent.

It turns out this appears to be the same “medial S” issue that Danny Sullivan describes in greater detail in his post. Older printing used the long s (ſ), which OCR tends to read as an “f”, so “six” gets counted as “fix”. In other words, it’s an artifact of OCR and an indication of the evolution of typography rather than the evolution of language.
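As a rough illustration of the effect, here is a small sketch that fakes the misreading. It assumes a simplified typesetting rule (every lowercase "s" except a word-final one is set as the long s, ſ) and an OCR engine that always reads ſ as "f". Real typography and real OCR are messier than this, and the function names are invented for the example.

```python
LONG_S = "\u017f"  # the long s character, ſ


def to_long_s(word: str) -> str:
    """Simplified rule: every lowercase 's' except a word-final one becomes ſ."""
    if not word:
        return word
    body, last = word[:-1], word[-1]
    return body.replace("s", LONG_S) + last


def ocr_misread(word: str) -> str:
    """Mimic an OCR engine that mistakes the long s for 'f'."""
    return to_long_s(word).replace(LONG_S, "f")


for w in ["six", "seven", "eight", "less"]:
    print(w, "->", ocr_misread(w))
```

Under these assumptions “six” comes out as “fix” and “seven” as “feven”, while “eight” is untouched, which fits with “six” and “seven” dropping out of the early counts.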

One mystery solved; now why are “seven” and “eight” tied in frequency?

Kudos to Google for releasing the viewer and data.

