Blog Archive: 2010

When is one>two and seven==eight?

on Comments (1)

So Google recently released the Google Books N-gram viewer along with the datasets.

There’s been plenty of press about it, and the Science article based on this data is an interesting read.

I was trying to come up with a simple, yet insightful query. My initial trial was modernism,postmodernism, which immediately had me wondering about hyphenation or the lack thereof… In any case, the upshot seems to be that the use of the term postmodernism started 1978ish. Neat, though I think I won’t need to clear space for my Nobel Prize anytime soon.

I toyed a little bit with other terms like generation X, which has an odd sort of bump in the graph around 1970. Not sure what’s up with that, though perhaps it’s a data collection artifact, as discussed in this article. I wasn’t inclined to dig deeply into this, and was happy enough to have my prior knowledge confirmed by noting that the use of “generation X” took off in the mid-1990s.

My final trial was a bit more on the minimal side: one,two,three,four,five,six,seven,eight,nine,ten. There shouldn’t be any surprise here that “one” is more common than “two”, which is more common than “three”, which is more common than “four”. It probably shouldn’t be a surprise either that each succeeding number is less frequent by roughly a factor of 2.

Occurrence of numbers in the Google Books N-gram viewer

Less intuitive (to me anyway) is that “ten” squeezes in front of “seven” and “eight” (OK, so maybe it’s a round number), and that “seven” and “eight” are basically tied. Even more odd is that before 1790 or so, the putative occurrences of “six” and “seven” were virtually non-existent.

Detail on number occurrences

It appears to be the same issue with the “medial S” that Danny Sullivan describes in greater detail in his post. In other words, it’s an artifact of OCR and an indication of the evolution of typography rather than the evolution of language.
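To make the artifact concrete, here is a minimal sketch of what seems to be happening. Older books printed a non-final “s” as the long s (ſ), which OCR tends to read as “f”. The function name and the simplified rule (every lowercase “s” except a word-final one gets misread) are my own assumptions for illustration; real typographic practice had more exceptions, and this is not how Google’s OCR actually works.

```python
def ocr_misread(word: str) -> str:
    """Simulate OCR reading a long s as 'f'.

    Simplified, assumed rule: every lowercase 's' that is not the last
    letter of the word was printed as a long s and is misread as 'f'.
    """
    chars = []
    for i, ch in enumerate(word):
        is_final = i == len(word) - 1
        if ch == "s" and not is_final:
            chars.append("f")  # long s mistaken for f
        else:
            chars.append(ch)
    return "".join(chars)


# Number words from the query: only "six" and "seven" contain a non-final s.
for w in ["one", "two", "three", "four", "five",
          "six", "seven", "eight", "nine", "ten"]:
    print(w, "->", ocr_misread(w))
```

Under this rule “six” and “seven” turn into “fix” and “feven”, so their counts would be filed under the wrong words, while the other eight number words pass through untouched.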

One mystery solved; now why are “seven” and “eight” tied in frequency?

Kudos to Google for releasing the viewer and data.

A bit of a break


I will be taking some time off this month, and will not be posting regularly. Others may contribute in my absence.


on Comments (8)

Some members of Congress are proposing a bottom-up approach to determine which programs merit cutting. The idea is to draft cost-cutting legislation based on aggregating citizens’ opinions on what should be cut and what should be kept. One of the targets of this approach is the NSF, or, more precisely, the merits of some of the research funded by the NSF. The premise is that watchdog citizens will identify research that isn’t worth funding and will bring this to the attention of the House Committee on Science and Technology. The members of the committee will then, presumably, take action to save the taxpayers’ money.

What’s wrong with this picture?

Continue Reading

Google eBooks

on Comments (2)

So Google has unveiled its eBook store, setting itself up to compete with Amazon, Barnes & Noble, and everyone else selling books. Google offers its editions through the browser and on a range of devices such as Android phones and the iPad. The reading experience in the browser on my laptop was OK: not great, but the text was legible enough, and it would even switch to a two-page layout in a wide window. On the iPad, Google offers two choices: the browser, and a free app. The browser interface implements a swipe gesture for page turning, although there is no visible indication that it’s possible, nor any visual feedback until the page flips. The iPad app sports an animated page-turning transition, but does not have a two-page mode.

Continue Reading

Don’t go there

on Comments (4)

The field of information retrieval is inherently (some might say pathologically) data-driven. We need datasets to test algorithms, to compare systems, etc. This is all good. It’s particularly good to have data that are meaningful and relevant, because it makes it easier to motivate users and to generalize findings to data that people care about.

I expect that in the next few cycles of conference submissions, we will see a number of papers analyzing the “cable” data leaked by Bradley Manning to WikiLeaks. It’s a large enough dataset, with enough topical relevance, that it is sure to attract all sorts of analyses, much like the Enron email dataset did in 2004.

But there are some important differences.

Continue Reading

Revealing details

on Comments (4)

Thanks to Mor Naaman, I came across an interesting blog post by Justin O’Beirne that analyzed the graphic design of several different maps (Google, Bing, and Yahoo) to show why Google’s maps tend to appear easier to read and to use. The gist of the analysis is that legibility is improved through a number of graphical techniques that in combination produce a significant visual effect.

And of course knowing Google, this stuff was tested and tested and tested to get the right margins around text, the right gray scale for the labels, the right label density, etc.

So why did Justin have to reverse-engineer this work to understand it?

Continue Reading

No such thing as bad press?

on Comments (2)

A recent NY Times article exposed the machinations of a sleazy guy who ran an online business that relied on links — positive, negative, whatever — to his web site that caused it to be promoted in Google search results. In fact, he found that by being nasty to his customers, his rankings improved.

The Times article implies that it was his customers’ negative comments that drove up his PageRank score, but Get Satisfaction (at least one of the sites on which many of the comments were posted) claims that they mark links with the “rel=nofollow” attribute, which removes that link from PageRank considerations.
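For readers who haven’t looked at the markup, here is a rough sketch of how a crawler might separate links that count from links marked “nofollow”. It uses only Python’s standard html.parser; the class name, function name, and example.com URLs are hypothetical, and this is an illustration of the attribute, not a description of how Google actually processes links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect hrefs, split by whether the link carries rel=nofollow."""

    def __init__(self):
        super().__init__()
        self.followed = []   # links a ranking algorithm could count
        self.nofollow = []   # links the site asked crawlers to ignore

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens, e.g. "external nofollow"
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def classify_links(html: str):
    """Return (followed, nofollow) lists of hrefs found in the HTML."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.followed, collector.nofollow


# Hypothetical page: a customer complaint and an ordinary editorial link.
sample = """
<p>Terrible service! See <a href="http://shop.example.com/" rel="nofollow">this store</a>.</p>
<p>Background in <a href="http://news.example.org/story">this article</a>.</p>
"""
followed, nofollow = classify_links(sample)
print("counted:", followed)
print("ignored:", nofollow)
```

If the complaint sites really do mark their links this way, the angry comments land in the “ignored” list and contribute nothing to the ranking.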

So why was he as successful as the article makes it seem?

Continue Reading

Limitations of wisdom


Panos Ipeirotis published a nice analysis of the independence assumption of Surowiecki’s Wisdom of Crowds theory. In short, he finds that in some cases independence is necessary, in some cases it seems that some information leakage doesn’t hurt, and in another class of circumstances, pooling information leads to reliably better outcomes than independent guessing.
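As a back-of-the-envelope illustration of why independence matters for simple averaging (this is a textbook variance formula, not Panos’s analysis), consider n guesses that each have standard deviation sigma and whose errors share a pairwise correlation rho. The function name and the example numbers below are my own assumptions.

```python
def variance_of_crowd_mean(n: int, sigma: float, rho: float) -> float:
    """Variance of the average of n equally noisy guesses.

    Each guess has standard deviation sigma, and every pair of guesses
    has error correlation rho (rho = 0 means fully independent guesses).
    Var(mean) = sigma^2 / n * (1 + (n - 1) * rho)
    """
    return (sigma ** 2 / n) * (1 + (n - 1) * rho)


# 100 guessers, each off by about 10 units on their own.
for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho}: variance of the average = "
          f"{variance_of_crowd_mean(100, 10, rho)}")
```

With independent guesses the variance of the average shrinks by a factor of n; as the errors become correlated, the benefit of adding more people fades, and with perfect correlation the crowd is no better than one person. This only covers the averaging case, so it says nothing about the situations where pooling information helps.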

What’s going on?

Continue Reading

Technology and education

on Comments (9)

Scott McLeod’s MindDump blog featured a set of pie charts reflecting professors’ use of technology. The charts are reproduced from a piece in the Chronicle of Higher Education, and are based on a survey of about 4,600 professors from 50 universities, collected in the spring of 2009. The piece cites, but does not link to, the actual study results. Some poking around turned up the FSSE site, but I was unable to find the cited data there. The closest I found was a page reporting on the use of communication technologies, which seemed to reflect different numbers of respondents.

Nonetheless, assuming that the data are not bogus, we can ask some questions about what this means.

Continue Reading

An exploration of cross-media interaction

on Comments (1)

One of FXPAL’s papers at the ACM Multimedia conference this year describes FACT, an interactive paper system for fine-grained interaction with documents. The FACT system consists of a small camera-projector unit, a laptop, and ordinary paper documents. The system works as follows: a user makes pen gestures on a paper document in view of the camera-projector unit. FACT processes these gestures to select fine-grained content and to apply various digital functions. For example, the user can choose individual words, symbols, figures, and arbitrary regions for keyword search, copy and paste, web search, and remote sharing. FACT thus enables a computer-like user experience on paper. This paper interaction can be integrated with laptop interaction for cross-media manipulations on multiple documents and views. FACT can be used in application areas such as document manipulation, map navigation, and remote collaboration.