I just attended ACM Multimedia 2009 in Beijing to present a paper on image annotation in the workshop on Large Scale Multimedia Retrieval and Mining. The multimedia research community is grappling with a dramatic increase in the scale of its information management problems in an era of rapid growth in user-generated content and negligible distribution costs (e.g., YouTube and flickr). The workshop itself devoted attention to both retrieval and mining, while the content track of the main conference seemed to be dominated by search applications.
When the observation is made that tagged multimedia data is now freely and abundantly available, it's usually to motivate papers on media search rather than annotation. This is in part due to the challenges of adapting established model-based annotation methods to large media collections and large tag sets. Conversely, search-based annotation achieves scalability at the expense of accuracy, at least in comparison to model-based approaches. Our workshop paper sought to combine the efficiency of search-based approaches with the accuracy afforded by model-based classification.
The paper is a preliminary study that aims to address two basic questions. First, can we exploit large training sets harvested from sites like flickr to improve the accuracy of annotation systems? Second, can we improve accuracy using these large data sets without prohibitive computational complexity at annotation time? Our annotation system combines a scalable search-based front end that uses nearest-neighbor classification with a lightweight model-based back end that uses boosting. On the first question, our accuracy shows log-linear improvement with the number of training photos used, and is competitive with conventional SVM-based approaches. On the second question, by design our system exhibits negligible increases in computational complexity at annotation time even as the number of training photos grows by a factor of roughly 100, and it is very efficient relative to conventional methods.
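To make the two-stage design concrete, here is a minimal sketch of the pipeline shape described above: a nearest-neighbor front end narrows a large tag vocabulary down to the tags seen on a query's nearest training photos, and a per-tag back-end model then scores only those candidates. This is an illustration of the architecture, not the paper's implementation; the feature space, distance metric, and `tag_classifiers` (stand-ins for the boosted per-tag models) are all assumptions.

```python
import numpy as np

def knn_candidates(query, train_feats, train_tags, k=5):
    """Search-based front end: find the k nearest training photos by
    Euclidean distance in feature space and pool their tags as the
    candidate tag set for the query."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    candidates = set()
    for i in nearest:
        candidates.update(train_tags[i])
    return nearest, candidates

def annotate(query, train_feats, train_tags, tag_classifiers, k=5):
    """Two-stage annotation: the kNN search restricts scoring to the
    candidate tags, so annotation cost depends on k and the candidate
    set size rather than on the full training-set or vocabulary size.
    tag_classifiers is a hypothetical dict mapping each tag to a
    scoring function (a stand-in for a boosted classifier)."""
    _, candidates = knn_candidates(query, train_feats, train_tags, k)
    scores = {t: tag_classifiers[t](query) for t in candidates}
    return sorted(scores, key=scores.get, reverse=True)
```

The key property this sketch shares with the paper's design is that the model-based back end never evaluates classifiers for tags absent from the retrieved neighbors, which is what keeps annotation-time cost roughly flat as the training set grows.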
More details can be found in the paper.