Blog Archive: 2014

Introducing cemint


At FXPAL we have long been interested in how multimedia can improve our interaction with documents, from using media to represent and help navigate documents on different display types, to digitizing physical documents and linking media to them.


In an ACM interactions piece published this month we introduce our latest work in multimedia document research. Cemint (for Component Extraction from Media for Interaction, Navigation, and Transformation) is a set of tools to support seamless intermedia synthesis and interaction. In our interactions piece we argue that authoring and reuse tools for dynamic, visual media should match the power and ease of use of their static textual media analogues. Our goal with this work is to allow people to use familiar metaphors, such as copy-and-paste, to construct and interact with multimedia documents.

Cemint applications will span a range of communication methods. Our early work focused on support for asynchronous media extraction and navigation, but we are currently building a tool using these techniques that can support live, web-based meetings. We will present this new tool at DocEng 2014 — stay tuned!

To cluster or to hash?


Over the last decade or so, visual search has converged on a basic processing pipeline built on the “bag of visual words” representation of local image descriptors. You know it’s established when it’s in Wikipedia. There has been a steady stream of work on image matching that combines this representation with approximate nearest neighbor search and various downstream geometric verification strategies.

In practice, the most computationally daunting stage can be the construction of the visual codebook, which is usually accomplished via k-means or tree-structured vector quantization. The problem is to cluster (possibly billions of) local descriptors, and this offline clustering may need to be repeated whenever the image database changes significantly. Each descriptor cluster is represented by one element of a visual vocabulary (codebook), and each image is in turn represented by a bag (vector) of these visual words (quantized descriptors).
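As a rough illustration of this offline step, here is a minimal Python sketch (not FXPAL's pipeline; the vocabulary size, the use of scikit-learn's MiniBatchKMeans, and the descriptor arrays are illustrative assumptions) of clustering local descriptors into a codebook and quantizing one image into a bag-of-visual-words histogram:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(all_descriptors, vocab_size=10000):
    # Offline step: cluster local descriptors (an n x d float array pooled from
    # the whole database) into vocab_size visual words. This is the expensive
    # stage that may need to be redone when the database changes substantially.
    kmeans = MiniBatchKMeans(n_clusters=vocab_size, batch_size=10000, random_state=0)
    kmeans.fit(all_descriptors)
    return kmeans

def bag_of_visual_words(kmeans, image_descriptors):
    # Online step: assign each of one image's descriptors to its nearest cluster
    # center and return a normalized histogram over the visual vocabulary.
    words = kmeans.predict(image_descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)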

Building on previous work on high-accuracy scalable visual search, a recent FXPAL paper at ACM ICMR 2014 proposes Vector Quantization Free (VQF) search, which uses projective hashing in combination with binary-valued local image descriptors. Recent years have seen the development of binary descriptors, such as ORB and BRIEF, that improve efficiency with negligible loss of accuracy in various matching scenarios. Rather than clustering descriptors harvested globally from the image database, the codebook is implicitly defined via projective hashing: subsets of the bits of each ORB descriptor are hashed by projection (i.e., all but a small number of bits are discarded) to form an index table, as shown below.

[Figure: constructing a VQF index table by projective hashing of binary descriptors]

Multiple such tables are built, and image matching is then implemented by a voting scheme based on the number of collisions (i.e., partial matches) between the descriptors of a test image and those of each database image.
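To make the mechanism concrete, here is a minimal Python sketch of projective hashing and collision voting as I read it from the description above (my own illustrative reconstruction, not the paper's code; the number of tables, the bits kept per projection, and the descriptor encoding are assumed values):

import random
from collections import defaultdict

DESC_BITS = 256        # ORB descriptors are 256-bit binary strings
NUM_TABLES = 8         # assumed number of index tables
BITS_PER_TABLE = 20    # assumed number of bits kept by each projection

random.seed(0)
# Each "projection" keeps a fixed random subset of bit positions and discards the rest.
PROJECTIONS = [random.sample(range(DESC_BITS), BITS_PER_TABLE) for _ in range(NUM_TABLES)]

def project(descriptor, projection):
    # descriptor: tuple of 0/1 bits; the kept bits form the hash key for one table.
    return tuple(descriptor[i] for i in projection)

def index_database(db_images):
    # db_images: dict mapping image_id -> list of binary descriptors.
    tables = [defaultdict(list) for _ in range(NUM_TABLES)]
    for image_id, descriptors in db_images.items():
        for d in descriptors:
            for t, proj in enumerate(PROJECTIONS):
                tables[t][project(d, proj)].append(image_id)
    return tables

def query(tables, query_descriptors):
    # Each collision between a query descriptor and a database descriptor's hash
    # key casts one vote for that database image; top-voted images are returned first.
    votes = defaultdict(int)
    for d in query_descriptors:
        for t, proj in enumerate(PROJECTIONS):
            for image_id in tables[t].get(project(d, proj), []):
                votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])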

The paper presents experimental results on image databases that validate the expected significant gains in efficiency and scalability from the VQF approach. The results also show improved performance over some competitive baselines in near-duplicate image search. Interesting questions remain for future work in understanding the tradeoff between the size of the hash tables (governed by the number of bits projected) and the number of tables required to deliver a desired level of performance.