Blog Author: Matt Cooper

video text retouch


Several of us just returned from ACM UIST 2014, where we presented some new work as part of the cemint project.  One vision of cemint is to build applications for multimedia content manipulation and reuse that are as powerful as their analogues for text content.  We are working towards this goal by exploiting two key tools.  First, we want to use real-time content analysis to expose useful structure within multimedia content.  Given some decomposition of the content, which can be spatial, temporal, or even semantic, we then allow users to interact with these sub-units or segments via direct manipulation.  Last year, we began exploring these ideas in our work on content-based video copy and paste.

As another embodiment of these ideas, we demonstrated video text retouch at UIST last week.  Our browser-based system performs real-time text detection on streamed video frames to locate both words and lines.  When a user clicks on a frame, a live cursor appears next to the nearest word.  At this point, users can alter text directly using the keyboard.  When they do so, a video overlay is created to capture and display their edits.

Because we perform per-frame text detection, as the position of edited text shifts vertically or horizontally in the course of the original (unedited source) video, we can track the corresponding line’s location and update the overlaid content appropriately.
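To make the tracking idea concrete, here is a minimal Python sketch of re-anchoring edit overlays to per-frame text detections. All names and data shapes here are hypothetical illustrations, not the actual system's API (which runs in the browser):

```python
# Sketch: per-frame text detection returns line bounding boxes (x, y, w, h),
# and each user edit overlay is re-anchored to the nearest detected line on
# every new frame, so overlaid edits follow the text as it moves.

def nearest_line(lines, anchor):
    """Return the detected line whose center is closest to the edit's anchor."""
    def center(box):
        x, y, w, h = box
        return (x + w / 2, y + h / 2)
    return min(lines, key=lambda box: (center(box)[0] - anchor[0]) ** 2 +
                                      (center(box)[1] - anchor[1]) ** 2)

def update_overlays(frame_lines, overlays):
    """Move each edit overlay to track its line's position in the new frame."""
    for overlay in overlays:
        box = nearest_line(frame_lines, overlay["anchor"])
        overlay["position"] = (box[0], box[1])  # redraw the edited text here
        overlay["anchor"] = (box[0], box[1])    # re-anchor for the next frame
    return overlays
```

Because matching runs against every frame's detections, the overlay stays attached to its line even as the line shifts within the frame.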

By leveraging our familiarity with manipulating text, this work exemplifies our larger goal of bringing interaction metaphors rooted in content creation to enhance both the consumption and reuse of live multimedia streams.  We believe that integrating real-time content analysis and interaction design can help us create improved tools for multimedia content usage.

Information Interaction in Context 2014


I asked FXPAL alumnus Jeremy Pickens to contribute a post on the IIiX best paper award, which is named after our late colleague Gene Golovchinsky.  For me, the episode Jeremy recounts exemplifies Gene’s willingness and generosity in helping others work through research questions.  The rest of this post is written by Jeremy.

I recently had the opportunity to attend the Information Interaction in Context conference at the University of Regensburg.  It is a conference that attempts to bring together the systems and user perspectives on information retrieval and information seeking.  In short, it was exactly the type of conference at which our colleague Gene Golovchinsky was quite at home.  In fact, Gene had been one of the chairs of the conference before his passing last year, and the IIiX organizers made him an honorary chair.  During his time as chair, Gene secured FXPAL’s sponsorship of the conference, including the honorarium that accompanied the Best Paper award.  The conference organizers decided to officially give the award in Gene’s memory, and as a former FXPAL employee, I was asked to present the award and to say a few words about Gene.

I began by sharing who I knew Gene to be through the lens of our first meeting.  It was 1998.  Maybe 1999.  Let’s say 1998.  I was a young grad student in the Information Retrieval lab at UMass Amherst.  Gene had recently convinced FXPAL to sponsor my advisor’s Industrial Advisory Board meeting.  This meant that once a year, the lab would put together a poster session to give IAB members a sneak preview of the upcoming research results before they appeared anywhere else.

Well, at that time, I was kind of an odd duck in the lab because I had started doing music information retrieval when most of my colleagues were working on text.  So there I am at the IAB poster session, with all these commercial, industry sponsors who have flown in from all over the country to get new ideas about how to improve their text search engines…and I’m talking about melodies and chords.  Do you know that look, when someone sees you but really does not want to talk with you?  When their eyes meet yours and then keep on scanning, as if to pretend they were looking past you the whole time?  For the first hour, that’s how I felt.

Until Gene.

Now, I’m fairly sure that he really was not interested in music IR.  But not only did Gene stop and hear what I had to say; he engaged.  Before I knew it, half an hour (or at least it felt like it) had passed, and I’d had one of those great, engaging Gene discussions that I would have a whole lot more of a few years later, when FXPAL hired me.  Complete with the full Gene eye twinkle at every new idea we batted around.  Gene had this way of conducting a research discussion in which he could both share (give) ideas to you and elicit ideas from you; I can only describe it as true collaboration.

After the conference dinner and presentation had concluded, a number of people approached me and shared very similar stories about their interactions with Gene.  And a number of people expressed the sentiment that they wished they’d had the opportunity to know him.

I should also note that the Best Paper award went to Kathy Brennan, Diane Kelly, and Jaime Arguello, for their paper on “The Effect of Cognitive Abilities on Information Search for Tasks of Varying Levels of Complexity”.  Neither I nor FXPAL had a hand in deciding who the best paper recipient was to be; that task went to the conference organizers.  But in what I find to be a touching coincidence, one of the paper’s authors, Diane Kelly, was actually Gene’s summer intern at FXPAL back in the early 2000s.  He touched a lot of people, and will be sorely missed.  I miss him.

LoCo: a framework for indoor location of mobile devices


Last year, we initiated the LoCo project on indoor location.  The LoCo page has more information, but our central goal is to provide highly accurate, room-level location information, enabling indoor location services that complement the GPS-based services available outdoors.

Last week, we presented our initial results on the work at Ubicomp 2014.  In our paper, we introduce a new approach to room-level location based on supervised classification.  Specifically, we use boosting in a one-versus-all formulation to enable highly accurate classification based on simple features derived from Wi-Fi received signal strength indicator (RSSI) measurements.  This approach offloads the bulk of the complexity to an offline training procedure, and the resulting classifier is simple enough to run directly on a mobile client.  We use a simple, robust feature set based on pairwise RSSI margins to address Wi-Fi RSSI volatility.

h_m(X) = \begin{cases} 1 & X(b_m^{(1)}) - X(b_m^{(2)}) \geq \theta_m \\ 0 & \text{otherwise} \end{cases}

The equation above shows an example weak learner, which simply compares the difference between two elements of an RSSI scan against a threshold.  The final strong classifier for each room is a weighted combination of a set of weak learners greedily selected to discriminate that room.  The feature is designed to express the ordering of the RSSI values observed for specific access points, with a flexible reliance on the difference between them; the threshold \theta_m is determined in training.  An additional benefit of this choice is that processing only the subset of the RSSI scan referenced by the selected weak learners further reduces the required computation.  Comparing against the kNN matching approach used in RedPin [Bolliger, 2008], our results show competitive performance with substantially reduced complexity.  The table below shows cross-validation results from the paper for two data sets collected in our office.  The classification time appears in the rightmost column.
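To illustrate the classifier structure, here is a small Python sketch of the weak learner h_m and its boosted one-versus-all combination. The access point names, weights, and thresholds are illustrative placeholders, not trained values; in the real system they come from the offline boosting procedure:

```python
# Sketch of the weak learner h_m(X) and its boosted combination. An RSSI
# scan X maps access-point identifiers to signal strengths; each weak
# learner compares the margin between two chosen APs against a threshold.

MISSING = -100  # filler RSSI for access points absent from a scan

def weak_learner(scan, bssid1, bssid2, theta):
    """h_m(X) = 1 if X(b1) - X(b2) >= theta, else 0."""
    margin = scan.get(bssid1, MISSING) - scan.get(bssid2, MISSING)
    return 1 if margin >= theta else 0

def room_score(scan, learners):
    """Strong classifier for one room: weighted sum of weak-learner votes.
    learners is a list of (alpha, bssid1, bssid2, theta) tuples produced
    by the offline greedy boosting selection."""
    return sum(alpha * weak_learner(scan, b1, b2, theta)
               for alpha, b1, b2, theta in learners)

def classify(scan, rooms):
    """One-vs-all decision: pick the room whose classifier scores highest."""
    return max(rooms, key=lambda room: room_score(scan, rooms[room]))
```

Note that only the access points named by some selected weak learner are ever read from the scan, which is the source of the computational savings mentioned above.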

We are excited about the early progress we’ve made on this project and look forward to building out our indoor location system in several directions in the near future.  More than that, we look forward to building new location-driven applications that exploit this technique while leveraging infrastructure (Wi-Fi networks) and devices (cell phones) we already use.

To cluster or to hash?


Visual search has developed a basic processing pipeline in the last decade or so on top of the “bag of visual words” representation based on local image descriptors.  You know it’s established when it’s in Wikipedia.  There’s been a steady stream of work on image matching using this representation in combination with approximate nearest neighbor search and various downstream geometric verification strategies.

In practice, the most computationally daunting stage can be the construction of the visual codebook which is usually accomplished via k-means or tree structured vector quantization.  The problem is to cluster (possibly billions of) local descriptors, and this offline clustering may need to be repeated when there are any significant changes to the image database.  Each descriptor cluster is represented by one element in a visual vocabulary (codebook).  In turn, each image is represented by a bag (vector) of these visual words (quantized descriptors).
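For concreteness, here is a pure-Python sketch of the quantization step: given a codebook (normally learned offline by k-means over the collected descriptors), each descriptor maps to its nearest center, and an image becomes a histogram of visual words. The toy data and function names are mine, purely illustrative:

```python
# Minimal sketch of the bag-of-visual-words representation: quantize each
# local descriptor to its nearest codebook center (a "visual word"), then
# represent the image as a histogram over the visual vocabulary.

def quantize(descriptor, codebook):
    """Return the index of the nearest codebook center (a visual word)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: sq_dist(descriptor, codebook[i]))

def bag_of_words(descriptors, codebook):
    """Represent an image as a histogram of its quantized descriptors."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[quantize(d, codebook)] += 1
    return hist
```

The expensive part in practice is not this lookup but producing the codebook itself, which is exactly the cost the VQF approach below avoids.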

Building on previous work on high accuracy scalable visual search, a recent FXPAL paper at ACM ICMR 2014 proposes Vector Quantization Free (VQF) search using projective hashing in combination with binary valued local image descriptors.  Recent years have seen the development of binary descriptors such as ORB or BRIEF that improve efficiency with negligible loss of accuracy in various matching scenarios.  Rather than clustering descriptors harvested globally from the image database, the codebook is implicitly defined via projective hashing.  Subsets of the elements of ORB descriptors are hashed by projection (i.e. all but a small number of bits are discarded) to form an index table, as below.


By creating multiple different tables, image matching is implemented by a voting scheme based on the number of collisions (i.e. partial matches) between the descriptors in a test image and those in a database image.
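The indexing and voting scheme can be sketched as follows. The bit subsets, table sizes, and names here are illustrative assumptions, not the paper's actual parameters:

```python
# Sketch of the VQF idea: hash each binary descriptor by projecting onto a
# small subset of its bit positions (discarding the rest), build several
# such tables with different subsets, and score each database image by
# counting collisions with the query image's descriptors.

def project(descriptor_bits, bit_subset):
    """Projective hash: keep only the selected bit positions."""
    return tuple(descriptor_bits[i] for i in bit_subset)

def build_tables(db_descriptors, bit_subsets):
    """Index database descriptors, given as (image_id, bits) pairs,
    into one hash table per bit subset."""
    tables = [dict() for _ in bit_subsets]
    for table, subset in zip(tables, bit_subsets):
        for image_id, bits in db_descriptors:
            table.setdefault(project(bits, subset), []).append(image_id)
    return tables

def vote(query_descriptors, tables, bit_subsets):
    """Count collisions (partial matches) per database image, over all tables."""
    votes = {}
    for bits in query_descriptors:
        for table, subset in zip(tables, bit_subsets):
            for image_id in table.get(project(bits, subset), []):
                votes[image_id] = votes.get(image_id, 0) + 1
    return votes
```

No offline clustering is needed: the hash keys themselves act as the implicit visual vocabulary, and adding images to the database is just a matter of inserting their descriptors into the tables.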

The paper presents experimental results on image databases that validate the expected significant gains in efficiency and scalability from the VQF approach.  The results also show improved performance over some competitive baselines in near-duplicate image search.  Interesting questions remain for future work around the tradeoff between the size of the hash tables (governed by the number of bits projected) and the number of tables required to deliver a desired level of performance.

Copying and Pasting from Video


This week at the ACM Conference on Document Engineering, Laurent and Scott are presenting new work on direct manipulation of video.  The ShowHow project is our latest activity involving expository or “how to” video creation and use.  While watching videos of this genre, users often want to create annotations that identify useful frames or shots, either directly with ShowHow’s annotation capability or in a separate multimedia notes document.  The primary purpose of such annotation is for later reference, or incorporation into other videos or documents.  While browser history might be able to get you back to a specific video you watched previously, it won’t efficiently get you to a specific portion of a much longer source video, or provide you with the broader context in which you found that portion of the video noteworthy.  ShowHow enables users to create rich annotations around expository video that optionally include image, audio, or text to preserve this contextual information.

For creating these annotations, copy and paste functionality from the source video is desirable.  This could be selecting a (sub)frame as an image or even selecting text shown in the video.  We also demonstrate capturing dynamic activity across frames in a simple animated GIF for easy copy and paste from video to the clipboard.  There are interaction design challenges here; especially as more content is viewed on mobile/touch devices, direct manipulation provides a natural means for fine control of selection.

Under the hood, content analysis is required to identify events in the video to help drive the user interaction.  In this case, the analysis is implemented in JavaScript and runs in the browser in which the video is being played.  So efficient implementations of standard image analysis tools, such as region segmentation, edge detection, and region tracking, are required.  There’s a natural tradeoff between robustness and efficiency here that constrains the content processing techniques.
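As one example of the kind of standard tool involved, here is a pure-Python sketch of a Sobel edge-strength pass over a grayscale frame. The production analysis runs in JavaScript in the browser; this version is purely illustrative of the operation itself:

```python
# Sobel edge detection sketch: convolve a grayscale frame (a list of rows
# of pixel intensities) with the two Sobel kernels and approximate the
# gradient magnitude as |gx| + |gy| at each interior pixel.

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))

def sobel_magnitude(frame):
    """Return an edge-strength map; borders are left at zero."""
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = frame[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = abs(gx) + abs(gy)
    return out
```

Even this simple pass is O(pixels) per frame, which hints at the robustness-versus-efficiency tradeoff: anything run per frame in the browser must stay cheap.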

The interaction enabled by the system is probably best described in the video below:

Video Copy and Paste Demo

Go find Scott or Laurent in Florence or contact us for more information.



Artificial intelligence has always struck me as a fittingly modest name, as I emphasize the artifice over the intelligence. Watson, a question-answering system, has recently been playing Jeopardy against humans to test the “DeepQA hypothesis”:

The DeepQA hypothesis is that by complementing classic knowledge-based approaches with recent advances in NLP, Information Retrieval, and Machine Learning to interpret and reason over huge volumes of widely accessible naturally encoded knowledge (or “unstructured knowledge”) we can build effective and adaptable open-domain QA systems. While they may not be able to formally prove an answer is correct in purely logical terms, they can build confidence based on a combination of reasoning methods that operate directly on a combination of the raw natural language, automatically extracted entities, relations and available structured and semi-structured knowledge available from for example the Semantic Web.

As a researcher, I’m excited at the milestone this represents.


rumblings in the times


I read newspapers (seriously, print newspapers) as they pile up around my house.  A nice thing about such piles is that they don’t admit order, producing serendipitous juxtapositions (I should credit my children at this point). “The Data-Driven Life” is an article by a Wired writer that looks into wearable computing and how the ability to outfit oneself with sensors might better inform decisions and behavioral strategies. By my reading, it was a basically positive take on the application of technology to help people live better lives on their own terms, whatever those might be.

Next I came across “Hooked on Gadgets, and Paying a Mental Price,” which took a fairly negative slant, ranging somewhere between blaming technology for diminishing our quality of life and attributing to it irreversible neurological damage.

Is Computer Science so different?


There was an interesting article in CACM discussing an idiosyncrasy of computer science I’ve never totally wrapped my head around. Namely, conferences are widely considered higher-quality publication venues than journals. Citation statistics in the article bear this perception out. My bias towards journals reflects my background in electrical engineering. But I still find it curious, having now spent more time as an author and reviewer for both ACM conferences and journals.

I think that journals should contain higher-quality work. First, in the general case, there is no submission deadline, and page limits are less restrictive. This should mean that authors submit their work when they feel it is ready, and they presumably can detail and validate it with greater completeness. Second, the review process is usually more relaxed. When I review for conferences, I am dealing with several papers in batch mode; for journals, papers are usually reviewed in isolation. When the conference PC meets, the standards become relative: the best N papers get in, regardless of whether the N-1 or N+1 best paper really deserved it, as N is often predetermined.

Is this a good thing? Is CS that different from (all?) other fields that value journals more? On the positive side, there’s immense value in getting work out faster (journals’ downside being their publication lag) and in directly engaging the research community in person. No journal can stand in front of an audience to plead its case to be read (with PowerPoint animations, no less). And this may better adapt to a rapidly changing research landscape.  On the other hand, we may be settling for less complete work. If conferences become the preferred publication venue, then the eight-to-ten-page version could be the last word on some of the best research.  Or it may simply be a tendency towards quantity at the expense of quality. Long ago (i.e. when I was in grad school), a post-doc in the lab told me that if I produced one good paper a year, I should be satisfied with my work. I’m not sure that would pass for productivity in CS research circles today.

And this dovetails with characterizations of the most selective conferences in the article and elsewhere. Many of the most selective conferences are perceived to prefer incremental advances over less conventional but potentially more innovative approaches.  The analysis reveals that conferences with a 10-15% acceptance rate have less impact than those with a 15-20% rate. So if this is the model we will adopt, it still needs some hacking…

Large Scale Image Annotation


I just attended ACM Multimedia 2009 in Beijing to present a paper on image annotation in the workshop on Large Scale Multimedia Retrieval and Mining. The multimedia research community is grappling with a dramatic increase in the scale of its information management problems in an era of rapid growth in user-generated content and negligible distribution costs (e.g. YouTube and Flickr).  The workshop itself devoted attention to both retrieval and mining, while the content track of the main conference seemed to be dominated by search applications.

When the observation is made that tagged multimedia data is now freely and abundantly available, it’s usually to motivate papers on media search rather than annotation.  This is in part due to the challenges of adapting established model-based annotation methods to large media collections and large tag sets.  In contrast, search-based annotation achieves scalability at the expense of accuracy, at least in comparison to model-based approaches.  Our workshop paper looked to combine the efficiency of search-based approaches with the accuracy afforded by model-based classification.