Blog Category: Uncategorized

DocEng 2015



FXPAL had two publications at DocEng 2015. The conference was in Lausanne, Switzerland.

“High-Quality Capture of Documents on a Cluttered Tabletop with a 4K Video Camera”

“Searching Live Meetings: ‘Show me the Action’”

Some observations from FXPAL colleagues

Jean Paoli, co-author of XML, opened the DocEng 2015 conference by taking us from the early days of SGML all the way to JSON and Web Components, recalling OLE along the way. Jean believes in a future where documents and data are one, where documents are composed of manually authored chunks of content along with automatically produced components such as graphics and tables. He asked what kinds of user interfaces are required to produce these documents, how to consume them, and how to reuse their parts in turn.

In “The Browser as a Document Composition Engine”, Tamir and his colleagues from HP Labs explained that printing web pages is still a bad experience for most users. They developed a method to generate a beautifully formatted PDF version of a web page; the tool selects the article content, fits it into appropriate templates, and uses only the browser to measure how each character fits on the page. The output is a PDF, a ubiquitous format that makes the rendered page easy to print, and the result can also be previewed inside the web browser before printing. Decluttering web pages is still a manual or semi-automatic process in which users tag page elements before printing, but the authors promised an upcoming paper on that subject. Stay tuned.

Tokyo University also had an interesting take on improving document layout; instead of playing with character spacing to avoid orphans and word splits at the end of lines, they chose a Natural Language Processing (NLP) approach in which terms are replaced with synonyms (paraphrased) until the layout is free of errors. A nice way to tie NLP to document layout.
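The paraphrase-until-it-fits idea can be illustrated with a small sketch. This is not the authors' implementation: the synonym table, the single "orphan" check, and the function names `has_layout_error` and `paraphrase_to_fit` are all assumptions made for illustration, and line breaking is delegated to Python's `textwrap` rather than a real layout engine.

```python
import textwrap
from itertools import product

# Hypothetical synonym table; a real system would draw on an NLP paraphrase resource.
SYNONYMS = {
    "utilize": ["use"],
    "approximately": ["about", "roughly"],
    "demonstrate": ["show"],
}

def has_layout_error(text, width):
    """Flag a paragraph whose last wrapped line holds a single word (an orphan)."""
    lines = textwrap.wrap(text, width=width)
    return len(lines) > 1 and len(lines[-1].split()) == 1

def paraphrase_to_fit(text, width):
    """Try synonym substitutions until the wrapped paragraph has no orphan.

    The search is exhaustive, which is only reasonable for short paragraphs.
    The unmodified text is tried first, so clean paragraphs are left alone.
    """
    words = text.split()
    options = [[w] + SYNONYMS.get(w, []) for w in words]
    for candidate in product(*options):
        candidate_text = " ".join(candidate)
        if not has_layout_error(candidate_text, width):
            return candidate_text
    return text  # no error-free paraphrase found; keep the original
```

Calling `paraphrase_to_fit("we utilize approximately ten fonts", 20)` swaps in synonyms until the final line no longer holds a lone word. A production system would also need to check that each substitution preserves meaning.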


MixMeet: Live searching and browsing


Knowledge work is changing fast. Recent trends in increased teleconferencing bandwidth, the ubiquitous integration of “pads and tabs” into workaday life, and new expectations of workplace flexibility have precipitated an explosion of applications designed to help people collaborate from different places, times, and situations.

Over the last several months the MixMeet team observed and interviewed members of many different work teams in small-to-medium sized businesses that rely on remote collaboration technologies. In work we will present at ACM CSCW 2016, we found that despite the widespread adoption of frameworks designed to integrate information from a medley of devices and apps (such as Slack), employees utilize a surprisingly diverse but unintegrated set of tools to collaborate and get work done. People will hold meetings in one app while relying on another to share documents, or share some content live during a meeting while using other tools to put together multimedia documents to share later. In our CSCW paper, we highlight many reasons for this increasing diversification of work practice. But one issue that stands out is that videoconferencing tools tend not to support archiving and retrieving disparate information. Furthermore, tools that do offer archiving do not provide mechanisms for highlighting and finding the most important information.

In work we will present later this fall at ACM MM 2015 and ACM DocEng 2015, we describe new MixMeet features that address some of these concerns so that users can browse and search the contents of live meetings to rapidly retrieve previously shared content. These new features take advantage of MixMeet’s live processing pipeline to determine the actions users take inside live document streams. In particular, the system monitors text and cursor motion in order to detect text edits, selections, and mouse gestures. MixMeet applies these extra signals to user searches to improve the quality of retrieved results and to let users quickly filter a large archive of recorded meeting data to find relevant information.
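To make the idea of action-aware search concrete, here is a minimal sketch of how detected actions might boost and filter keyword matches. It is not MixMeet's code: the `Keyframe` record, the boost weights in `ACTION_BOOST`, and the scoring rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Keyframe:
    meeting_id: str
    timestamp: float
    text: str                                   # text recognized in the frame
    actions: set = field(default_factory=set)   # e.g. {"edit", "selection", "gesture"}

# Assumed weights: frames where someone acted on the content count for more.
ACTION_BOOST = {"edit": 2.0, "selection": 1.5, "gesture": 1.2}

def score(frame, query):
    """Count query-term matches, scaled by the strongest action seen in the frame."""
    terms = query.lower().split()
    words = frame.text.lower().split()
    matches = sum(words.count(term) for term in terms)
    if matches == 0:
        return 0.0
    boost = max((ACTION_BOOST.get(a, 1.0) for a in frame.actions), default=1.0)
    return matches * boost

def search(frames, query, required_action=None):
    """Return matching frames, best first, optionally filtered by an action type."""
    scored = [
        (score(f, query), f)
        for f in frames
        if required_action is None or required_action in f.actions
    ]
    scored = [(s, f) for s, f in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in scored]
```

With this scheme, a slide that someone edited while discussing "budget" outranks a slide that merely displayed the word, and `required_action="edit"` narrows a large archive to frames where edits occurred.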

In our ACM MM paper (and toward the end of the above video) we also describe how MixMeet supports table-top videoconferencing devices, such as Kubi. In current work, we are developing multiple tools to extend our support to other devices and meeting situations. Publications describing these new efforts are in the pipeline: stay tuned.



At FXPAL, we build and evaluate systems that make multimedia content easier to capture, access, and manipulate. In the Interactive Media group we are currently focusing on remote work and distributed meetings in particular. On one hand, meetings can be inefficient at best and a flat-out boring waste of time at worst. On the other hand, meetings have some key benefits, especially those that are more ad hoc and driven by specific, concrete goals. More and more meetings are held with remote workers via multimedia-rich interfaces (such as HipChat and Slack). These systems augment web-based communication with lightweight content sharing to reduce communication overhead while helping teams focus on immediate tasks.

We are developing a tool, MixMeet, to make lightweight, multimedia meetings more dynamic, flexible, and hopefully more effective. MixMeet is a web-based collaboration tool designed to support content interaction and extraction in both live, synchronous meetings and asynchronous group work. MixMeet is a pure web system that uses the WebRTC framework to create video connections. It supports live keyframe archiving and navigation, content-based markup, and the ability to copy and paste content to personal or shared notes. Each meeting participant can flexibly interact with all other clients’ shared screen or webcam content. A backend server can be configured to archive keyframes as well as record each user’s stream.

Our vision for MixMeet is to make it easy to mark up and reuse content from meetings, and to make collaboration over visual content a natural part of web-based conferencing. As you can see from the video below, we have made some progress toward this goal. However, we know there are many issues with remote, multimedia-rich work that we don’t yet fully understand. To that end, we are currently conducting a study of remote videoconferencing tools. If your group uses any remote collaboration tools with distributed groups, please fill out our survey.

on automation and tacit knowledge


We hear a lot about how computers are replacing even white collar jobs. Unfortunately, often left behind when automating these kinds of processes is tacit knowledge that, while perhaps not strictly necessary to generate a solution, can nonetheless improve results. In particular, many professionals rely upon years of experience to guide designs in ways that are largely invisible to non-experts.

One of these areas of automation is document layout or reflow in which a system attempts to fit text and image content into a given format. Usually such systems operate using templates and adjustable constraints to fit content into new formats. For example, the automated system might adjust font size, table and image sizes, gutter size, kerning, tracking, leading, etc. in different ways to match a loosely defined output style. These approaches can certainly be useful, especially for targeting output to devices with arbitrary screen sizes and resolutions. One of the largest problems, however, is that these algorithms often ignore what might have been a considerable effort by the writers, editors, and backshop designers to create a visual layout that effectively conveys the material. Often designers want detailed control over many of the structural elements that such algorithms adjust.

For this reason I was impressed with Hailpern et al.’s work at DocEng 2014 on document truncation and pagination for news articles. The authors’ systems analyze the text of an article to find pagination and truncation breakpoints that correspond to natural boundaries between high-level, summary content and more detailed content. This derives from an observation that journalists tend to write articles in “inverted pyramid” style, in which the most newsworthy, summary information appears near the beginning, with details toward the middle and background information toward the end. This is a critical observation in no small part because it means that popular newswriting bears little resemblance to academic writing. (Perhaps what sets this work apart from others is that the authors employed a basic tenet of human-computer interaction: the experiences of the system developer are a poor proxy for the experiences of other stakeholders.)

Foundry, which Retelny et al. presented at UIST 2014, takes an altogether different approach. This system, rather than automating tasks, helps bring diverse experts together in a modular, flexible way. The system helps the user coordinate the recruitment of domain experts into a staged workflow toward the creation of a complex product, such as an app or training video. The tool also allows rapid reconfiguration. One can imagine that this system could be extended to take advantage of not only domain experts but also people with different levels of expertise — some “stages” could even be automated. This approach is somewhat similar to the basic ideas in NudgeCam, in which the system incorporated general video guidelines from video-production experts, templates designed by experts in the particular domain of interest, novice users, and automated post hoc techniques to improve the quality of recorded video.

The goal of most software is to improve a product’s quality as well as the efficiency with which it is produced. We should keep in mind that this is often best accomplished not by systems designed to replace humans but rather by those developed to best leverage people’s tacit knowledge.

video text retouch


Several of us just returned from ACM UIST 2014 where we presented some new work as part of the cemint project.  One vision of the cemint project is to build applications for multimedia content manipulation and reuse that are as powerful as their analogues for text content.  We are working towards this goal by exploiting two key tools.  First, we want to use real-time content analysis to expose useful structure within multimedia content.  Given some decomposition of the content, which can be spatial, temporal, or even semantic, we then allow users to interact with these sub-units or segments via direct manipulation.  Last year, we began exploring these ideas in our work on content-based video copy and paste.

As another embodiment of these ideas, we demonstrated video text retouch at UIST last week.  Our browser-based system performs real-time text detection on streamed video frames to locate both words and lines.  When a user clicks on a frame, a live cursor appears next to the nearest word.  At this point, users can alter text directly using the keyboard.  When they do so, a video overlay is created to capture and display their edits.
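The click-to-cursor step can be sketched as a nearest-box lookup. This is an illustration only, written in Python rather than as the browser code the system actually uses; the `(x, y, width, height)` box format and the function names are assumptions.

```python
from math import hypot

def distance_to_box(px, py, box):
    """Distance from a point to an axis-aligned box (zero if the point is inside)."""
    x, y, w, h = box
    dx = max(x - px, 0, px - (x + w))
    dy = max(y - py, 0, py - (y + h))
    return hypot(dx, dy)

def nearest_word(click, words):
    """Pick the detected word closest to the click.

    `words` is a list of (text, (x, y, width, height)) pairs, as a text
    detector might report them for the current frame.
    """
    px, py = click
    return min(words, key=lambda word: distance_to_box(px, py, word[1]))
```

Measuring distance to the box rather than to its center keeps long words from being penalized when the user clicks near one end.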

Because we perform per-frame text detection, as the position of edited text shifts vertically or horizontally in the course of the original (unedited source) video, we can track the corresponding line’s location and update the overlaid content appropriately.
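A minimal sketch of that tracking step follows, again in Python for illustration. It assumes each frame yields a list of line bounding boxes as `(x, y, width, height)` tuples and that the edited line moves only a little between consecutive frames; the `max_shift` threshold and the overlay dictionary are hypothetical, not details from the system.

```python
from math import hypot

def center(box):
    x, y, w, h = box
    return (x + w / 2, y + h / 2)

def track_line(previous_box, detected_lines, max_shift=40.0):
    """Return the detected line closest to the previous position, or None if lost."""
    if not detected_lines:
        return None
    pcx, pcy = center(previous_box)

    def shift(box):
        cx, cy = center(box)
        return hypot(cx - pcx, cy - pcy)

    best = min(detected_lines, key=shift)
    return best if shift(best) <= max_shift else None

def update_overlay(overlay, detected_lines):
    """Move the overlay to follow its line; leave it in place if the line is lost."""
    new_box = track_line(overlay["box"], detected_lines)
    if new_box is None:
        return overlay
    return {**overlay, "box": new_box}
```

Running `update_overlay` once per frame keeps the edited text aligned with its source line as the video scrolls.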

By leveraging our familiarity with manipulating text, this work exemplifies the larger goal of bringing interaction metaphors rooted in content creation to both the consumption and reuse of live multimedia streams. We believe that integrating real-time content analysis and interaction design can help us create improved tools for multimedia content usage.

Do Topic-Dependent Models Improve Microblog Sentiment Estimation?


When estimating the sentiment of movie and product reviews, domain adaptation has been shown to improve sentiment estimation performance.  But when estimating the sentiment in microblogs, topic-independent sentiment models are commonly used.

We examined whether topic-dependent models improve performance when a large number of training tweets are available. We collected tweets with emoticons for six months and then created two types of topic-dependent polarity estimation models: models trained on tweets containing a target keyword, and models trained on an enlarged set of tweets containing terms related to a topic. We also created a topic-independent model trained on a general sample of tweets. When we compared the performance of the models, we noted that for some topics, topic-dependent models performed better, although for the majority of topics, there was no significant difference in performance between a topic-dependent and a topic-independent model.

We then proposed a method for predicting which topics are likely to have better sentiment estimation performance when a topic-dependent sentiment model is used. This method also identifies terms and contexts for which the term polarity often differs from the expected polarity. For example, ‘charge’ is generally positive, but in the context of ‘phone’, it is often negative. Details can be found in our ICWSM 2014 paper.
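The ‘charge’ example can be illustrated with a toy calculation. This is not the method from the paper: it simply assumes tweets have already been labeled +1 or -1 from their emoticons, and compares a term's average label overall with its average label when a context word co-occurs. The function name and the whitespace tokenization are assumptions.

```python
def polarity(tweets, term, context=None):
    """Mean emoticon label of tweets containing `term` (and `context`, if given).

    `tweets` is a list of (text, label) pairs with label +1 (positive emoticon)
    or -1 (negative emoticon). Returns None when no tweet matches.
    """
    labels = []
    for text, label in tweets:
        tokens = set(text.lower().split())
        if term in tokens and (context is None or context in tokens):
            labels.append(label)
    if not labels:
        return None
    return sum(labels) / len(labels)

def polarity_shift(tweets, term, context):
    """How far the in-context polarity moves from the term's overall polarity."""
    overall = polarity(tweets, term)
    in_context = polarity(tweets, term, context)
    if overall is None or in_context is None:
        return None
    return in_context - overall
```

A large negative shift for a (term, context) pair such as (‘charge’, ‘phone’) would flag a case where a topic-dependent model has something to gain over a topic-independent one.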

AirAuth: Authentication through In-Air Gestures Instead of Passwords


At the CHI 2014 conference, we demonstrated a new prototype authentication system, AirAuth, that explores the use of in-air gestures for authentication purposes as an alternative to password-based entry.

Previous work has shown that passwords or PINs as an authentication mechanism have usability issues that ultimately lead to a compromise in security. For instance, as the number of services to authenticate to grows, users use variations of basic passwords, which are easier to remember, thus making their accounts susceptible to attack if one is compromised.

On mobile devices, smudge attacks and shoulder surfing attacks pose a threat to authentication, as finger movements on a touch screen are easy to record visually and to replicate.

AirAuth addresses these issues by replacing password entry with a gesture. Motor memory makes it a simple task for most users to remember their gesture. Furthermore, since we track multiple points on the user’s hands, we obtain tracking information that is unique to the physical appearance of the legitimate user, so there is an implicit biometric built into AirAuth. Smudge attacks are averted by the touchless gesture entry, and a user study we conducted shows that AirAuth is also quite resistant to camera-based shoulder surfing attacks.
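As an illustration of how a gesture attempt might be compared against an enrolled template, here is a sketch using dynamic time warping (DTW). The post does not say which matching algorithm AirAuth uses, so DTW, the length normalization, and the threshold are all assumptions; each gesture is modeled as a sequence of tracked 3D points.

```python
from math import dist, inf

def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of points."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def authenticate(template, attempt, threshold):
    """Accept the attempt if its length-normalized DTW distance is small enough."""
    if not template or not attempt:
        return False
    normalized = dtw_distance(template, attempt) / max(len(template), len(attempt))
    return normalized <= threshold
```

DTW tolerates differences in speed between enrollment and authentication, which matters because users rarely repeat a gesture with identical timing. A real system would tune the threshold on enrollment data.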

Our demo at CHI showed the enrollment and authentication phases of our system. We gave attendees the opportunity to enroll in our system and check AirAuth’s capabilities to recognize their gestures. We got great responses from the attendees and obtained enrollment gestures from a number of them. We plan to use these enrollment gestures to evaluate AirAuth’s accuracy in field conditions.

Copying and Pasting from Video


This week at the ACM Conference on Document Engineering, Laurent and Scott are presenting new work on direct manipulation of video. The ShowHow project is our latest activity involving expository or “how to” video creation and use. While watching videos of this genre, it is helpful to create annotations that identify useful frames or shots, either by using ShowHow’s annotation capability directly or by creating a separate multimedia notes document. The primary purpose of such annotation is later reference, or incorporation into other videos or documents. While browser history might be able to get you back to a specific video you watched previously, it won’t efficiently get you to a specific portion of a much longer source video, or provide you with the broader context in which you found that portion of the video noteworthy. ShowHow enables users to create rich annotations around expository video that optionally include image, audio, or text to preserve this contextual information.

For creating these annotations, copy and paste functionality from the source video is desirable.  This could be selecting a (sub)frame as an image or even selecting text shown in the video.  Also, we demonstrate capturing dynamic activity across frames in a simple animated GIF for easy copy and paste from video to the clipboard.  There are interaction design challenges here, and especially as more content is viewed on mobile/touch devices, direct manipulation provides a natural means for fine control of selection.

Under the hood, content analysis is required to identify events in the video to help drive the user interaction. In this case, the analysis is implemented in JavaScript and runs in the browser in which the video is being played, so efficient implementations of standard image analysis tools such as region segmentation, edge detection, and region tracking are required. There’s a natural tradeoff between robustness and efficiency here that constrains the content processing techniques.

The interaction enabled by the system is probably best described in the video below:

Video Copy and Paste Demo

Go find Scott or Laurent in Florence or contact us for more information.

Mining the Video Past of Future Research: Is it worth a look?


Hi FXPAL blogosphere. Among the odds and ends I do at FXPAL is help people present their work with video. It also falls to me to archive the videos themselves. As I periodically move the video to new storage servers, I tend to look over “the old family album.” Our family is in the business of looking ahead at technology, so our album is pretty much all about that. Sometimes we hit, sometimes we miss. (One thing for sure about trying to make sense of the future is that the future’s judgement is pretty clear, once you get there.)

Among many other things, looking at family albums starts conversations. So here is the first installment in starting a blog conversation with these archive videos at the center. Where will it lead? Well, that’s what blogging is kind of good at, is it not?
