Blog Category: Research

DocHandles @ DocEng 2017

on Comments (0)

The conversational documents group at FXPAL is helping users interact with document content using the interface that best matches their current context and without worrying about the structure of underlying documents. With our system, users should be able to refer to figures, charts, and sections of their work documents seamlessly in a variety of collaboration interfaces to better communicate with their colleagues.


To achieve this goal, we are developing tools for understanding, repurposing, and manipulating document structure. The DocHandles work, which we will present at DocEng 2017, is a first step in this direction. With this tool a user can type, for example, “@fig2” into their multimedia chat tool to see a list of recommended figures extracted from recently shared documents. In this case, suggestions returned correspond to figures labeled “figure 2” in the most recently discussed documents in the chat, along with the document filename or title and caption. Users can then select their desired figure, which is automatically injected into the chat.

Please come see our presentation in Session 7 (User Interactions) at 17:45 on September 5th on to find out more about this system as well as some of our future plans for conversational documents.

Improving User Interfaces for Robot Teleoperation


The FXPAL robotics research group has recently explored technologies for improving the usability of mobile telepresence robots. We evaluated a prototype head-tracked stereoscopic (HTS) teleoperation interface for a remote collaboration task. The results of this study indicate that using a HTS systems reduces task errors and improves the perceived collaboration success and
viewing experience.

We also developed a new focus plus context viewing technique for mobile robot teleoperation. This allows us to use wide-angle camera images
that proved rich contextual visual awareness of the robot’s surroundings while at the same time preserving a distortion-free region
in the middle of the camera view.

To this, we added a semi-automatic robot control method that allows operators to navigate the telepresence robot via a pointing and clicking directly on
the camera image feed. This through-the-screen interaction paradigm has the advantage of decoupling operators from the robot control loop, freeing them for
other tasks besides driving the robot.

As a result of this work, we presented two papers at the IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). We obtained a best paper award for the paper “Look Where You’re Going: Visual Interfaces for Robot Teleoperation” in the Design category.

DocuGram at DocEng


Teleconferencing is now a nearly ubiquitous aspect of modern work. We routinely use apps such as Google Hangouts or Skype to present work or discuss documents with remote colleagues. Unfortunately, sharing source documents is not always as seamless. For example, a meeting participant might share content via screencast that she has access to, but that the remote participant does not. Remote participants may also not have the right software to open the source document, or the content shared might be only a small section of a large document that is difficult to share.

Later this week in Vienna, we will present our work at DocEng on DocuGram, a tool we developed at FXPAL to help address these issues. DocuGram can capture and analyze shared screen content to automatically reconstitute documents. Furthermore, it can capture and integrate annotations and voice notes made as the content is shared.

The first video below describes DocuGram, and the second shows how we have integrated it into our teleconferencing tool, MixMeet. Check it out, and be sure to catch our talk on Friday, September 16th at 10:00AM.

FXPAL at Mobile HCI 2016


Early next week, Ville Mäkelä and Jennifer Marlow will present our work at Mobile HCI on tools we developed at FXPAL to support distributed workers. The paper, “Bringing mobile into meetings: Enhancing distributed meeting participation on smartwatches and mobile phones”, presents the design, development, and evaluation of two applications, MixMeetWear and MeetingMate, that aim to help users in non-standard contexts participate in meetings.

The videos below show the basic functionality of the two systems. If you are in Florence for Mobile HCI, please stop by their presentation on Thursday, September 8, in the 2:00-3:30 session (in Sala Verde) to get the full story.


MixMeet: Live searching and browsing


Knowledge work is changing fast. Recent trends in increased teleconferencing bandwidth, the ubiquitous integration of “pads and tabs” into workaday life, and new expectations of workplace flexibility have precipitated an explosion of applications designed to help people collaborate from different places, times, and situations.

Over the last several months the MixMeet team observed and interviewed members of many different work teams in small-to-medium sized businesses that rely on remote collaboration technologies. In work we will present at ACM CSCW 2016, we found that despite the widespread adoption of frameworks designed to integrate information from a medley of devices and apps (such as Slack), employees utilize a surprisingly diverse but unintegrated set of tools to collaborate and get work done. People will hold meetings in one app while relying on another to share documents, or share some content live during a meeting while using other tools to put together multimedia documents to share later. In our CSCW paper, we highlight many reasons for this increasing diversification of work practice. But one issue that stands out is that videoconferencing tools tend not to support archiving and retrieving disparate information. Furthermore, tools that do offer archiving do not provide mechanisms for highlighting and finding the most important information.

In work we will present later this fall at ACM MM 2015 and ACM DocEng 2015, we describe new MixMeet features that address some of these concerns so that users can browse and search the contents of live meetings to retrieve rapidly previously shared content. These new features take advantage of MixMeet’s live processing pipeline to determine actions users take inside live document streams. In particular, the system monitors text and cursor motion in order to detect text edits, selections, and mouse gestures. MixMeet applies these extra signals to user searches to improve the quality of retrieved results and allow users to quickly filter a large archive of recorded meeting data to find relevant information.

In our ACM MM paper (and toward the end of the above video) we also describe how MixMeet supports table-top videoconferencing devices, such as Kubi. In current work, we are developing multiple tools to extend our support to other devices and meeting situations. Publications describing these new efforts are in the pipeline: stay tuned.

HMD and specialization


Google Glass’ semi-demise has become a topic of considerable interest lately. Alexander Sommer at WT-Vox takes the view that it was a courageous “public beta” and “a PR nightmare” but also well received in specialized situations where the application suits the device, as in Scott’s post below. IMO, a pretty good summary.

(I notice that Sony has jumped in with SmartEyeglass – which have been called “too dorky to be believed…” Still, one person’s “dorky” is another person’s “specialized.”)

Visually Interpreting Names as Demographic Attributes


In the AAAI 2015 conference, we presented the work “Visually Interpreting Names as Demographic Attributes by Exploiting Click-Through Data,” a collaboration with a research team in National Taiwan University. This study aims to automatically associate a name and its likely demographic attributes, e.g., gender and ethnicity. More specifically, the associations are driven by web-scale search logs that are collected via a search engine when internet users retrieve images.

Demographic attributes are vital to semantically characterize a person or a community. This makes it valuable for marketing, personalization, face retrieval, social computing and more human-centric research. Since users tend to keep their online profiles private, name is the most reachable piece of personal information among these contexts. The problem we address is – given a name, associating and predicting its likely demographic attributes. For example, given a person named “Amy Liu,” the person is likely an Asian female. Name makes the first impression of a person because naming conventions are strongly influenced by culture, e.g., first name and gender, last name and location of origin. Typically, the associations between names and the two attributes are made by referring to demographics maintained by governments or by manually labeling attributes based on the given personal information (e.g., photo). The former is limited in regional census data. The latter has major concerns in time and cost when it adapts to large-scale data.

Different from prior approaches, we propose to exploit click-throughs between text queries and retrieved face images in web search logs, where the names are extracted from queries and the attributes are detected from face images automatically. In this paper, a click-through means when one of the URLs returned by a text query has been clicked by a user to view a web image it directs to. The mechanism delivers two messages, (1) the association between a query and an image is based on viewers’ clicks, that is, human intelligence from web-scale users; (2) users may have considerable knowledge to the associations because they might be partially aware of what they are looking for and search engines are getting much better at satisfying user intent. Both characteristics of click-throughs reduce concerns of incorrect associations. Moreover, the Internet users’ knowledge enables discovering name-attribute associations with high generality to more countries.

In the experiments, the proposed name-attribute associations are demonstrated with competitive accuracy compared to using manual labeling. It also benefits profiling social media users and keyword-based face image retrieval, especially the adaption to unseen names. This is the first work to interpret a name to demographic attributes in visual-data-driven manner using web search logs. In the future, we are going to extend the visual interpretation of an abstract name to more targets for which naming conventions are highly influenced by visual appearance.

Using Stereo Vision to Operate Mobile Telepresence Robots


The use of mobile telepresence robots (MTRs) is increasing. Very few MTRs have autonomous navigation systems. Thus teleoperation is usually still a manual task, and often has user experience problems. We believe that this may be due to (1) the fixed viewpoint and limited field of view of a 2D camera system and (2) the capability of judging distances due to lack of depth perception.

To improve the experience of teleoperating the robot, we evaluated the use of stereo video coupled with a head-tracked and head-mounted display.

To do this, we installed a brushless gimbal with a stereo camera pair on a robot platform. We used an Oculus Rift (DK1) device for visualization and head tracking.

StereoBot and Gimbal.

Stereobot telepresence robot (left) and stereo gimbal system (right).

We conducted a preliminary user study to gather qualitative feedback about telepresence navigation tasks using stereo vs. a 2D camera feed, and high vs. low camera placement. In a simulated telepresence scenario, participants were asked to drive the robot from an office to a meeting location, have conversation with a tester, then drive back to the starting location.

An ANOVA on System Usability Scale (SUS) scores with visualization type and camera placement as factors results in a significant effect of visualization type on the score. However, we observed a higher SUS score for navigation based on a 2D camera feed. The camera placement height did not show a significant effect.

The following two main reasons could have caused the lower ratings for stereo: (1) about half of the users experienced at least some form of disorientation. This might have been due to their unfamiliarity with immersive VR headsets but also due the sensory distortion effect of being immersed visually in a moving environment while other bodily senses report sitting still. (2) the video transmission quality was not optimal due to interference of the analog video transmission signal by objects in the building and due to the relatively low display resolution of the Oculus Rift DK1 device.

In the future we intend to work on improving the visual quality of the stereo output by using better video transmission and head-worn display. We furthermore intend to evaluate robot navigation tasks using a full VR view. This view will make use of the robot’s sensors and localization system in order to display the robot correctly within a virtual representation of our building.

More evidence of the value of HMD capture


At next week’s CSCW 2015 conference, a group from University of Wisconsin-Madison will present an interesting piece of work related to the last post: “Handheld or Handsfree? Remote Collaboration via Lightweight Head-Mounted Displays and Handheld Devices”. Similar to our work, the authors compared the use of Google Glass to a tablet-based interface for two different construction tasks: one simple and one more complex. While in our case study participants created tutorials to be viewed at a later time, this test explored synchronous collaboration.

The authors found that Google Glass was helpful for the more difficult task, enabling better and more frequent communication, while for the simpler task the results were mixed. This more-or-less agrees with our findings: HMDs are helpful for capturing and communicating complicated tasks but less so for table-top tasks.

Another key difference between this work and ours is that the authors relied on Google Hangouts to stream videos. However, as the authors write, “the HMD interface of Google Hangouts used in our study did not offer [live preview feedback],” a key feature for any media capture application.

At FXPAL, we build systems when we are limited by off-the-shelf technology. So when we discovered a related capture feedback issue in early pilots we were able to quickly fix it in our tool. Of course in our case the technology was much simpler because we did not need to implement video streaming. However, since this paper was published we have developed mechanisms to stream video from Glass, or any Android device, using open WebRTC protocols. More than that, our framework can analyze incoming frames and then stream out arbitrary image data, potentially allowing us to implement many of the design implications the authors describe in the paper’s discussion section.

Head-mounted capture and access with ShowHow


Our IEEE Pervasive paper on head-mounted capture for multimedia tutorials was recently accepted and is currently in press. We are excited to share some our findings here.

Creating multimedia tutorials requires two distinct steps: capture and editing. While editing, authors have the opportunity to devote their full attention to the task at hand. Capture is different. In the best case, capture should be completely unobtrusive so that the author can focus exclusively on the task being captured. But this can be difficult to achieve with handheld devices, especially if the task requires that the tutorial author move around an object and use both hands simultaneously (e.g., showing how to replace a bike derailleur).

For this reason, we extended our ShowHow multimedia tutorial system to support head-mounted capture. Our first approach was simple: a modified pair of glasses with a Looxcie camera and laser guide attached. While this approach interfered with the user’s vision less than other solutions, such as a full augmented reality system, it nonetheless suffered from an array of problems: it was bulky, it was difficult to control, and without a display feedback of the captured area it was hard to frame videos and photos.

Picture2Our first head-mounted capture prototype

Luckily, Google Glass launched around this time. With an onboard camera, a touch panel, and display, it seemed an excellent choice for head-mounted capture.

Our video application to the Glass Explorers program

To test this, we built an app for Google Glass that requires minimal attention to the capture device and instead allows the author to focus on creating the tutorial content. In our paper, we describe a study comparing standalone capture (camera on tripod) versus head-mounted (Google Glass) capture. Details are in the paper, but in short we found that tutorial authors prefer wearable capture devices, especially when recording activities involving larger objects in non-tabletop environments.

The ShowHow Google Glass capture app

Finally, based on the success of Glass for capture we built and tested an access app as well. A detailed description of the tool, as well as another study we ran testing its efficacy for viewing multimedia tutorials, is the subject of an upcoming paper. Stay tuned.

The ShowHow Google Glass access app