# Using Stereo Vision to Operate Mobile Telepresence Robots

The use of mobile telepresence robots (MTRs) is increasing. Very few MTRs have autonomous navigation systems. Thus teleoperation is usually still a manual task, and often has user experience problems. We believe that this may be due to (1) the fixed viewpoint and limited field of view of a 2D camera system and (2) the reduced capability of judging distances due to lack of depth perception.

To improve the experience of teleoperating the robot, we evaluated the use of stereo video coupled with a head-tracked and head-mounted display.

To do this, we installed a brushless gimbal with a stereo camera pair on a robot platform. We used an Oculus Rift (DK1) device for visualization and head tracking.

Stereobot telepresence robot (left) and stereo gimbal system (right).

We conducted a preliminary user study to gather qualitative feedback about telepresence navigation tasks using stereo vs. a 2D camera feed, and high vs. low camera placement. In a simulated telepresence scenario, participants were asked to drive the robot from an office to a meeting location, have a conversation with a tester, then drive back to the starting location.

An ANOVA on System Usability Scale (SUS) scores with visualization type and camera placement as factors showed a significant effect of visualization type on the score. However, the higher SUS score was for navigation based on the 2D camera feed, not for stereo. The camera placement height did not show a significant effect.

Two main reasons could have caused the lower ratings for stereo: (1) About half of the users experienced at least some form of disorientation. This might have been due to their unfamiliarity with immersive VR headsets but also due to the sensory distortion effect of being immersed visually in a moving environment while other bodily senses report sitting still. (2) The video transmission quality was not optimal due to interference of the analog video transmission signal by objects in the building and due to the relatively low display resolution of the Oculus Rift DK1 device.

In the future we intend to work on improving the visual quality of the stereo output by using better video transmission and head-worn display. We furthermore intend to evaluate robot navigation tasks using a full VR view. This view will make use of the robot’s sensors and localization system in order to display the robot correctly within a virtual representation of our building.

# More evidence of the value of HMD capture

At next week’s CSCW 2015 conference, a group from the University of Wisconsin-Madison will present an interesting piece of work related to the last post: “Handheld or Handsfree? Remote Collaboration via Lightweight Head-Mounted Displays and Handheld Devices”. Similar to our work, the authors compared the use of Google Glass to a tablet-based interface for two different construction tasks: one simple and one more complex. While in our case study participants created tutorials to be viewed at a later time, this test explored synchronous collaboration.

The authors found that Google Glass was helpful for the more difficult task, enabling better and more frequent communication, while for the simpler task the results were mixed. This more-or-less agrees with our findings: HMDs are helpful for capturing and communicating complicated tasks but less so for table-top tasks.

Another key difference between this work and ours is that the authors relied on Google Hangouts to stream videos. However, as the authors write, “the HMD interface of Google Hangouts used in our study did not offer [live preview feedback],” a key feature for any media capture application.

At FXPAL, we build systems when we are limited by off-the-shelf technology. So when we discovered a related capture feedback issue in early pilots we were able to quickly fix it in our tool. Of course in our case the technology was much simpler because we did not need to implement video streaming. However, since this paper was published we have developed mechanisms to stream video from Glass, or any Android device, using open WebRTC protocols. More than that, our framework can analyze incoming frames and then stream out arbitrary image data, potentially allowing us to implement many of the design implications the authors describe in the paper’s discussion section.

# Head-mounted capture and access with ShowHow

Our IEEE Pervasive paper on head-mounted capture for multimedia tutorials was recently accepted and is currently in press. We are excited to share some of our findings here.

Creating multimedia tutorials requires two distinct steps: capture and editing. While editing, authors have the opportunity to devote their full attention to the task at hand. Capture is different. In the best case, capture should be completely unobtrusive so that the author can focus exclusively on the task being captured. But this can be difficult to achieve with handheld devices, especially if the task requires that the tutorial author move around an object and use both hands simultaneously (e.g., showing how to replace a bike derailleur).

For this reason, we extended our ShowHow multimedia tutorial system to support head-mounted capture. Our first approach was simple: a modified pair of glasses with a Looxcie camera and laser guide attached. While this approach interfered with the user’s vision less than other solutions, such as a full augmented reality system, it nonetheless suffered from an array of problems: it was bulky, it was difficult to control, and without a display feedback of the captured area it was hard to frame videos and photos.

Luckily, Google Glass launched around this time. With an onboard camera, a touch panel, and display, it seemed an excellent choice for head-mounted capture.

Our video application to the Glass Explorers program

To test this, we built an app for Google Glass that requires minimal attention to the capture device and instead allows the author to focus on creating the tutorial content. In our paper, we describe a study comparing standalone capture (camera on tripod) versus head-mounted (Google Glass) capture. Details are in the paper, but in short we found that tutorial authors prefer wearable capture devices, especially when recording activities involving larger objects in non-tabletop environments.

The ShowHow Google Glass capture app

Finally, based on the success of Glass for capture we built and tested an access app as well. A detailed description of the tool, as well as another study we ran testing its efficacy for viewing multimedia tutorials, is the subject of an upcoming paper. Stay tuned.

The ShowHow Google Glass access app

# MixMeet

At FXPAL, we build and evaluate systems that make multimedia content easier to capture, access, and manipulate. In the Interactive Media group we are currently focusing on remote work and distributed meetings in particular. On one hand, meetings can be inefficient at best and a flat-out boring waste of time at worst. On the other hand, there are some key benefits to meetings, especially those that are more ad hoc and driven by specific, concrete goals. More and more meetings are held with remote workers via multimedia-rich interfaces (such as HipChat and Slack).  These systems augment web-based communication with lightweight content sharing to reduce communication overhead while helping teams focus on immediate tasks.

We are developing a tool, MixMeet, to make lightweight, multimedia meetings more dynamic, flexible, and hopefully more effective. MixMeet is a web-based collaboration tool designed to support content interaction and extraction for use in both live, synchronous meetings as well as asynchronous group work. MixMeet is a pure web system that uses the WebRTC framework to create video connections. It supports live keyframe archiving and navigation, content-based markup, and the ability to copy-and-paste content to personal or shared notes. Each meeting participant can flexibly interact with all other clients’ shared screen or webcam content.  A backend server can be configured to archive keyframes as well as record each user’s stream.

Our vision for MixMeet is to make it easy to mark up and reuse content from meetings, and make collaboration over visual content a natural part of web-based conferencing. As you can see from the video below, we have made some progress toward this goal. However, we know there are many issues with remote, multimedia-rich work that we don’t yet fully understand. To that end, we are currently conducting a study of remote videoconferencing tools. If your group uses any remote collaboration tools with distributed groups please fill out our survey.

# on automation and tacit knowledge

We hear a lot about how computers are replacing even white collar jobs. Unfortunately, often left behind when automating these kinds of processes is tacit knowledge that, while perhaps not strictly necessary to generate a solution, can nonetheless improve results. In particular, many professionals rely upon years of experience to guide designs in ways that are largely invisible to non-experts.

One of these areas of automation is document layout or reflow in which a system attempts to fit text and image content into a given format. Usually such systems operate using templates and adjustable constraints to fit content into new formats. For example, the automated system might adjust font size, table and image sizes, gutter size, kerning, tracking, leading, etc. in different ways to match a loosely defined output style. These approaches can certainly be useful, especially for targeting output to devices with arbitrary screen sizes and resolutions. One of the largest problems, however, is that these algorithms often ignore what might have been a considerable effort by the writers, editors, and backshop designers to create a visual layout that effectively conveys the material. Often designers want detailed control over many of the structural elements that such algorithms adjust.

For this reason I was impressed with Hailpern et al.’s work at DocEng 2014 on document truncation and pagination for news articles. In these works, the authors’ systems analyze the text of an article to determine pagination and truncation breakpoints in news articles that correspond to natural boundaries in articles between high-level, summary content and more detailed content. This derives from an observation that journalists tend to write articles in “inverted pyramid” style in which the most newsworthy, summary information appears near the beginning with details toward the middle and background info toward the end. This is a critical observation in no small part because it means that popular newswriting bears little resemblance to academic writing. (Perhaps what sets this work apart from others is that the authors employed a basic tenet of human-computer interaction: the experiences of the system developer are a poor proxy for the experiences of other stakeholders.)

Foundry, which Retelny et al. presented at UIST 2014, takes an altogether different approach. This system, rather than automating tasks, helps bring diverse experts together in a modular, flexible way. The system helps the user coordinate the recruitment of domain experts into a staged workflow toward the creation of a complex product, such as an app or training video. The tool also allows rapid reconfiguration. One can imagine that this system could be extended to take advantage of not only domain experts but also people with different levels of expertise — some “stages” could even be automated. This approach is somewhat similar to the basic ideas in NudgeCam, in which the system incorporated general video guidelines from video-production experts, templates designed by experts in the particular domain of interest, novice users, and automated post hoc techniques to improve the quality of recorded video.

The goal of most software is to improve a product’s quality as well as the efficiency with which it is produced. We should keep in mind that this is often best accomplished not by systems designed to replace humans but rather by those developed to best leverage people’s tacit knowledge.

# video text retouch

Several of us just returned from ACM UIST 2014 where we presented some new work as part of the cemint project.  One vision of the cemint project is to build applications for multimedia content manipulation and reuse that are as powerful as their analogues for text content.  We are working towards this goal by exploiting two key tools.  First, we want to use real-time content analysis to expose useful structure within multimedia content.  Given some decomposition of the content, which can be spatial, temporal, or even semantic, we then allow users to interact with these sub-units or segments via direct manipulation.  Last year, we began exploring these ideas in our work on content-based video copy and paste.

As another embodiment of these ideas, we demonstrated video text retouch at UIST last week.  Our browser-based system performs real-time text detection on streamed video frames to locate both words and lines.  When a user clicks on a frame, a live cursor appears next to the nearest word.  At this point, users can alter text directly using the keyboard.  When they do so, a video overlay is created to capture and display their edits.

Because we perform per-frame text detection, as the position of edited text shifts vertically or horizontally in the course of the original (unedited source) video, we can track the corresponding line’s location and update the overlaid content appropriately.

By leveraging our familiarity with manipulating text, this work exemplifies the larger goal to bring interaction metaphors rooted in content creation to enhance both the consumption and reuse of live multimedia streams.  We believe that integrating real-time content analysis and interaction design can help us create improved tools for multimedia content usage.

# Ego-Centric vs. Exo-Centric Tracking and Interaction in Smart Spaces

In the recent paper published at SUI 2014, “Exploring Gestural Interaction in Smart Spaces using Head-Mounted Devices with Ego-Centric Sensing”, co-authored with Barry Kollee and Tony Dunnigan, we studied a prototype Head Mounted Device (HMD) that allows interaction with external displays through spatial gestures.

In the paper, one of our goals was to expand the scope of interaction possibilities on HMDs, which are currently severely limited, if we consider Google Glass as a baseline. Glass only has a small touch pad, which is placed at an awkward position on the device’s rim, at the user’s temple. The other input modalities Glass offers are eye blink input and voice recognition. While eye blink can be effective as a binary input mechanism, in many situations it is rather limited and could be considered socially awkward. Voice input suffers from recognition errors for non-native speakers of the input language and has considerable lag, as current Android-based devices, such as Google Glass, perform speech-to-text in the cloud. These problems were also observed in the main study of our paper.

We thus proposed three gestural selection techniques in order to extend the input capabilities of HMDs: (1) a head nod gesture, (2) a hand movement gesture and (3) a hand grasping gesture.

The following mock-up video shows the three proposed gestures used in a scenario depicting a material selection session in a (hypothetical) smart space used by architects:

We discounted the head nod gesture after a preliminary study showed a low user preference for such an input method. In a main study, we found that the two gestural techniques achieved performance similar to a baseline technique using the touch pad on Google Glass. However, we hypothesize that the spatial gestural techniques using direct manipulation may outperform the touch pad for larger numbers of selectable targets (in our study we had 12 targets in total), as secondary GUI navigation activities (i.e., scrolling a list view) are not required when using gestures.

In the paper, we also present some possibilities for ad-hoc control of large displays and automated indoor systems:

Ambient light control using spatial gestures tracked via an HMD.

Considering the larger picture, our paper touches on the broader question of ego-centric vs. exo-centric tracking: past work in smart spaces has mainly relied on external (exo-centric) tracking techniques, e.g., using depth sensors such as the Kinect for user tracking and interaction. As wearable devices get increasingly powerful and as depth sensor technology shrinks, it may, in the future, become more practical for users to bring their own sensors to a smart space. This has advantages in scalability: more users can be tracked in larger spaces, without additional investments in fixed tracking systems. Also, a larger number of spaces can be made interactive, as the users carry their sensing equipment from place to place.

# Information Interaction in Context 2014

I asked FXPAL alumnus Jeremy Pickens to contribute a post on the best paper award at IIiX, which is named after our late colleague Gene Golovchinsky.  For me, the episode Jeremy recounts exemplifies Gene’s willingness and generosity in helping others work through research questions.  The rest of this post is written by Jeremy.

I recently had the opportunity to attend the Information Interaction in Context conference at the University of Regensburg.  It is a conference which attempts to bring together the systems and the user perspective on information retrieval and information seeking.  In short, it was exactly the type of conference at which our colleague Gene Golovchinsky was quite at home.  In fact, Gene had been one of the chairs of the conference before his passing last year.  The IIiX organizers made him an honorary chair.  During his time as chair, Gene secured FXPAL’s sponsorship of the conference including the honorarium that accompanied the Best Paper award.  The conference organizers decided to officially give the award in Gene’s memory, and as a former FXPAL employee, I was asked to present the award and to say a few words about Gene.

I began by sharing who I knew Gene to be through the lens of our first meeting.  It was 1998.  Maybe 1999.  Let’s say 1998.  I was a young grad student in the Information Retrieval lab at UMass Amherst.  Gene had recently convinced FXPAL to sponsor my advisor’s Industrial Advisory Board meeting.  This meant that once a year, the lab would put together a poster session to give IAB members a sneak preview of the upcoming research results before they appeared anywhere else.

Well, at that time, I was kinda an odd duck in the lab because I had started doing music Information Retrieval when most of my colleagues were working on text.  So there I am at the IAB poster session, with all these commercial, industry sponsors who have flown in from all over the country to get new ideas about how to improve their text search engines…and I’m talking about melodies and chords.  Do you know that look, when someone sees you but really does not want to talk with you?  When their eyes meet yours, and then keep on scanning, as if to pretend that they were looking past you the whole time?  For the first hour that’s how I felt.

Until Gene.

Now, I’m fairly sure that he really was not interested in music IR.  But not only did Gene stop and hear what I had to say, but he engaged.  Before I knew it, half an hour (or at least it felt like it) had passed by, and I’d had one of those great engaging Gene discussions that I would, a few years later when FXPAL hired me, have a whole lot more of.  Complete with full Gene eye twinkle at every new idea that we batted around.  Gene had this way of conducting a research discussion in which he could both share (give) ideas to you, and elicit ideas from you, in a way that I can only describe as true collaboration.

After the conference dinner and presentation had concluded, there were a number of people that approached me and shared very similar stories about their interactions with Gene.  And a number of people who expressed the sentiment that they wished they’d had the opportunity to know him.

I should also note that the Best Paper award went to Kathy Brennan, Diane Kelly, and Jamie Arguello, for their paper on “The Effect of Cognitive Abilities on Information Search for Tasks of Varying Levels of Complexity”. Neither I nor FXPAL had a hand in deciding who the best paper recipient was to be; that task went to the conference organizers.  But in what I find to be a touching coincidence, one of the paper’s authors, Diane Kelly, was actually Gene’s summer intern at FXPAL back in the early 2000s.  He touched a lot of people, and will be sorely missed.  I miss him.

# LoCo: a framework for indoor location of mobile devices

Last year, we initiated the LoCo project on indoor location.  The LoCo page has more information, but our central goal is to provide highly accurate, room-level location information to enable indoor location services to complement the location services built on GPS outdoors.

Last week, we presented our initial results on the work at Ubicomp 2014.  In our paper, we introduce a new approach to room-level location based on supervised classification.  Specifically, we use boosting in a one-versus-all formulation to enable highly accurate classification based on simple features derived from Wi-Fi received signal strength (RSSI) measures.  This approach offloads the bulk of the complexity to an offline training procedure, and the resulting classifier is sufficiently simple to be run on a mobile client directly.  We use a simple and robust feature set based on pairwise RSSI margin to address Wi-Fi RSSI volatility.

$h_m(X) = \begin{cases} 1 & X(b_m^{(1)}) - X(b_m^{(2)}) \geq \theta_m \\ 0 & \text{otherwise} \end{cases}$

The equation above shows an example weak learner which simply looks at two elements in an RSSI scan and compares their difference against a threshold.  The final strong classifier for each room is a weighted combination of a set of weak learners greedily selected to discriminate that room.  The feature is designed to express the ordering of RSSI values observed for specific access points, with a flexible reliance on the difference between them; the threshold $\theta_m$ is determined in training.  An additional benefit of this choice is that processing only the subset of the RSSI scan used by the selected weak learners further reduces the required computation.  Comparing against the kNN matching approach used in RedPin [Bolliger, 2008], our results show competitive performance with substantially reduced complexity.  The table below shows cross validation results from the paper for two data sets collected in our office.  The classification time appears in the rightmost column.
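To make the weak learner and the per-room weighted combination concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the code from the paper: the scan representation (a dictionary from access-point id to RSSI), the `MISSING_RSSI` placeholder for access points absent from a scan, the function names, and the rule of picking the room with the highest score are all hypothetical choices. The access-point names, thresholds, and weights in the example are made up; in the actual system they would come from the offline boosting procedure.

```python
from typing import Dict, List, Tuple

# Assumed representation: an RSSI scan maps access-point id -> RSSI in dBm.
Scan = Dict[str, float]

# Assumed placeholder value for access points that do not appear in a scan.
MISSING_RSSI = -100.0

# One weak learner: (access point b1, access point b2, threshold theta, weight).
WeakLearner = Tuple[str, str, float, float]


def weak_learner(scan: Scan, b1: str, b2: str, theta: float) -> int:
    """Return 1 if RSSI(b1) - RSSI(b2) >= theta, else 0 (the equation above)."""
    margin = scan.get(b1, MISSING_RSSI) - scan.get(b2, MISSING_RSSI)
    return 1 if margin >= theta else 0


def room_score(scan: Scan, learners: List[WeakLearner]) -> float:
    """Weighted combination of the weak learners selected for one room."""
    return sum(w * weak_learner(scan, b1, b2, theta) for b1, b2, theta, w in learners)


def classify(scan: Scan, rooms: Dict[str, List[WeakLearner]]) -> str:
    """One-versus-all decision (assumed rule): the room with the highest score."""
    return max(rooms, key=lambda room: room_score(scan, rooms[room]))


# Made-up example: two rooms, three access points.
example_rooms: Dict[str, List[WeakLearner]] = {
    "kitchen": [("ap_a", "ap_b", 5.0, 1.0), ("ap_c", "ap_a", 0.0, 0.5)],
    "lab": [("ap_b", "ap_a", 5.0, 1.0)],
}
near_ap_a: Scan = {"ap_a": -40.0, "ap_b": -70.0, "ap_c": -60.0}
near_ap_b: Scan = {"ap_a": -75.0, "ap_b": -45.0}
```

Calling `classify(near_ap_a, example_rooms)` evaluates only the access-point pairs named in the weak learners, which is the source of the reduced computation mentioned above: the rest of the scan is never touched.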

We are excited about the early progress we’ve made on this project and look forward to building out our indoor location system in several directions in the near future.  But more than that, we look forward to building new location driven applications exploiting this technique which can leverage existing infrastructure (Wi-Fi networks) and devices (cell phones) we already use.

# Gesture Viewport: Interacting with Media Content Using Finger Gestures on Any Surface

At ICME 2014 in Chengdu, China, we presented a technical demo called “Gesture Viewport,” which is a projector-camera system that enables finger gesture interactions with media content on any surface. In the demo, we used a portable Pico projector to project a viewport widget (along with its content) onto a desktop and a Logitech webcam to monitor the viewport widget. We proposed a novel and computationally efficient finger localization method based on the detection of occlusion patterns inside a virtual “sensor” grid rendered in a layer on top of the viewport widget. We developed several robust interaction techniques to prevent unintentional gestures from occurring, to provide visual feedback to a user, and to minimize the interference of the “sensor” grid with the media content. We showed the effectiveness of the system through three scenarios: viewing photos, navigating Google Maps, and controlling Google Street View. Click on the following link to watch a short video clip that illustrates these scenarios.

Many people who saw the demo were impressed. They thought that the idea behind the demo, that is, the proposed occlusion-pattern-based finger localization method, was very clever. That is probably a big reason why we won the Best Demo Award at ICME 2014. For more details of the demo, please refer to this paper.