Blog Category: human-computer interaction

Improving User Interfaces for Robot Teleoperation


The FXPAL robotics research group has recently explored technologies for improving the usability of mobile telepresence robots. We evaluated a prototype head-tracked stereoscopic (HTS) teleoperation interface for a remote collaboration task. The results of this study indicate that using a HTS systems reduces task errors and improves the perceived collaboration success and
viewing experience.

We also developed a new focus plus context viewing technique for mobile robot teleoperation. This allows us to use wide-angle camera images
that proved rich contextual visual awareness of the robot’s surroundings while at the same time preserving a distortion-free region
in the middle of the camera view.

To this, we added a semi-automatic robot control method that allows operators to navigate the telepresence robot via a pointing and clicking directly on
the camera image feed. This through-the-screen interaction paradigm has the advantage of decoupling operators from the robot control loop, freeing them for
other tasks besides driving the robot.

As a result of this work, we presented two papers at the IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). We obtained a best paper award for the paper “Look Where You’re Going: Visual Interfaces for Robot Teleoperation” in the Design category.

video text retouch


Several of us just returned from ACM UIST 2014 where we presented some new work as part of the cemint project.  One vision of the cemint project is to build applications for multimedia content manipulation and reuse that are as powerful as their analogues for text content.  We are working towards this goal by exploiting two key tools.  First, we want to use real-time content analysis to expose useful structure within multimedia content.  Given some decomposition of the content, which can be spatial, temporal, or even semantic, we then allow users to interact with these sub-units or segments via direct manipulation.  Last year, we began exploring these ideas in our work on content-based video copy and paste.

As another embodiment of these ideas, we demonstrated video text retouch at UIST last week.  Our browser-based system performs real-time text detection on streamed video frames to locate both words and lines.  When a user clicks on a frame, a live cursor appears next to the nearest word.  At this point, users can alter text directly using the keyboard.  When they do so, a video overlay is created to capture and display their edits.

Because we perform per-frame text detection, as the position of edited text shifts vertically or horizontally in the course of the original (unedited source) video, we can track the corresponding line’s location and update the overlaid content appropriately.

By leveraging our familiarity with manipulating text, this work exemplifies the larger goal to bring interaction metaphors rooted in content creation to enhance both the consumption and reuse of live multimedia streams.  We believe that integrating real-time content analysis and interaction design can help us create improved tools for multimedia content usage.

Ego-Centric vs. Exo-Centric Tracking and Interaction in Smart Spaces


In the recent paper published at SUI 2014,”Exploring Gestural Interaction in Smart Spaces using Head-Mounted Devices with Ego-Centric Sensing”, co-authored with Barry Kollee and Tony Dunnigan, we studied a prototype Head Mounted Device (HMD) that allows the interaction with external displays by input through spatial gestures.

In the paper, one of our goals was to expand the scope of interaction possibilities on HMDs, which are currently severely limited, if we consider Google Glass as a baseline. Glass only has a small touch pad, which is placed at an awkward position on the devices rim, at the user’s temple. The other input modalities Glass offers are eye blink input and voice recognition. While eye blink can be effective as a binary input mechanism, in many situations it is rather limited and could be considered socially awkward. Voice input suffers from recognition errors for non-native speakers of the input language and has considerable lag, as current Android-based devices, such as Google Glass, perform text-to-speech in the cloud. These problems were also observed in the main study of our paper.

We thus proposed three gestural selection techniques in order to extend the input capabilities of HMDs: (1) a head nod gesture, (2) a hand movement gesture and (3) a hand grasping gesture.

The following mock-up video shows the three proposed gestures used in a scenario depicting a material selection session in a (hypothetical) smart space used by architects:

EgoSense: Gestural Interaction in Smart Spaces using Head Mounted Devices with Ego-Centric Sensing from FX Palo Alto Laboratory on Vimeo.

We discounted the head nod gesture after a preliminary study showed a low user preference for such an input method. In a main study, we found that the two gestural techniques achieved performance similar to a baseline technique using the touch pad on Google Glass. However, we hypothesize that the spatial gestural techniques using direct manipulation may outperform the touch pad for larger numbers of selectable targets (in our study we had 12 targets in total), as secondary GUI navigation activities (i.e., scrolling a list view) are not required when using gestures.

In the paper, we also present some possibilities for ad-hoc control of large displays and automated indoor systems:

Ambient light control using spatial gestures tracked by via an HMD.

Ambient light control using spatial gestures tracked by via an HMD.

Considering the larger picture, our paper touches on the broader question of ego-centric vs exo-centric tracking: past work in smart spaces has mainly relied on external (exo-centric) tracking techniques, e.g., using depth sensors such as the Kinect for user tracking and interaction. As wearable devices get increasingly powerful and as depth sensor technology shrinks, it may, in the future, become more practical to users to bring their own sensors to a smart space. This has advantages in scalability: more users can be tracked in larger spaces, without additional investments in fixed tracking systems. Also, a larger number of spaces can be made interactive, as the users carry their sensing equipment from place to place.

Gesture Viewport: Interacting with Media Content Using Finger Gestures on Any Surface


At ICME 2014 in Chengdu, China, we presented a technical demo called “Gesture Viewport,” which is a projector-camera system that enables finger gesture interactions with media content on any surface. In the demo, we used a portable Pico projector to project a viewport widget (along with its content) onto a desktop and a Logitech webcam to monitor the viewport widget. We proposed a novel and computationally efficient finger localization method based on the detection of occlusion patterns inside a virtual “sensor” grid rendered in a layer on top of the viewport widget. We developed several robust interaction techniques to prevent unintentional gestures from occurring, to provide visual feedback to a user, and to minimize the interference of the “sensor” grid with the media content. We showed the effectiveness of the system through three scenarios: viewing photos, navigating Google Maps, and controlling Google Street View. Click on the following link to watch a short video clip that illustrates these scenarios.

Many people who had seen the demo were impressed. They thought that the idea behind the demo, that is the proposed occlusion pattern based finger localization method, was very clever. That probably is a big reason why we won the Best Demo Award at ICME 2014. For more details of the demo, please refer to this paper.

SearchPanel: supporting exploratory search in regular search engines


People often use more than one query when searching for information. We revisit search results to re-find information and build an understanding of our search need through iterative explorations of query formulation. Unfortunately, these tasks are not well supported by search interfaces and web browsers. The only indication of our search process we get is a different colored link to pages we already have visited. In our previous research, we found that a simple query preview widget helped people formulate more successful queries and more efficiently explore the search results. However, the query preview widget would not work with regular search engines since it required back-end support. To bring support for exploratory search to common search engines, such as Google, Bing or Yahoo, we designed and built a Chrome browser plug-in, SearchPanel.

SearchPanel collects and visualizes information about the web pages retrieved in small panel next to the search results. With a glance, a searcher can see which web pages have been previously retrieved, visited and bookmarked. If a web page has a favicon, it is included in the bar (2) to help scanning and navigation of the search results. Each search result is represented as a bar in SearchPanel. The color of the bar (3) indicates retrieval status (teal = new, light blue = previously retrieved but not viewed, and dark blue = previously retrieved and viewed web page). The length of the bar (5) indicates how many times a web page has been visited; shorter bar indicates more visits. If a web page in the results list have previously been bookmarked, a yellow star is shown next to the bar (6). Users can easily re-run the same query with a different search engine by selecting one of the search engine buttons (1). When the user navigates to a web page linked in the search results, a white circle (4) is shown next to the bar representing that search result. This circle persists even if the user continues to follow links away from the web page linked in the search results. Complex2_numbers

When moving away from the search page, SearchPanel stays put and provides a short cut for accessing the search results. The search result being explored is indicated in SearchPanel by a circle. Moving the mouse over a bar in SearchPanel when not on the search page, displays the search result snippet.


We evaluated SearchPanel in a real world deployment and found that appears to have been primarily used for complex information needs, in search sessions with long durations and high numbers of queries. For search session with single queries, we found very little use of SearchPanel. Based on our evaluation, we conclude that SearchPanel appears to be used in the way it was designed; when it is not needed it is out of the way and not used, but when one simple query does not answer the search need, SearchPanel is used for supporting the information seeking process. More details about SearchPanel can be found in our SIGIR 2014 paper.

AirAuth: Authentication through In-Air Gestures Instead of Passwords


At the CHI 2014 conference, we demonstrated a new prototype authentication system, AirAuth, that explores the use of in-air gestures for authentication purposes as an alternative to password-based entry.

Previous work has shown that passwords or PINs as an authentication mechanism have usability issues that ultimately lead to a compromise in security. For instance, as the number of services to authenticate to grows, users use variations of basic passwords, which are easier to remember, thus making their accounts susceptible to attack if one is compromised.

On mobile devices, smudge attacks and shoulder surfing attacks pose a threat to authentication, as finger movements on a touch screen are easy to record visually and to replicate.

AirAuth addresses these issues by replacing password entry with a gesture. Motor memory makes it a simple task for most users to remember their gesture. Furthermore, since we track multiple points on the user’s hands, we do obtain tracking information that is unique to the physical appearance of the legitimate user, so there is an implicit biometric built into AirAuth. Smudge attacks are averted due to the touchless gesture entry and a user study we conducted shows that AirAuth is also quite resistant towards camera-based shoulder surfing attacks.

Our demo at CHI showed the enrollment and authentication phases of our system. We gave attendees the opportunity to enroll in our system and check AirAuth’s capabilities to recognize their gestures. We got great responses from the attendees and obtained enrollment gestures from a number of them. We plan to use these enrollment gestures to evaluate AirAuth’s accuracy in field conditions.

Improving the Expressiveness of Touch Input


Touch input is now the preferred input method on mobile devices such as smartphones or tablets. Touch is also gaining traction in the desktop segment and is also common for interaction with large table or wall-based displays. At present, the majority of touch displays can detect solely the touch location of a user input. Some capacitive touch screens can also report the contact area of a touch, but usually, no further information about individual touch inputs is available to developers of mobile applications.

It would, however, be beneficial to capture further properties of the user’s touch, for instance the finger’s rotation around the vertical axis (i.e., the axis orthogonal to the plane of the touch screen) as well as its tilt (see images above). Obtaining rotation and tilt information for a touch would allow for expressive localized input gestures as well as new types of on-screen widgets that make use of the additional local input degrees of freedom.

Having finger pose information together with touches adds additional local degrees of freedom of input for each touch location. This, for instance, allows the user interface designer to remap established multi-touch gestures such as pinch-to-zoom to other user interface functions or to free up screen space by allowing input (e.g., adjusting a slider value, scrolling a list, panning a map view, enlarging a picture) to be performed at a single touch location that usually need (multi-) touch gestures that require a significant amount of screen space. New graphical user interface widgets that make use of finger pose information, such as rolling context menus, hidden flaps or occlusion-aware widgets have also been suggested.

Our PointPose prototype performs finger pose estimation at the location of touch using a short-range depth sensor viewing the touch screen of a mobile device. We use the point cloud generated by the depth sensor for finger pose estimation. PointPose estimates the finger pose of a user touch by fitting a cylindrical model to the subset of the point that corresponds to the user’s finger. We use the spatial location of the user’s touch to seed the search for the subset of the point cloud representing the user’s finger.

One advantage of our approach is that it does not require complex external tracking hardware (as in related work), and external computation is unnecessary as the finger pose extraction algorithm is efficient enough to run directly on the mobile device. This makes PointPose ideal for prototyping and developing novel mobile user interfaces that use finger pose estimation.

Looking ahead

on Comments (4)

It is reasonably well-known that people who examine search results often don’t go past the first few hits, perhaps stopping at the “fold” or at the end of the first page. It’s a habit we’ve acquired due to high-quality results to precision-oriented information needs. Google has trained us well.

But this habit may not always be useful when confronted with uncommon, recall-oriented, information needs. That is, when doing research. Looking only at the top few documents places too much trust in the ranking algorithm. In our SIGIR 2013 paper, we investigated what happens when a light-weight preview mechanism gives searchers a glimpse at the distribution of documents — new, re-retrieved but not seen, and seen — in the query they are about to execute.

Continue Reading

In Defense of the Skeuomorph, or Maybe Not…


Hard drive iconJony Ive is a fantastic designer. As a rule, his vision for a device sets the trend for that entire class of devices. Apparently, Jony Ive hates skeuomorphic design elements. Skeuomorphs are those sometimes corny bits of realism some designers add to user interfaces. These design elements reference an application’s analog embodiment. Apple’s desktop and mobile interfaces are littered with them. Their notepad application looks like a notepad. Hell, the hard drive icon on my desktop is a very nice rendering of the hard drive that is actually in my desktop.

Continue Reading

Invited talk at CATCH


Thanks to Frank Nack and Marc Bron, last week I had the opportunity to give a talk in The Netherlands at a NWO CATCH event organized by BRIDGE. NWO is the Dutch national research organization; BRIDGE is a project that explores access to television archives; and CATCH stands for Continuous Access To Cultural Heritage, which is something like an umbrella organization. The meeting was held at the Netherlands Institute for Sound and Vision in Hilversum, a rather interesting building.

Interior of the Netherlands Institute for Sound and Vision in Hilversum

Although it was a long way to go for a one-day event, I am grateful to Frank and Marc for the invitation, for their efforts as hosts, and for all the great discussion during the talk, in the breaks between sessions, and, of course, over beers in the evening. It’s great to be able to make such connections; hopefully more collaboration will follow.

For those interested, here are the slides of my presentation, which expands a bit on my earlier blog post about using the history of interaction to improve exploratory search.