Google Glass’ semi-demise has become a topic of considerable interest lately. Alexander Sommer at WT-Vox takes the view that it was a courageous “public beta” and “a PR nightmare” but also well received in specialized situations where the application suits the device, as in Scott’s post below. IMO, a pretty good summary.
Blog Category: Research
In the AAAI 2015 conference, we presented the work “Visually Interpreting Names as Demographic Attributes by Exploiting Click-Through Data,” a collaboration with a research team in National Taiwan University. This study aims to automatically associate a name and its likely demographic attributes, e.g., gender and ethnicity. More specifically, the associations are driven by web-scale search logs that are collected via a search engine when internet users retrieve images.
Demographic attributes are vital to semantically characterize a person or a community. This makes it valuable for marketing, personalization, face retrieval, social computing and more human-centric research. Since users tend to keep their online profiles private, name is the most reachable piece of personal information among these contexts. The problem we address is – given a name, associating and predicting its likely demographic attributes. For example, given a person named “Amy Liu,” the person is likely an Asian female. Name makes the first impression of a person because naming conventions are strongly influenced by culture, e.g., first name and gender, last name and location of origin. Typically, the associations between names and the two attributes are made by referring to demographics maintained by governments or by manually labeling attributes based on the given personal information (e.g., photo). The former is limited in regional census data. The latter has major concerns in time and cost when it adapts to large-scale data.
Different from prior approaches, we propose to exploit click-throughs between text queries and retrieved face images in web search logs, where the names are extracted from queries and the attributes are detected from face images automatically. In this paper, a click-through means when one of the URLs returned by a text query has been clicked by a user to view a web image it directs to. The mechanism delivers two messages, (1) the association between a query and an image is based on viewers’ clicks, that is, human intelligence from web-scale users; (2) users may have considerable knowledge to the associations because they might be partially aware of what they are looking for and search engines are getting much better at satisfying user intent. Both characteristics of click-throughs reduce concerns of incorrect associations. Moreover, the Internet users’ knowledge enables discovering name-attribute associations with high generality to more countries.
In the experiments, the proposed name-attribute associations are demonstrated with competitive accuracy compared to using manual labeling. It also benefits profiling social media users and keyword-based face image retrieval, especially the adaption to unseen names. This is the first work to interpret a name to demographic attributes in visual-data-driven manner using web search logs. In the future, we are going to extend the visual interpretation of an abstract name to more targets for which naming conventions are highly influenced by visual appearance.
The use of mobile telepresence robots (MTRs) is increasing. Very few MTRs have autonomous navigation systems. Thus teleoperation is usually still a manual task, and often has user experience problems. We believe that this may be due to (1) the fixed viewpoint and limited field of view of a 2D camera system and (2) the capability of judging distances due to lack of depth perception.
To improve the experience of teleoperating the robot, we evaluated the use of stereo video coupled with a head-tracked and head-mounted display.
To do this, we installed a brushless gimbal with a stereo camera pair on a robot platform. We used an Oculus Rift (DK1) device for visualization and head tracking.
We conducted a preliminary user study to gather qualitative feedback about telepresence navigation tasks using stereo vs. a 2D camera feed, and high vs. low camera placement. In a simulated telepresence scenario, participants were asked to drive the robot from an office to a meeting location, have conversation with a tester, then drive back to the starting location.
An ANOVA on System Usability Scale (SUS) scores with visualization type and camera placement as factors results in a significant effect of visualization type on the score. However, we observed a higher SUS score for navigation based on a 2D camera feed. The camera placement height did not show a significant effect.
The following two main reasons could have caused the lower ratings for stereo: (1) about half of the users experienced at least some form of disorientation. This might have been due to their unfamiliarity with immersive VR headsets but also due the sensory distortion effect of being immersed visually in a moving environment while other bodily senses report sitting still. (2) the video transmission quality was not optimal due to interference of the analog video transmission signal by objects in the building and due to the relatively low display resolution of the Oculus Rift DK1 device.
In the future we intend to work on improving the visual quality of the stereo output by using better video transmission and head-worn display. We furthermore intend to evaluate robot navigation tasks using a full VR view. This view will make use of the robot’s sensors and localization system in order to display the robot correctly within a virtual representation of our building.
At next week’s CSCW 2015 conference, a group from University of Wisconsin-Madison will present an interesting piece of work related to the last post: “Handheld or Handsfree? Remote Collaboration via Lightweight Head-Mounted Displays and Handheld Devices”. Similar to our work, the authors compared the use of Google Glass to a tablet-based interface for two different construction tasks: one simple and one more complex. While in our case study participants created tutorials to be viewed at a later time, this test explored synchronous collaboration.
The authors found that Google Glass was helpful for the more difficult task, enabling better and more frequent communication, while for the simpler task the results were mixed. This more-or-less agrees with our findings: HMDs are helpful for capturing and communicating complicated tasks but less so for table-top tasks.
Another key difference between this work and ours is that the authors relied on Google Hangouts to stream videos. However, as the authors write, “the HMD interface of Google Hangouts used in our study did not offer [live preview feedback],” a key feature for any media capture application.
At FXPAL, we build systems when we are limited by off-the-shelf technology. So when we discovered a related capture feedback issue in early pilots we were able to quickly fix it in our tool. Of course in our case the technology was much simpler because we did not need to implement video streaming. However, since this paper was published we have developed mechanisms to stream video from Glass, or any Android device, using open WebRTC protocols. More than that, our framework can analyze incoming frames and then stream out arbitrary image data, potentially allowing us to implement many of the design implications the authors describe in the paper’s discussion section.
Our IEEE Pervasive paper on head-mounted capture for multimedia tutorials was recently accepted and is currently in press. We are excited to share some our findings here.
Creating multimedia tutorials requires two distinct steps: capture and editing. While editing, authors have the opportunity to devote their full attention to the task at hand. Capture is different. In the best case, capture should be completely unobtrusive so that the author can focus exclusively on the task being captured. But this can be difficult to achieve with handheld devices, especially if the task requires that the tutorial author move around an object and use both hands simultaneously (e.g., showing how to replace a bike derailleur).
For this reason, we extended our ShowHow multimedia tutorial system to support head-mounted capture. Our first approach was simple: a modified pair of glasses with a Looxcie camera and laser guide attached. While this approach interfered with the user’s vision less than other solutions, such as a full augmented reality system, it nonetheless suffered from an array of problems: it was bulky, it was difficult to control, and without a display feedback of the captured area it was hard to frame videos and photos.
Our first head-mounted capture prototype
Luckily, Google Glass launched around this time. With an onboard camera, a touch panel, and display, it seemed an excellent choice for head-mounted capture.
Our video application to the Glass Explorers program
To test this, we built an app for Google Glass that requires minimal attention to the capture device and instead allows the author to focus on creating the tutorial content. In our paper, we describe a study comparing standalone capture (camera on tripod) versus head-mounted (Google Glass) capture. Details are in the paper, but in short we found that tutorial authors prefer wearable capture devices, especially when recording activities involving larger objects in non-tabletop environments.
The ShowHow Google Glass capture app
Finally, based on the success of Glass for capture we built and tested an access app as well. A detailed description of the tool, as well as another study we ran testing its efficacy for viewing multimedia tutorials, is the subject of an upcoming paper. Stay tuned.
The ShowHow Google Glass access app
Several of us just returned from ACM UIST 2014 where we presented some new work as part of the cemint project. One vision of the cemint project is to build applications for multimedia content manipulation and reuse that are as powerful as their analogues for text content. We are working towards this goal by exploiting two key tools. First, we want to use real-time content analysis to expose useful structure within multimedia content. Given some decomposition of the content, which can be spatial, temporal, or even semantic, we then allow users to interact with these sub-units or segments via direct manipulation. Last year, we began exploring these ideas in our work on content-based video copy and paste.
As another embodiment of these ideas, we demonstrated video text retouch at UIST last week. Our browser-based system performs real-time text detection on streamed video frames to locate both words and lines. When a user clicks on a frame, a live cursor appears next to the nearest word. At this point, users can alter text directly using the keyboard. When they do so, a video overlay is created to capture and display their edits.
Because we perform per-frame text detection, as the position of edited text shifts vertically or horizontally in the course of the original (unedited source) video, we can track the corresponding line’s location and update the overlaid content appropriately.
By leveraging our familiarity with manipulating text, this work exemplifies the larger goal to bring interaction metaphors rooted in content creation to enhance both the consumption and reuse of live multimedia streams. We believe that integrating real-time content analysis and interaction design can help us create improved tools for multimedia content usage.
In the recent paper published at SUI 2014,”Exploring Gestural Interaction in Smart Spaces using Head-Mounted Devices with Ego-Centric Sensing”, co-authored with Barry Kollee and Tony Dunnigan, we studied a prototype Head Mounted Device (HMD) that allows the interaction with external displays by input through spatial gestures.
In the paper, one of our goals was to expand the scope of interaction possibilities on HMDs, which are currently severely limited, if we consider Google Glass as a baseline. Glass only has a small touch pad, which is placed at an awkward position on the devices rim, at the user’s temple. The other input modalities Glass offers are eye blink input and voice recognition. While eye blink can be effective as a binary input mechanism, in many situations it is rather limited and could be considered socially awkward. Voice input suffers from recognition errors for non-native speakers of the input language and has considerable lag, as current Android-based devices, such as Google Glass, perform text-to-speech in the cloud. These problems were also observed in the main study of our paper.
We thus proposed three gestural selection techniques in order to extend the input capabilities of HMDs: (1) a head nod gesture, (2) a hand movement gesture and (3) a hand grasping gesture.
The following mock-up video shows the three proposed gestures used in a scenario depicting a material selection session in a (hypothetical) smart space used by architects:
We discounted the head nod gesture after a preliminary study showed a low user preference for such an input method. In a main study, we found that the two gestural techniques achieved performance similar to a baseline technique using the touch pad on Google Glass. However, we hypothesize that the spatial gestural techniques using direct manipulation may outperform the touch pad for larger numbers of selectable targets (in our study we had 12 targets in total), as secondary GUI navigation activities (i.e., scrolling a list view) are not required when using gestures.
In the paper, we also present some possibilities for ad-hoc control of large displays and automated indoor systems:
Considering the larger picture, our paper touches on the broader question of ego-centric vs exo-centric tracking: past work in smart spaces has mainly relied on external (exo-centric) tracking techniques, e.g., using depth sensors such as the Kinect for user tracking and interaction. As wearable devices get increasingly powerful and as depth sensor technology shrinks, it may, in the future, become more practical to users to bring their own sensors to a smart space. This has advantages in scalability: more users can be tracked in larger spaces, without additional investments in fixed tracking systems. Also, a larger number of spaces can be made interactive, as the users carry their sensing equipment from place to place.
I asked FXPAL alumni Jeremy Pickens to contribute a post on the best paper award at IIiX which is named after our late colleague Gene Golovchinsky. For me, the episode Jeremy recounts exemplifies Gene’s willingness and generosity in helping others work though research questions. The rest of this post is written by Jeremy.
I recently had the opportunity to attend the Information Interaction in Context conference at the University of Regensburg. It is a conference which attempts to bring together the systems and the user perspective on information retrieval and information seeking. In short, it was exactly the type of conference at which our colleague Gene Golovchinsky was quite at home. In fact, Gene had been one of the chairs of the conference before his passing last year. The IIiX organizers made him an honorary chair. During his time as chair, Gene secured FXPAL’s sponsorship of the conference including the honorarium that accompanied the Best Paper award. The conference organizers decided to officially give the award in Gene’s memory, and as a former FXPAL employee, I was asked to present the award and to say a few words about Gene.
I began by sharing who I knew Gene to be through the lens of our first meeting. It was 1998. Maybe 1999. Let’s say 1998. I was a young grad student in the Information Retrieval lab at UMass Amherst. Gene had recently convinced FXPAL to sponsor my advisor’s Industrial Advisory Board meeting. This meant that once a year, the lab would put together a poster session to give IAB members a sneak preview of the upcoming research results before they appeared anywhere else.
Well, at that time, I was kinda an odd duck in the lab because I had started doing music Information Retrieval when most of my colleagues were working on text. So there I am at the IAB poster session, with all these commercial, industry sponsors who have flown in from all over the country to get new ideas about how to improve their text search engines…and I’m talking about melodies and chords. Do you know that look, when someone sees you but really does not want to talk with you? When their eyes meet yours, and then keep on scanning, as if to pretend that they were looking past you the whole time? For the first hour that’s how I felt.
Now, I’m fairly sure that he really was not interested in music IR. But not only did Gene stop and hear what I had to say, but he engaged. Before I knew it, half an hour (or at least it felt like it) had passed by, and I’d had one of those great engaging Gene discussions that I would, a few years later when FXPAL hired me, have a whole lot more of. Complete with full Gene eye twinkle at every new idea that we batted around. Gene had this way of conducting a research discussion in which he could both share (give) ideas to you, and elicit ideas from you, in a way that I can only describe as true collaboration.
After the conference dinner and presentation had concluded, there were a number of people that approached me and shared very similar stories about their interactions with Gene. And a number of people who expressed the sentiment that they wished they’d had the opportunity to know him.
I should also note that the Best Paper award went to Kathy Brennan, Diane Kelly, and Jamie Arguello, for their paper on “The Effect of Cognitive Abilities on Information Search for Tasks of Varying Levels of Complexity“. Neither I nor FXPAL had a hand in deciding who the best paper recipient was to be; that task went to the conference organizers. But in what I find to be a touching coincidence, one of the paper’s authors, Diane Kelly, was actually Gene’s summer intern at FXPAL back in the early 2000s. He touched a lot of people, and will be sorely missed. I miss him.
Last year, we initiated the LoCo project on indoor location. The LoCo page has more information, but our central goal is to provide highly accurate, room-level location information to enable indoor location services to complement the location services built on GPS outdoors.
Last week, we presented our initial results on the work at Ubicomp 2014. In our paper, we introduce a new approach to room-level location based on supervised classification. Specifically, we use boosting in a one-versus-all formulation to enable highly accurate classification based on simple features derived from Wi-Fi received signal strength (RSSI) measures. This approach offloads the bulk of the complexity to an offline training procedure, and the resulting classifier is sufficiently simple to be run on a mobile client directly. We use a simple and robust feature set based on pairwise RSSI margin to both address Wi-Fi RSSI volatility.
The equation above shows an example weak learner which simply looks at two elements in an RSSI scan and compares their difference against a threshold. The final strong classifier for each room is a weighted combination of a set of weak learners greedily selected to discriminate that room. The feature is designed to express the ordering of RSSI values observed for specific access points, and a flexible reliance on the difference between them, and the threshold is determined in training. An additional benefit of this choice is that processing a subset of the RSSI scan according to the selected weak learners further reduces the required computation. Comparing against the kNN matching approach used in RedPin [Bolliger, 2008], our results show competitive performance with substantially reduced complexity. The Table below shows cross validation results from the paper for two data sets collected in our office. The classification time appears in the rightmost column.
We are excited about the early progress we’ve made on this project and look forward to building out our indoor location system in several directions in the near future. But more than that, we look forward to building new location driven applications exploiting this technique which can leverage existing infrastructure (Wi-Fi networks) and devices (cell phones) we already use.
At ICME 2014 in Chengdu, China, we presented a technical demo called “Gesture Viewport,” which is a projector-camera system that enables finger gesture interactions with media content on any surface. In the demo, we used a portable Pico projector to project a viewport widget (along with its content) onto a desktop and a Logitech webcam to monitor the viewport widget. We proposed a novel and computationally efficient finger localization method based on the detection of occlusion patterns inside a virtual “sensor” grid rendered in a layer on top of the viewport widget. We developed several robust interaction techniques to prevent unintentional gestures from occurring, to provide visual feedback to a user, and to minimize the interference of the “sensor” grid with the media content. We showed the effectiveness of the system through three scenarios: viewing photos, navigating Google Maps, and controlling Google Street View. Click on the following link to watch a short video clip that illustrates these scenarios.
Many people who had seen the demo were impressed. They thought that the idea behind the demo, that is the proposed occlusion pattern based finger localization method, was very clever. That probably is a big reason why we won the Best Demo Award at ICME 2014. For more details of the demo, please refer to this paper.