While many of the systems we build at FXPAL are either deployed internally or transferred to our parent company, in some cases we get to deploy them in the real world. This week, we released TalkMiner, a system for indexing and searching video of lecture broadcasts. We’ve indexed broadcasts from a variety of sources, including the U.C. Berkeley webcast.berkeley site, the blip.tv site, and various channels on YouTube, including Google Tech Talks, Stanford University, MIT Open Courseware, O’Reilly Media, TED Talks, and NPTEL Indian Institute of Technology.
But all of these videos are already indexed by web search engines, you say; why do we need TalkMiner?
While web search engines index the text of the page in which the video is embedded, TalkMiner indexes the contents of the slides in the video, making more fine-grained retrieval of video possible. Is this useful?
Well, it turns out that the deployment of the Berkeley webcasting system (developed by our president, Larry Rowe, while he was a professor there) showed that
… students almost always watched the lectures on-demand rather than in real-time, and they rarely watched the entire lecture. Students use the webcasts to study for exams – we could see this clearly by patterns of usage – and, they primarily wanted to review selected material covered by the instructor. In one class we discovered that for over 50% of the lectures, students watched less than 10 minutes from a 50-minute lecture and students watched the entire lecture only 10% of the time. Consequently, for using the system, effective search is a big issue.
To solve this problem, TalkMiner recognizes images of presentations in lecture video, and applies OCR to these regions to extract the slide text. This text is indexed along with the associated time codes, and can then be used to search for specific content. The video is divided into segments corresponding to slides; thumbnails of slides are shown when a video is selected. The video can then be watched end-to-end, or you can skip to a particular slide and listen from there. To help find topics of interest, slides that contain keyword matches to the query are highlighted.
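To make the indexing idea concrete, here is a minimal sketch in Python of a keyword index over slide text with time codes. It is not TalkMiner's actual code: it assumes slide detection and OCR have already produced one text string per slide, and the names `Slide`, `build_index`, and `search` are hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Slide:
    video_id: str        # hypothetical identifier for the talk
    start_seconds: int   # time code at which the slide first appears
    text: str            # slide text, assumed to come from an OCR step


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(slides):
    """Map each keyword to the set of slides whose text contains it."""
    index = defaultdict(set)
    for slide in slides:
        for token in tokenize(slide.text):
            index[token].add(slide)
    return index


def search(index, query):
    """Return slides containing every query keyword, ordered by video and time code."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches, key=lambda s: (s.video_id, s.start_seconds))


# Made-up sample data standing in for OCR output.
slides = [
    Slide("talk-1", 0, "Introduction to Video Search"),
    Slide("talk-1", 95, "OCR on Slide Images"),
    Slide("talk-2", 30, "Indexing slide text with time codes"),
]
index = build_index(slides)

for hit in search(index, "slide text"):
    print(hit.video_id, hit.start_seconds, hit.text)
```

Because each match carries its own time code, a player could jump straight to the matching slide rather than starting the video from the beginning. A real system would also need to cope with OCR errors and rank results, which this sketch ignores.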
The current index contains over 12,200 talks on a range of topics, and additional talks are indexed daily. Take a look at the system and let us know what you think!