Finding facets


I’ve been messing around with Twitter search, which (on a small scale) led me to store structured tweet, people and document data. I used a relational database to store the data I got from Twitter, and everything worked just fine. (That is, performance was limited by the Twitter API and Twitter search API, not by my database.) But say you have lots of data, and it includes text and structure, and you want to search it. What if you’re Twitter or LinkedIn? Can you still use MySQL or Oracle or whatever to store your data and serve up search results?

At a recent SDForum talk on the search capabilities of LinkedIn, John Wang described how LinkedIn handles its faceted search. The talk covered a wide range of topics around managing scalability that are undoubtedly shared by many web companies: how to handle real-time updates, how to scale to millions of users, etc. LinkedIn uses Lucene and other related tools, and to their credit has made contributions to the Lucene open source tool set, including Bobo and Zoie.

In particular, John described Bobo, a faceted search engine written on top of Lucene. Bobo makes it easy to retrieve structured and semi-structured data from Lucene indexes, and apparently is quite efficient. When working with hybrid (structured and full-text) data, I’ve either stuck the full text (not too much) into a relational database and used the LIKE operator to do the full-text search, or built a Lucene index with a bunch of fields and used those fields to augment the full-text search. Bobo improves on the latter approach by combining Lucene’s capable full-text search with a powerful mechanism for querying structured data.
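To make the first of those two approaches concrete, here is a minimal sketch of the LIKE-based search using Python's built-in sqlite3 module. The `docs` table, its columns, and the sample rows are all made up for illustration; this is not the schema I actually used.

```python
import sqlite3

# Hypothetical schema: one structured field (author) plus a short text body.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO docs (author, body) VALUES (?, ?)",
    [
        ("alice", "Faceted search with Lucene"),
        ("bob", "Relational databases and search"),
        ("alice", "Cooking notes"),
    ],
)

def search(conn, term, author=None):
    """Substring match on body via LIKE, optionally narrowed by author."""
    sql = "SELECT id FROM docs WHERE body LIKE ?"
    params = [f"%{term}%"]
    if author is not None:
        sql += " AND author = ?"
        params.append(author)
    sql += " ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]

print(search(conn, "search"))                  # [1, 2]
print(search(conn, "search", author="alice"))  # [1]
```

A pattern with a leading wildcard like this generally cannot use an ordinary index, so the database ends up scanning every row, which is fine for a little text and painful for a lot. That is the gap a Lucene-based engine such as Bobo is meant to fill.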

For example, Bobo can handle queries such as identifying the “top 10 facets of car types ordered by count with a min count of 5” (from the Bobo wiki). While relational databases can be used to produce similar results, they are not necessarily as efficient at it. The Bobo wiki shows a comparison between Bobo/Lucene and MySQL on a largish collection (3M records) of semi-structured data in which Bobo outperforms MySQL on several kinds of queries by one to two orders of magnitude. It’s quite possible that MySQL can be tuned further than the example suggests to improve performance, but the results are still quite striking.
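For anyone unfamiliar with what that facet query actually computes, here is a rough in-memory sketch in Python. It is not Bobo's API or algorithm, just the same question (top N values of a field by count, subject to a minimum count) asked of a list of dictionaries; the `top_facets` function and the sample records are my own invention.

```python
from collections import Counter

def top_facets(records, field, limit=10, min_count=5):
    """Return up to `limit` (value, count) pairs for `field`,
    most frequent first, dropping values seen fewer than `min_count` times."""
    counts = Counter(r[field] for r in records if field in r)
    frequent = [(value, n) for value, n in counts.most_common() if n >= min_count]
    return frequent[:limit]

# Made-up records standing in for a car collection.
cars = (
    [{"type": "sedan"}] * 7
    + [{"type": "suv"}] * 5
    + [{"type": "coupe"}] * 2
)

print(top_facets(cars, "type"))  # [('sedan', 7), ('suv', 5)]
```

The hard part, of course, is doing this quickly over millions of records and in combination with a full-text query, which is where a purpose-built engine earns its keep.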

One place where Bobo wins is flexibility: while MySQL outperforms it on ordering records from an indexed table, Bobo is 31 times faster when sorting on two fields. Since it is unlikely that all combinations of fields can be indexed in a relational database, Bobo provides a particularly efficient means of doing exploratory search on structured data. This appears to be yet another example in which a purpose-built database engine outperforms a relational database.

Of course the problem with HCIR is not just IR, it’s HCI as well: to make use of this interesting and powerful technology, you have to build sophisticated user interfaces. Even a simple results-filtering interface based on facets identified in the data will require considerably more code than the library calls to retrieve the data. I shudder to think of the JavaScript required to produce a pleasant user experience.

I wonder if there is a Lucene framework for displaying and selecting faceted data?



  1. Yes, check out Apache Solr, which is in the Lucene family. It provides much more than just faceted search.

  2. Twitter Comment

    FXPAL Blog » Blog Archive » Finding facets [link to post]

    Posted using Chat Catcher

  4. […] a treat to get this look under the hood, as well as to finally meet John in person. I also ran into Gene Golovchinsky there–so much for my spending a few days on the west coast […]

  5. […] wanted to clarify the point I tried to make in my blog post about Bobo and LinkedIn’s use of faceted search. I ended that post with a confusing question […]

  6. Hi, Gene,

    How is DBSight working for you?

