I’ve been messing around with Twitter search, which (on a small scale) led me to store structured tweet, people and document data. I used a relational database to store the data I got from Twitter, and everything worked just fine. (That is, performance was limited by the Twitter API and Twitter search API, not by my database.) But say you have lots of data, and it includes text and structure, and you want to search it. What if you’re Twitter or LinkedIn? Can you still use MySQL or Oracle or whatever to store your data and serve up search results?
At a recent SDForum talk on the search capabilities of LinkedIn, John Wang described how LinkedIn handles its faceted search. The talk covered a wide range of topics around managing scalability that are undoubtedly shared by many web companies: how to handle real-time updates, how to scale to millions of users, etc. LinkedIn uses Lucene and other related tools, and to their credit has made contributions to the Lucene open source tool set, including Bobo and Zoie.
In particular, John described Bobo, a faceted search engine written on top of Lucene. Bobo makes it easy to retrieve structured and semi-structured data from Lucene indexes, and apparently is quite efficient. When working with hybrid (structured and full-text) data, I’ve either stuck the full text (not too much) into a relational database and used the LIKE operator to do the full-text search, or built a Lucene index with a bunch of fields and used those fields to augment the full-text search. Bobo improves on the latter approach by combining Lucene’s capable full-text search with a powerful mechanism for querying structured data.
For example, it can handle queries such as identifying the “top 10 facets of car types ordered by count with a min count of 5” (from the Bobo wiki). While relational databases can be used to produce similar results, they are not necessarily as efficient at it. The Bobo wiki shows a comparison between Bobo/Lucene and MySQL on a largish collection (3M records) of semi-structured data in which Bobo outperforms MySQL on several kinds of queries by one to two orders of magnitude. It’s quite possible that MySQL can be tuned further than the example suggests to improve performance, but the results are still quite striking.
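To make that kind of facet query concrete, here is a minimal sketch in Python of what “top 10 facets of car types ordered by count with a min count of 5” computes, first in plain Python and then as the roughly equivalent relational query. This is not Bobo’s API and says nothing about how Bobo works internally; the function names (`top_facets`, `top_facets_sql`), the `cars` table, and the sample records are all made up for illustration, and SQLite stands in for MySQL only because it ships with Python.

```python
import sqlite3
from collections import Counter


def top_facets(records, field, limit=10, min_count=5):
    """Count the values of `field` and return (value, count) pairs,
    most frequent first, dropping values seen fewer than `min_count` times."""
    counts = Counter(r[field] for r in records if field in r)
    ranked = sorted(
        ((value, n) for value, n in counts.items() if n >= min_count),
        key=lambda pair: (-pair[1], pair[0]),  # by count desc, then value asc
    )
    return ranked[:limit]


def top_facets_sql(records, limit=10, min_count=5):
    """The same question asked of a relational table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cars (car_type TEXT, color TEXT)")
    conn.executemany(
        "INSERT INTO cars VALUES (?, ?)",
        [(r["car_type"], r["color"]) for r in records],
    )
    rows = conn.execute(
        "SELECT car_type, COUNT(*) AS n FROM cars "
        "GROUP BY car_type HAVING COUNT(*) >= ? "
        "ORDER BY n DESC, car_type ASC LIMIT ?",
        (min_count, limit),
    ).fetchall()
    conn.close()
    return rows


# Made-up sample data: 7 sedans, 6 SUVs, 5 coupes, 2 trucks.
cars = (
    [{"car_type": "sedan", "color": "red"}] * 7
    + [{"car_type": "suv", "color": "blue"}] * 6
    + [{"car_type": "coupe", "color": "red"}] * 5
    + [{"car_type": "truck", "color": "white"}] * 2
)

print(top_facets(cars, "car_type"))
print(top_facets_sql(cars))
```

Both functions return the same ranked list for this data, with the trucks dropped because they fall under the minimum count. The relational version has to scan and group the matching rows for every facet it reports, which is one plausible reason a purpose-built engine could come out ahead on this kind of query.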
One place where Bobo wins is flexibility: while MySQL outperforms it on ordering records from an indexed table, Bobo is 31 times faster when sorting on two fields. Since it is unlikely that all combinations of fields can be indexed in a relational database, Bobo provides a particularly efficient means of doing exploratory search on structured data. This appears to be yet another example in which a purpose-built database engine outperforms a relational database.
I wonder if there is a Lucene framework for displaying and selecting faceted data?