Client-side search


When we rolled out the CHI 2013 previews site, we got a couple of requests to be able to search the site by keyword. Of course, search interfaces are one of my core research interests, so that request got me thinking: how could we do search on this site? The problem with the conventional approach to search is that it requires some server-side code to do the searching and to return results to the client. That wouldn't work for our simple web site, because from the server's perspective our site was static: just a few HTML files, a little bit of JavaScript, and about 600 videos. Using Google to search the site wouldn't work either, because most of the searchable content was located on two pages, with hundreds of items on each page. So what to do?

I looked around briefly trying to find some client-side indexing and retrieval code, and struck out. Finally, I decided to take a crack at writing a search engine in JavaScript. Now, before you get your expectations up, I was not trying to re-implement Lucene in JavaScript. All I wanted was some rudimentary keyword search capability. Building that in JavaScript was not so difficult.

One simplifying assumption I could make was that my document collection was static: sorry, the submission deadline for the conference has passed. Thus, I could build a static index once, make it available to each client, and all the client needed to do was match and rank.

Each of my documents had a three-character id and a set of fields. I didn't bother with the fields, and just lumped everything together in the index. The approach was simple, again due to lots of assumptions. I treated the inverted index as a hash table that maps keywords onto lists of document ids. OK, document ids and term frequencies. Including positional information is left as an exercise for the reader.

So here is the index structure:


{"human": {"AAQ":4,"AFC":2,"AGJ":1,"AHQ":1,"AJX":3,... },
"thought":{"ABX":1,"AGZ":1,"NAC":1,"PHQ":1,"PHS":1,...},
"assumptions": {"ABX":1,"ANC":1,"ARX":1,"CYU":1,...},
...
}

This means that the term “human” occurs four times in the document with id AAQ, twice in AFC, once in AGJ, and so on. In addition, I store the number of documents and the average document length, because they are useful when scoring documents for retrieval.
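
For concreteness, here is roughly what the whole saved structure looks like once those statistics are included (the numbers here are illustrative, not the real ones):

{ "index": { "human": {"AAQ":4, "AFC":2, ... }, ... },  // the inverted index shown above
  "nDocs": 600,    // number of indexed documents (illustrative)
  "aveDL": 60.0    // average document length, in terms (illustrative)
}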

To create it, I implemented an Index function that can take a collection of items and add bits of these items to the index. In the end, it saves the index as a JSON file. There is no stemming (to get stemming to work properly would require similar capability on the client), but there is a rudimentary stop list: I filter out all short common words, and include a few additional manually-selected terms as well.


function Index(submissionCodes) {
    console.log('creating index');
    this.stopList = ['their', 'which', 'using', 'through',
        'while', 'about', 'other', 'three', 'during', 'all',
        'however', 'these', 'often', 'better', 'may',
        'allows', 'there', 'do', 'up', 'ways', 'who', 'g',
        'j', 'm', 'way', 'set', 'four', 'then', 'no', 'less',
        'could', 'so'];
    var invertedIndex = {};
    var nDocs = 0;
    var totalTerms = 0.0;

    // Removes common entries & saves the index into a JSON file
    // This code was run as a nodejs script, so you need
    // var fs = require("fs"); somewhere above
    this.saveIndex = function(file) {
        pruneIndex(60, 4);
        var content = JSON.stringify(this.getIndex());
        fs.writeFileSync(file, 'var textIndex = ' + content);
    }

    // number of docs and average document length
    // are useful for ranking
    this.getIndex = function() {
        return {index: invertedIndex, nDocs: nDocs,
                aveDL: totalTerms / Math.max(1, nDocs)};
    }

    // Adds entries for given items into the index.
    // Assumes a specific field structure; the field names used
    // below (title, authors, abstract) are illustrative -- use
    // whatever text fields your items actually have.
    this.index = function(items) {
        console.log('indexing ' + items.length + ' documents');
        for (var i = 0; i < items.length; i++) {
            var item = items[i];
            var terms = tokenize([item.title, item.authors, item.abstract].join(' '));
            nDocs++;
            for (var t = 0; t < terms.length; t++) {
                var term = canonical(terms[t]);
                if (!term || this.stopList.indexOf(term) >= 0)
                    continue;
                totalTerms++;
                var postings = invertedIndex[term] || (invertedIndex[term] = {});
                postings[item.id] = (postings[item.id] || 0) + 1;
            }
        }
    }

    // Drops terms that occur in too many documents (they carry
    // little information) or that are too short
    function pruneIndex(maxDF, minLength) {
        for (var term in invertedIndex) {
            var df = Object.keys(invertedIndex[term]).length;
            if (!(df < maxDF && term.length > minLength))
                delete invertedIndex[term];
        }
    }
}
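
Here is a sketch of how the indexer might be driven as a node script. The file names and the shape of the item records are assumptions on my part; substitute whatever your collection actually looks like.

// buildIndex.js -- run with: node buildIndex.js
// Assumes the Index function above and the tokenize/canonical
// helpers shown just below are defined in (or require()'d into) this script.
var fs = require("fs");

// items.json is assumed to hold an array of records such as
// {"id": "AAQ", "title": "...", "authors": "...", "abstract": "..."}
var items = JSON.parse(fs.readFileSync("items.json", "utf8"));

var indexer = new Index();        // submissionCodes not needed for this sketch
indexer.index(items);
indexer.saveIndex("textIndex.js");   // writes "var textIndex = {...}"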

In addition, I created a couple of helper functions, shared between the indexer and the searcher, to tokenize and do other pre-processing on text and query terms.


if (!String.prototype.trim) {
    String.prototype.trim = function () {
        return this.replace(/^\s+|\s+$/g, '');
    };
}

function canonical(term) {
    return term ? term.trim() : null;
}

function tokenize(text) {
    return text ? text.toLowerCase().replace(/[^a-z\s]|_/g, " ").split(/\s+/) : [];
}
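
For example (illustrative values):

tokenize("Human-Computer Interaction");   // -> ["human", "computer", "interaction"]
tokenize("CHI 2013!");                    // -> ["chi", ""] -- digits and punctuation become spaces
canonical("  search ");                   // -> "search"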

Once the index is created, it can be loaded into a page just like any other JavaScript (using the <script> tag). Note that when the index is saved, the string var textIndex = is prepended to the JSON serialization. This makes it possible to refer to the index through a global variable in the JavaScript. You could obviously come up with a different scheme for doing this.
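
On the page, that amounts to something like the following (the file name is an assumption):

<script src="textIndex.js"></script>
<script>
    // the index is now available as a plain global object
    console.log(textIndex.nDocs + ' documents, average length ' + textIndex.aveDL);
</script>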

To search the index, I created the following object that implements ranking based on the BM25 scoring function.
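
For reference, the score a document D receives for a query Q under BM25 is the following, using the same k1 = 1.5 and b = 0.75 that appear in the code; here f(q_i, D) is the frequency of term q_i in D, |D| is the document length, avgdl is the average document length, N is the number of documents, and n(q_i) is the number of documents containing q_i:

\mathrm{score}(D,Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\,\frac{f(q_i,D)\,(k_1+1)}{f(q_i,D) + k_1\,(1 - b + b\,|D|/\mathrm{avgdl})},
\qquad
\mathrm{IDF}(q_i) = \log\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}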


function Searcher(indexData) {
    var invertedIndex = indexData.index;
    var nDocs = indexData.nDocs;
    var aveDL = indexData.aveDL;
    var k1 = 1.5;
    var b = 0.75;
    var idfs = {};   // cache of per-term idf values

    // per-document lengths aren't saved in the index, so approximate
    // them by summing each document's term counts (an assumption of
    // this sketch; you could also save the lengths with the index)
    var docLengths = {};
    for (var w in invertedIndex)
        for (var d in invertedIndex[w])
            docLengths[d] = (docLengths[d] || 0) + invertedIndex[w][d];

    // public method to run the query
    this.search = function(queryString) {
        var results = retrieve(queryString);
        results = rank(results);

        return results;
    }

    // retrieves documents that contain at least one term
    // from queryString
    function retrieve(queryString) {
        var results = {query: tokenize(queryString), hits: {}};

        for (var t = 0; t < results.query.length; t++) {
            var term = canonical(results.query[t]);
            var postings = term ? invertedIndex[term] : null;
            if (!postings) continue;
            for (var d in postings) {
                var hit = results.hits[d] || (results.hits[d] = {});
                hit[term] = postings[d];   // term frequency of term in d
            }
        }
        return results;
    }

    // scores each hit with BM25; returns {id, score} objects
    // sorted by decreasing score
    function rank(results) {
        var ranked = [];
        for (var d in results.hits) {
            var score = 0;
            for (var term in results.hits[d]) {
                var tf = results.hits[d][term];
                score += idf(term) * tf * (k1 + 1) /
                         (tf + k1 * (1 - b + b * docLengths[d] / aveDL));
            }
            ranked.push({id: d, score: score});
        }
        ranked.sort(function(x, y) { return y.score - x.score; });
        return ranked;
    }

    // inverse document frequency of a term, computed lazily and cached
    function idf(term) {
        if (idfs[term] === undefined) {
            var n_i = Object.keys(invertedIndex[term] || {}).length;
            idfs[term] = nDocs > 0 ?
                Math.max(0, Math.log((nDocs - n_i + 0.5) / (n_i + 0.5))) : 0;
        }
        return idfs[term];
    }
}
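
Putting it together on the page might look something like this; the element id and the way results are displayed are assumptions, not part of the original site:

var searcher = new Searcher(textIndex);

document.getElementById('searchBox').onchange = function () {
    var hits = searcher.search(this.value);
    // each hit is {id: ..., score: ...}, best match first
    for (var i = 0; i < hits.length; i++) {
        console.log(hits[i].id + ' (' + hits[i].score.toFixed(2) + ')');
    }
};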

I know this seems like a lot of code when formatted in the post, but really there isn’t much to it. Clearly there are many, many simplifying assumptions that keep it this compact, but it does work, and of course it can be extended as needed. In the end, it provided a very quick way to search through hundreds of items that were otherwise hard to navigate. And the person who sparked the work thought it was ‘awesome’!

As a final caveat, I didn’t do extensive testing on this code, so there may be bugs lurking in it. Proceed with caution, and let me know if you make any improvements. Feel free to use it for your own purposes, but please give me credit if it’s due!

JavaScript Search Engine by Gene Golovchinsky is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.