Open-source queries

Every once in a while a Twitter query turns up something completely unexpected. I suppose that’s one reason for having them. My query on all things PubMed recently turned up the following gem: a blog entitled PubMed Search Strategies. What is it? A list of queries. What? PubMed queries, in all their Boolean glory. The latest pair of posts are pharmacoepidemiology — keywords, and its fraternal twin, pharmacoepidemiology — MeSH. The queries run to 39 and 13 terms, respectively. No average 2.3-word Web searches, these.

So what are they for? They appear to be by-products of medical searches designed to characterize specific concepts in various ways. They are shared for the same reason that people share open-source code: to benefit others and to build on the work of others.

From an information-seeking perspective, I find these queries fascinating. The first one yielded 963,962 hits, of which 129,619 were review articles. By default, PubMed sorts search results by date of publication (latest first), or you can arrange them in other useful orders, such as alphabetically by first author, by last author, by journal, or by title. I cannot imagine beginning to make sense of this list. (For the record, the second query produces a mere 325,835 results, with 54,071 review articles.) Precision-recall tradeoff indeed.

So it appears that, like much open-source code, these queries are mere components rather than finished tools. Each is a complex facet that must be combined ruthlessly with other expressions to cut these hundreds of thousands of results down to a more human scale.
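To make the facet idea concrete, here is a minimal sketch in Python of how two huge facets shrink when intersected; the facet names and document IDs are invented for illustration:

```python
# Each facet query, run on its own, matches a huge set of documents.
# ANDing facets together is what cuts the results down to human scale.
# All IDs below are invented for illustration.
pharmacoepidemiology = {"pmid1", "pmid2", "pmid3", "pmid4", "pmid5"}
drug_of_interest = {"pmid2", "pmid4", "pmid6"}

# Boolean AND of the two facets, as a searcher would do in PubMed.
combined = pharmacoepidemiology & drug_of_interest
print(sorted(combined))  # ['pmid2', 'pmid4']
```

Each facet alone is useless at six figures of results; the intersection is what makes it a query.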

I wonder, though, if these expressions couldn’t be used for other purposes as well. There appear to be 49 posts on the site to date, mostly pairs of MeSH and keyword queries. It might be interesting to compare the lists to analyze which documents are retrieved by both queries, and which by only one. It seems that their ultimate effectiveness is predicated on combinations with other topics, but which variant — the keyword, the MeSH, or perhaps a hybrid — offers the best performance? What factors affect this performance? Would such queries be more useful with a best-match search engine (rather than a Boolean one), or would the number of terms make such computations prohibitively expensive for interactive use?
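One way to run the comparison suggested above is to treat each variant’s result set as a set of PubMed IDs and measure their overlap. A minimal sketch, with made-up IDs standing in for real retrieval results:

```python
# Compare what a keyword query and its MeSH twin retrieve.
# The IDs are made up; in practice they would come from running
# the two searches and collecting PubMed IDs.
keyword_results = {"a", "b", "c", "d"}
mesh_results = {"b", "c", "e"}

both = keyword_results & mesh_results          # retrieved by both variants
keyword_only = keyword_results - mesh_results  # keyword query alone
mesh_only = mesh_results - keyword_results     # MeSH query alone

# Jaccard overlap: 1.0 means the variants are interchangeable;
# near 0 means they retrieve almost disjoint slices of the literature.
jaccard = len(both) / len(keyword_results | mesh_results)
print(sorted(both), sorted(keyword_only), sorted(mesh_only), jaccard)
```

Relevance judgments on the documents unique to each variant would then say which variant, or hybrid, performs best.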

At this point, I don’t have the domain expertise to construct or evaluate such queries, but it would be interesting to work with someone to perform these experiments. Perhaps some light could be shed on the effectiveness of MeSH queries for medical information seeking.

7 Comments

  1. Sarah Vogel says:

    Hi Gene,

I’m a librarian and professional literature searcher who uses Medline (PubMed) and other literature databases, such as Embase, Biosis, and others, extensively in my work. I found your post really interesting and would be interested in hearing more about your thoughts on best-match engines.

    To give you a little background, the website was created by another librarian and most of the strategies on there were created by librarians to help answer specific questions that they have gotten in the course of their work. In my experience, I can’t always take the strategy “as is” but it can provide me with useful direction for what I do and do not want to include in my own strategy.

When I use them, I’m generally trying to do a very comprehensive (and yet targeted) search that retrieves most of the literature on a particular question. In general I’ll be combining several concepts, so even though the results for an individual set of terms might be huge (like pharmacoepidemiology), my end set of results is usually (though not always) much more manageable.

For the type of searching I do, I have a lot more confidence in the Boolean engines I’ve used than in the other types of engines. When people come to me for a search, they are often doing a major survey of the literature and need to feel confident that we’ve retrieved the vast majority of the relevant literature. In other words, finding 10 or 12 or even 100 really on-target articles may not satisfy their needs.

    As a searcher, I love MeSH and find it really valuable in many instances. Having said that, I would rarely, if ever, depend on MeSH terms for my only search terms if I was doing a search that needed to be as comprehensive as possible. MeSH also works better in some settings than in others. I work in the pharma/biotech field and many of the concepts we look at aren’t particularly well indexed by MeSH. (My favorite bad example – monoclonal antibodies as therapeutics – awful indexing!) However many concepts in clinical medicine are very well indexed and I love using the hierarchical features of MeSH to capture large subject areas such as cancer or cardiology to combine with other concepts.

    I’d love to hear more about any experiments you might do on the topic. In my experience, literature searching is as much an art as a science so if you can bring more “science” to a search engine you could really improve end-user experience.

  2. Thanks for your comments, Sarah! I understand why you use Boolean queries for your searches, and that’s sort of what I was alluding to when I called these searches “facets” that you would combine (as you say) when you’re looking for information that is related to several concepts.

On the other hand, I think PubMed’s sorting of data is a cop-out driven by the need to minimize the required computation. A better approach — even with Boolean search — would be to offer the possibility of ranking the results based on the quality of the match. The more query terms a document matches, the better the match, to simplify a bit.

    But you are right — it is much harder to know whether you’ve succeeded when you’re doing a recall-oriented search than when you’re interested in one or two documents. So a research challenge for people building and evaluating search engines is how to support recall-oriented search without forcing the searcher to examine too many irrelevant documents.

    This may be a case of the grass being greener on the other side, but from a research perspective, I would love to hear about the art of search because if we truly understand that practice, we might actually be able to build systems that lead to much better end-user experience!
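    The match-quality ranking I mention above is essentially coordination-level matching: score each document by how many distinct query terms it contains. A minimal Python sketch, with invented documents and an invented query:

    ```python
    # Coordination-level ranking: score a document by the number of
    # distinct query terms it contains, then sort best-first.
    # Documents and query are invented for illustration.
    docs = {
        "d1": "pharmacoepidemiology of adverse drug events",
        "d2": "adverse events in a cohort study",
        "d3": "drug safety surveillance and adverse drug events",
    }
    query = ["adverse", "drug", "events"]

    def score(text, terms):
        words = set(text.split())
        return sum(1 for t in terms if t in words)

    ranked = sorted(docs, key=lambda d: score(docs[d], query), reverse=True)
    print(ranked)  # documents matching more query terms come first
    ```

    Real engines refine this with term statistics (rare terms count for more), but even this crude ordering would beat date order for a 39-term query.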

  3. Sarah Vogel says:

    Hi Gene,

    Thanks for your response! I can imagine that Medline & MeSH are not optimized to make the best use of the computing power that is currently available. As you probably know, Medline started out as a print index in the 1870s. It went online in the mid-1960s and continued as both a print and online index until 2004. Some of the oddities of the Medline data that people complain about arise from its long history. With all that history behind it, I imagine that making substantive changes to either the data or the interface is like trying to change direction in an aircraft carrier.

    Interestingly, when I do a recall-oriented search, I often do it in stages so that I can pull up the most likely references first and then broaden the search out as I see what strategies are working and what aren’t. So I sort of manually achieve the sorting you mention. It definitely would be easier if PubMed did it automatically.

    Could such a system allow the user to do weighting of terms & dates? One of the issues I’ve seen with some systems I’ve used is that “best” matches are often not really my best matches. If I could tell the system (or it could accurately predict based on linguistics or some internal knowledge base) what the most important terms are, I would be really happy.

    I’m not an average end-user searcher so many of the issues I deal with don’t come up for most PubMed users. However, the reason doctors & scientists consult people like me is that it isn’t easy to structure a good, comprehensive search on PubMed (and PubMed, surprise, surprise, is not a complete source of biomedical literature). So improving the search engine could really help with that. Might put me out of work but it would be interesting in the process!

  4. I am not sure that search engines are going to put anyone out of a job in the foreseeable future. People who tell you otherwise are trying to sell you something. Such as a search engine, for example.

    To answer your question about weights and dates, yes, there are systems that let the searcher specify that kind of information. Lucene is an example of a fairly widely available indexing/retrieval tool that permits term weights. HubMed can use Lucene syntax, for example. Of course, it doesn’t have most of the metadata-related features that PubMed has, so its utility may be diminished by that. It may be possible to use the two in combination, however, to seed PubMed searches with results from HubMed.
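    For the record, Lucene’s query syntax expresses term weights with a caret boost (term^weight). A small sketch of building such a query string in Python; the terms and weights are invented for illustration:

    ```python
    # Build a Lucene-style query string with per-term boosts.
    # The caret ("term^weight") is standard Lucene query syntax;
    # the terms and weights here are invented for illustration.
    weights = {"pharmacoepidemiology": 4.0, "cohort": 2.0, "surveillance": 1.0}

    query = " OR ".join(f"{term}^{w}" for term, w in weights.items())
    print(query)
    # pharmacoepidemiology^4.0 OR cohort^2.0 OR surveillance^1.0
    ```

    A searcher could express exactly the “most important terms” weighting you describe this way.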

  5. […] Vogel’s comment on yesterday’s post got me thinking about recall-oriented search. She wrote about preferring […]

  6. dinesh vadhia says:

    @sarah
    From your considerable experience what is more important (or useful)?:
    i) search results containing a ranked list of documents containing the specified keywords and/or phrases
    ii) search results containing a ranked list of documents similar to one or more other documents, e.g. here are 3 documents on the subject of ‘pharmacoepidemiology’, now find all documents similar to these 3 and return the results in ranked order

  7. Sarah Vogel says:

    Dinesh,

    This is one of those contextual questions and the answer is that both are useful to me in certain situations.

    On the rare occasions that I search only one database, option i) is very nice and saves me a lot of time in post-processing my search results. The problem with it from my perspective is that I generally search anywhere from 2 to 15 scientific databases when doing a search, so to be really useful, I need the feature to apply to data from multiple databases (& de-dupe).

    I use option ii) (in databases that have it) a lot as I explore a new question and start to build up a search strategy. It’s one way (though not the only way) of learning how scientists talk about their research so that you can create a more comprehensive search strategy. To be honest, I don’t use it much in my final search strategies because I need to be able to document what I searched and how I got the results. “More like this” features supplement my searches and I find them helpful, but they aren’t critical to my work.

    I’m a professional searcher, so my needs are often pretty esoteric compared to the average information seeker’s. While I used to work a lot with end-users doing training and trouble-shooting, I haven’t done that in several years, so my perceptions of their needs are probably totally out of date.

Comments are closed.