Searching genealogical data: an opportunity for research


On Jon Elsas’s suggestion, I dug into Ancestry.com’s genealogy web site and did some searching for my wife’s and my ancestors. In addition to the personal and historical interest, I was curious to learn about the data and the data sets from an information seeking perspective.

Ancestry.com federates thousands of databases and archives of varying size, purpose and quality. They provide an interface for searching the data, for saving results, for building up family trees, and for connecting with other people.

Searching this collection presents a range of challenges both for the system designers and for its users.

Ancestry.com provides structured indexes that include dates, names, and places for records such as census data, immigration and naturalization records, military service records, births, deaths, marriages, etc. In addition to these structured items, a record may contain a lot of other interesting information that is not indexed.

From the searcher’s perspective, the challenge is to find the right records given incomplete or conflicting starting information. To complicate matters, the data can be quite noisy, with spelling variants, missing data, and outright mistakes. Many records that appear quite similar to the query parameters may vary on dimensions for which the searcher may not have enough information to discriminate among the results or to triangulate on the right person. The problems are particularly severe for immigrants from non-English-speaking countries whose names were Anglicized at the port of entry or by the immigrants themselves. For example, in my searches, I found the following variants for one family’s last name – Volkostavtser, Volkostavsky, Volcastofski, and Volcastopki. I also found Wolkenstawska and Wolkostawcer, but I am not sure whether these people are related.
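To make the spelling-variant problem concrete, here is a minimal sketch that ranks candidate surnames by string similarity to a query name. It uses the `SequenceMatcher` class from Python's standard `difflib` module as a stand-in similarity measure; this is my assumption for illustration, not a description of how Ancestry.com matches names. The surnames are the variants listed above, plus "Smith" as a hypothetical unrelated control.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_variants(query: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Sort candidate names from most to least similar to the query."""
    scored = [(name, name_similarity(query, name)) for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Variants found in the searches described above; "Smith" is a made-up control.
SURNAMES = [
    "Volkostavsky",
    "Volcastofski",
    "Volcastopki",
    "Wolkenstawska",
    "Wolkostawcer",
    "Smith",
]

for name, score in rank_variants("Volkostavtser", SURNAMES):
    print(f"{name:15s} {score:.2f}")
```

A character-level measure like this captures shared substrings but knows nothing about transliteration habits (W for V, c for k), so a phonetic or language-aware normalization step would probably do better on the Wolk- forms.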

From the system builders’ perspective, there are many challenges, including what constitutes a match, how to rank partial matches, and how to combine evidence of similarity that comes from dates, place names, people’s names, ancestral relations, etc. Queries can return no results, or thousands of records, and neither may be useful to the searcher. The web site provides a range of fields on which to search, making it possible to express rather complex matching criteria. Unfortunately, it doesn’t do a very good job of faceted browsing of the results. Just about the only facets that can be manipulated effectively are the collection type and years.
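One simple way to think about combining evidence is a weighted sum of per-field similarity scores, with missing fields treated as neutral rather than as mismatches. The sketch below is only an illustration under that assumption: the `Record` type, the weights, the five-year tolerance, and the sample years and places are all invented here, and nothing in it reflects the site's actual ranking.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Record:
    surname: str
    birth_year: Optional[int] = None
    place: Optional[str] = None


# Hypothetical weights; a real system would tune or learn these.
WEIGHTS = {"name": 0.5, "year": 0.3, "place": 0.2}
NEUTRAL = 0.5  # score used when either side is missing a field


def name_score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def year_score(a: Optional[int], b: Optional[int], tolerance: int = 5) -> float:
    if a is None or b is None:
        return NEUTRAL
    # Linear decay: identical years score 1, years `tolerance` apart score 0.
    return max(0.0, 1.0 - abs(a - b) / tolerance)


def place_score(a: Optional[str], b: Optional[str]) -> float:
    if a is None or b is None:
        return NEUTRAL
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def match_score(query: Record, record: Record) -> float:
    """Weighted combination of name, date, and place evidence."""
    return (
        WEIGHTS["name"] * name_score(query.surname, record.surname)
        + WEIGHTS["year"] * year_score(query.birth_year, record.birth_year)
        + WEIGHTS["place"] * place_score(query.place, record.place)
    )


def rank(query: Record, records: list[Record]) -> list[Record]:
    return sorted(records, key=lambda r: match_score(query, r), reverse=True)


# Invented sample data for illustration only.
query = Record("Volkostavtser", 1892, "New York")
candidates = [
    Record("Volcastofski", 1920, "Ohio"),
    Record("Volkostavsky", 1893, "New York"),
    Record("Volkostavtser", None, None),
]
for r in rank(query, candidates):
    print(f"{match_score(query, r):.2f}  {r}")
```

Treating a missing field as neutral keeps sparse records from being buried, which matters when, as here, much of the interesting information in a record is not indexed at all.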

It would be interesting, although possibly computationally difficult, to select spelling variants for names and places, to group data geographically, and to select dates as facets of the query rather than as a complete reformulation. The advantage of faceted search over the existing mechanism is that facets would give a better sense of how much variability is present in each kind of information, giving the searcher insight into the depth of the available collections.
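As a rough sketch of what such facets might look like, the code below counts how many results fall under each surname spelling, decade, and place, and then narrows the result set by one facet value without reformulating the query. The record layout, the decade bucketing, and the sample years and places are assumptions made up for this example.

```python
from collections import Counter
from typing import Callable, Optional


def decade(year: Optional[int]) -> Optional[str]:
    """Bucket a year into a decade label such as '1890s'."""
    return None if year is None else f"{year // 10 * 10}s"


# Each facet maps a result record to a facet value (or None if unknown).
FACETS: dict[str, Callable[[dict], Optional[str]]] = {
    "surname": lambda r: r["surname"],
    "decade": lambda r: decade(r["year"]),
    "place": lambda r: r["place"],
}


def facet_counts(records: list[dict]) -> dict[str, Counter]:
    """Count the distinct values of every facet across the result set."""
    counts = {}
    for name, fn in FACETS.items():
        values = (fn(r) for r in records)
        counts[name] = Counter(v for v in values if v is not None)
    return counts


def refine(records: list[dict], facet: str, value: str) -> list[dict]:
    """Keep only the results whose facet value matches the selection."""
    return [r for r in records if FACETS[facet](r) == value]


# Invented result set; only the surnames come from the searches above.
RESULTS = [
    {"surname": "Volkostavtser", "year": 1892, "place": "New York"},
    {"surname": "Volkostavsky", "year": 1897, "place": "New York"},
    {"surname": "Volcastofski", "year": 1910, "place": "Pennsylvania"},
    {"surname": "Volcastopki", "year": 1910, "place": None},
]

for facet, counter in facet_counts(RESULTS).items():
    print(facet, dict(counter))
print(refine(RESULTS, "decade", "1910s"))
```

The per-facet counts are what would give the searcher a feel for the variability in each dimension before committing to a narrower query.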

Finally, it’s interesting to consider whether genealogical searching represents precision-oriented, known-item searching, or recall-oriented, exploratory search. At first blush, the whole point of the exercise is to find the right individual, making it a precision-oriented activity. On the other hand, this is just a small sub-task of the larger goal that many searchers have, which is to build a family tree that consists of many individuals, complete with as much detail about their lives as can be gleaned from the historical record. Nor should we overlook the serendipitous discovery of new relatives or of previously unknown details of one’s ancestors’ lives. Such discoveries can prompt new avenues of exploration, new search strategies, new collections to explore, etc. All such activity clearly falls in the exploratory search category.

Given this combination of rich and challenging data, highly motivated users, and the possibility of defining interesting task metrics, this area seems ripe for academic research from the HCIR perspective. Unfortunately, it remains under-explored: only four articles in the entire digital library are tagged with the keyword genealogy as of this writing; the query “family tree” retrieved 133 hits, but did not identify any other documents related to genealogy per se.

3 Comments

  1. Google Scholar returns about 2,000 hits for genealogists AND (“information retrieval” OR “information seeking”):

    http://scholar.google.com/scholar?q=genealogists+%22information+retrieval%22+OR+%22information+seeking%22&as_sdt=20000000001

    Certainly seems like an interesting domain for study, though I think there’s a risk of people dismissing genealogy as an amateur hobby. But I wonder whether better tools for genealogy would be useful to medical researchers.

  2. There is some work published in the ACM DL about genetic aspects of ancestry, but that seems like a very different domain, in part because of the base data that are used, and in part because of the nature of the results that are desired. As to the “amateur hobby” concern, I have three comments:
    1. There is already lots of research in HCI around people’s leisure activities and crafts, so this aspect falls well within established parameters.
    2. There are companies (such as Ancestry.com) that make money providing these kinds of services, so from their perspective this kind of technology is core to their business.
    3. Lessons learned on these datasets may also be valuable to other applications with uncertain, noisy, and even outright wrong data, such as in intelligence analysis.

  3. […] thought I would start it with some reflections on genealogical searching. This post builds on some earlier observations on genealogy and information […]
