Ancestry.com’s Crista Cowan recently interviewed Laryn Brown, senior product manager, about Ancestry’s new data extraction technology. Ancestry is using the technology to make it easier to find people in their U.S. City Directory collection. The collection has been available for some time using an OCR index.
OCR, optical character recognition, is a software process wherein a computer program attempts to read the images and create a matching document with all the words found on the image. After the task of recognizing words, the computer still doesn’t know what the words mean. What you and I easily recognize as a person’s name is beyond the computer’s ability to identify with any degree of certainty. That’s why in the past it has been so difficult to find someone in a city directory. That is, until now.
Ancestry has developed a technology that uses the regular layout of a city directory to help the computer recognize names, addresses, occupations, and so forth. The technology makes it possible to create a conventional database with a field-based index (rather than an OCR index). You can link records to your tree. You can make corrections to the index. You can search using fields such as name, address, and so forth.
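To give a feel for how a regular layout can help, here is a minimal sketch in Python. It is not Ancestry's method, which the interview does not describe in detail. I am assuming a hypothetical entry layout of surname, given name, optional spouse in parentheses, occupation, an "h" or "r" marker, and then the address. The sample lines and the `parse_entry` function are invented for illustration.

```python
import re
from typing import Optional

# Assumed (hypothetical) layout of one directory entry:
#   Surname Given [Initial] [(Spouse)] occupation h|r address
ENTRY = re.compile(
    r"^(?P<surname>[A-Z][A-Za-z']+)\s+"
    r"(?P<given>[A-Z][A-Za-z]*(?:\s+[A-Z])?)\s+"
    r"(?:\((?P<spouse>[^)]+)\)\s+)?"   # spouse is optional
    r"(?P<occupation>.+?)\s+"
    r"(?P<kind>[hr])\s+"               # assumed marker before the address
    r"(?P<address>\d+.*)$"             # address assumed to start with a number
)

def parse_entry(line: str) -> Optional[dict]:
    """Split one OCR'd line into fields, or return None if it does not fit."""
    match = ENTRY.match(line.strip())
    return match.groupdict() if match else None

# Invented sample lines, not taken from any real directory.
print(parse_entry("Smith John A (Mary) clk h 123 Main"))
print(parse_entry("Jones Wm carp r 45 Oak"))
print(parse_entry("Random heading text"))
```

Once each line is split into fields like these, the results can go into an ordinary searchable index. A rule-based approach like this one can also misfire when an entry deviates from the expected layout, so the output still needs a way to be corrected.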
I searched for David O McKay in Salt Lake City, Utah with spouse name “Emma.” With the old database and the old technology, Ancestry was not able to find any results.
With the new (beta) technology, I easily found 12 instances from 1929 to 1965.
Looking at the three subsequent names (see the image above or to the right), I found that Ancestry correctly interpreted all the names, spouses, and occupations. It got all but one address right: it misinterpreted the address of Edw R McKay as a person named “Temple McKay.”
Still quite impressive.