Thursday, May 3, 2012

Data Extraction Technology at Ancestry.com

Ancestry.com’s Crista Cowan recently interviewed Laryn Brown, senior product manager, about Ancestry’s new data extraction technology. Ancestry is using the technology to make it easier to find people in their U.S. City Directory collection. The collection has been available for some time using an OCR index.

OCR, optical character recognition, is a software process wherein a computer program attempts to read the images and create a matching document with all the words found on the image. After the task of recognizing words, the computer still doesn’t know what the words mean. What you and I easily recognize as a person’s name is beyond the computer’s ability to identify with any degree of certainty. That’s why in the past it has been so difficult to find someone in a city directory. That is, until now.

Ancestry has developed a technology that uses the regular layout of a city directory to help the computer recognize names, addresses, occupations, and so forth. The technology makes it possible to create a regular database with a regular index (rather than an OCR index). You can link records to your tree. You can make corrections to the index. You can search using fields such as name, address, and so forth.
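To give a feel for how a regular layout can turn OCR text into fields, here is a minimal sketch in Python. It is not Ancestry's method, which the interview does not describe in detail. It assumes a hypothetical entry layout of surname, given name, optional spouse in parentheses, abbreviated occupation, a residence marker ("h" or "r"), and a street address. The pattern, the `parse_entry` function, and the sample entries are all made up for illustration.

```python
import re

# Hypothetical layout (an assumption, not Ancestry's actual rules):
#   Surname Given [Initial] (Spouse) occupation h|r address
ENTRY = re.compile(
    r"^(?P<surname>[A-Z][A-Za-z']+)\s+"        # surname comes first
    r"(?P<given>[A-Z][a-z]+(?:\s[A-Z])?)\s+"   # given name, optional initial
    r"(?:\((?P<spouse>[^)]+)\)\s+)?"           # optional spouse in parentheses
    r"(?P<occupation>.+?)\s+"                  # abbreviated occupation
    r"(?P<residence>[hr])\s+"                  # h = householder, r = resident
    r"(?P<address>\d+.*)$"                     # house number and street
)

def parse_entry(line):
    """Return a dict of fields for one directory line, or None if it does not fit."""
    match = ENTRY.match(line.strip())
    return match.groupdict() if match else None

# Invented sample lines, not taken from any real directory.
print(parse_entry("Smith John A (Mary) clk h 12 Main"))
print(parse_entry("Jones Edw carp r 45 Oak Av"))
print(parse_entry("ADVERTISEMENT"))
```

Lines that do not fit the assumed layout, such as ads, simply fail to parse. A rule this rigid can also put a word in the wrong field when an entry deviates from the pattern.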

Image: David O McKay in 1965 city directory of Salt Lake City, Utah

To test the technology, I performed the same search in “U.S. City Directories, 1821-1989 (Beta)” and “U.S. City Directories.”

I searched for David O McKay in Salt Lake City, Utah with spouse name “Emma.” With the old database and the old technology, Ancestry was not able to find any results.

With the new (beta) technology, I easily found 12 instances from 1929 to 1965.

That’s impressive.

Looking at the three subsequent names (see the image above), I found that Ancestry correctly interpreted all the names, spouses, and occupations. It got all but one address, misinterpreting the address of Edw R McKay as a person named “Temple McKay.”

Still quite impressive.

To see the interview, click on the video below, or click here to watch it online.

Behind the Scenes: Data Extraction Technology and City Directories

Ancestry executives demonstrated the technology at RootsTech. Click here and skip to time index 30:00 to see a behind-the-scenes look at the production tool.

2 comments:

  1. Surely you jest....or are being paid by ancestry.com for this completely FALSE drivel. Ancestry.com's city directories are a horrible example of what happens when indexing is outsourced to China....

    Just because you say a thing is so doesn't make it so. You can tell 100 genealogists that the city directories are fine and the indexing is sound - and all 100 will laugh their arses off at you for saying it.

    WE KNOW BETTER. Get real.

  2. The City Directories have totally fallen apart. I've taken to searching them page by page myself. There was nothing wrong with the OCR version. Some of us actually used the information from ads, clubs etc.. Now most things (Even DODs etc...) are frequently not indexed. Ancestry should go back to OCR if they want to farm it out to non english speakers then they can help clean up an OCRd page by fixing misreadings of letters. Or here's a novel idea bring the jobs back to America!!
