Monday, February 1, 2010

Ancestry.com Bloggers Day: Technology

Last year I intended to do stupendously rich articles about Ancestry.com Bloggers Day presentations. Since I never got around to it, this year you’re getting my stupidously poor notes.

Mike Wolfgramm, senior vice president of development, and Jonathan Young, vice president of development, gave us the last presentation before lunch. Their opening slide listed the six key agenda items below. In this article I’ll cover the first item. Tomorrow I’ll cover the remaining five.

Key technologies we will talk about:

* Flexible content digitization pipeline

* Named entity extraction

* Vertical [unique to Ancestry.com] search engine

* Record linking

* Hint engine – technology behind the shaky leaf

* PersonRank – Search engine that powers Mundia (pronounced, “Moon-dia”)

It is a challenge to handle different document characteristics

* Documents are very different: census, Chinese family history, manuscript, newspaper, German directories

Ancestry.com works with a large range of sources

Flexible Content Pipeline “Dexter” – applies only the appropriate steps from a plug-in toolset:

Dexter, the Ancestry.com content pipeline, has many optional tools

Scan manager* – handles alternate multispectral scans.

-  Showed the examples that I included last Monday.

Kimberly Powell suggested including the spectrum type in the source metadata. I was not able to locate any alternate spectrum images online. I suppose these are yet to be published.

Auto de-skew*

Auto crop*

Image QA

Watermarking* – intelligent algorithms for placement of watermark
[Image: MyFamily.com watermark]

Table of contents tool
[Image: Table of Contents of a book on Ancestry.com]

Auto leveling

Image conversion

Binarization* – converting a grayscale image to black and white [also called thresholding]

Niblack’s method has a 0.67% OCR error rate

The Ancestry.com method has a 0.04% OCR error rate
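For the curious, here is a minimal sketch of Niblack’s method, the baseline in the comparison above. It is not the Ancestry.com method, which these notes don’t describe. Niblack sets a separate threshold for each pixel: the mean of the surrounding window plus k times its standard deviation. The window size, the value k = -0.2 (a commonly cited default), and the function name are my assumptions for illustration.

```python
from statistics import mean, pstdev


def niblack_binarize(gray, window=3, k=-0.2):
    """Binarize a grayscale image with Niblack's local threshold.

    gray is a list of rows of 0-255 ints. Returns rows of 0 (black)
    or 255 (white). window and k are illustrative defaults.
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gather the neighborhood, clipped at the image borders.
            neighborhood = [
                gray[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            # Niblack: threshold = local mean + k * local standard deviation.
            threshold = mean(neighborhood) + k * pstdev(neighborhood)
            row.append(255 if gray[y][x] >= threshold else 0)
        result.append(row)
    return result


# A light 5x5 "page" with a single dark "ink" pixel in the middle.
page = [[200] * 5 for _ in range(5)]
page[2][2] = 20
binary = niblack_binarize(page)
```

Because the threshold adapts to each neighborhood, uneven lighting or staining across a page matters less than it would with one global cutoff.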

Auto Classification

-  Automatically classifies document types per page

-  Routes desired page to next stage

-  Wolfgramm mentioned an example from the British Army WWI Pension Records 1914-1920. Only the first page—the Attestation page—of a soldier’s pension file required indexing. Automatic classification distinguished these pages (below left) from other pages (below right).
[Images: Auto classification identified the pages that required indexing. Auto classification could tell the difference between attestation pages and others.]

Auto name fielding* is necessary because computers are extremely stupid and must be taught to recognize names. This works best for consistently formatted pages. Yearbooks are an example. The computer must be taught that the words to the left of the pictures are names, family name first, and then given names (below).
[Image: name fielding example]
Even though the computer is taught how to look for names, it still has a hard time. Because OCR is used instead of indexers, not all names are recognized. I tested the example above and found Eileen Draper and Joan Clark were recognized, but not John Earl or Mary Jo Ellis.
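To give a feel for what “teaching” the computer might involve, here is a minimal sketch of a name fielding rule for OCR text. It is not Ancestry.com’s patented tool. It assumes each OCR line holds one name written family name first, with an optional comma, and the pattern, the `FieldedName` type, and the sample lines are all hypothetical.

```python
import re
from typing import NamedTuple, Optional


class FieldedName(NamedTuple):
    family: str
    given: str


# Assumed layout: "Family, Given [Given ...]" with an optional comma.
NAME_LINE = re.compile(
    r"^\s*([A-Za-z'\-]+),?\s+([A-Za-z'\-]+(?:\s+[A-Za-z'\-]+)*)\s*$"
)


def field_name(ocr_line: str) -> Optional[FieldedName]:
    """Split one OCR line into family and given names, or return None."""
    match = NAME_LINE.match(ocr_line)
    if match is None:
        # OCR noise or an unexpected layout: no name is recorded.
        return None
    return FieldedName(family=match.group(1), given=match.group(2))


clean = field_name("Ellis, Mary Jo")
garbled = field_name("J0hn 3arl")  # digits where OCR misread letters
```

A line that OCR garbles fails the pattern and yields no name at all, which is one way a person on the page can go missing from the index.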

Keying tool*

Fact extraction

Inference engine – advanced algorithms to create inferences from existing data. Infer birth year from age, infer mother from wife, infer father from son-in-law
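Two of these inferences are simple enough to sketch. This is my own illustration, not Ancestry.com’s engine. It assumes that “infer mother from wife” means treating the wife listed in a household as the mother of the children listed with her. The function names, record layout, and sample household are hypothetical.

```python
from typing import Dict, List, Tuple


def infer_birth_year_range(record_year: int, age: int) -> Tuple[int, int]:
    """Return the earliest and latest possible birth year.

    Someone aged 30 on a 1910 record was born in 1879 or 1880,
    depending on whether the birthday had passed that year.
    """
    return (record_year - age - 1, record_year - age)


def infer_mothers(household: List[Dict[str, str]]) -> Dict[str, str]:
    """Map each child's name to the name of the household's wife.

    Returns nothing unless exactly one wife is listed, since the
    inference would be ambiguous otherwise.
    """
    wives = [p["name"] for p in household if p["relation"] == "wife"]
    if len(wives) != 1:
        return {}
    return {
        p["name"]: wives[0]
        for p in household
        if p["relation"] in ("son", "daughter")
    }


# Hypothetical household as it might appear on a census page.
household = [
    {"name": "Henry", "relation": "head"},
    {"name": "Ada", "relation": "wife"},
    {"name": "Tom", "relation": "son"},
]
birth_range = infer_birth_year_range(1910, 30)
mothers = infer_mothers(household)
```

Inferences like these are guesses, not facts. A wife may be a stepmother, and ages on old records are often rounded.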

Field normalization - [I think this is the same as “Named entity extraction,” below.]

And more…

Dexter tools, above, that are followed by an asterisk * are technologies with patents or patent applications.
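The plug-in idea behind Dexter, applying only the steps a given collection needs, can be sketched in a few lines. This is a toy illustration under my own assumptions, not Dexter’s design: the registry, the recipes, and the collection names are hypothetical, and each tool is a stand-in that only records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Page:
    collection: str
    applied: List[str] = field(default_factory=list)  # steps run so far


Step = Callable[[Page], Page]
TOOLSET: Dict[str, Step] = {}  # the plug-in registry


def register_placeholder(name: str) -> None:
    """Register a stand-in tool that only records that it ran."""
    def step(page: Page) -> Page:
        page.applied.append(name)
        return page
    TOOLSET[name] = step


for tool in ("deskew", "crop", "binarize", "classify", "name_fielding"):
    register_placeholder(tool)

# Hypothetical recipes: each collection lists only the steps it needs.
RECIPES: Dict[str, List[str]] = {
    "yearbook": ["deskew", "crop", "binarize", "name_fielding"],
    "pension_file": ["deskew", "classify"],
}


def run_pipeline(page: Page) -> Page:
    """Run the page through its collection's recipe, in order."""
    for name in RECIPES[page.collection]:
        page = TOOLSET[name](page)
    return page


yearbook_page = run_pipeline(Page("yearbook"))
pension_page = run_pipeline(Page("pension_file"))
```

Adding a new tool means registering one more function. Adding a new kind of document means writing one more recipe, and nothing else changes.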

Tomorrow we’ll finish up the technology presentation.

Mike Wolfgramm serves as senior vice president of development and is responsible for the development and delivery of Ancestry.com (the website). Wolfgramm has over fifteen years of experience in technology and product development. Prior to Ancestry.com, he served as senior architect and senior director of development at Open Market, Inc., where he managed the overall development of the Infobase technology application, which had more than 35 million customers worldwide. Wolfgramm began his career at WordPerfect in Orem, Utah. He is a graduate of Brigham Young University where he received a bachelor's degree in computer science.

Jonathan Young is vice president of development for Ancestry.com. He joined Ancestry.com from Earthlink, where he served as vice president of development and was responsible for development, testing, subscription, and billing platforms across multiple sites. Prior to Earthlink, Young spent ten years at Turner Internet Technologies, where he served as vice president of product development for Turner’s internet properties. Young earned his bachelor’s degree in astrophysics and Asian studies from Williams College.
