The Ancestry Insider: Ancestry.com Bloggers Day: Technology

Monday, February 1, 2010

Last year I intended to do stupendously rich articles about Ancestry.com Bloggers Day presentations. Since I never got around to it, this year you’re getting my stupidously poor notes.

Mike Wolfgramm, senior vice president, and Jonathan Young, vice president, of development gave us the last presentation prior to lunch. Their opening slide listed the six key agenda items below. In this article I’ll cover the first item. Tomorrow I’ll finish the remaining five items.

Key technologies we will talk about:

* Flexible content digitization pipeline

* Named entity extraction

* Vertical [unique to Ancestry.com] search engine

* Record linking

* Hint engine – technology behind the shaky leaf

* PersonRank – Search engine that powers Mundia (pronounced, “Moon-dia”)

It is a challenge to handle different document characteristics

* Documents are very different: census, Chinese family history, manuscript, newspaper, German directories

Flexible Content Pipeline “Dexter” – applies only the appropriate steps from a plug-in toolset:

Scan manager* – handles alternate multispectral scans.

- Showed the examples that I included last Monday.

- Kimberly Powell suggested including the spectrum type in the source metadata. I was not able to locate any alternate spectrum images online. I suppose these are yet to be published.

Auto de-skew*

Auto crop*

Image QA

Watermarking* – intelligent algorithms for placement of watermark

Table of contents tool

Auto leveling

Image conversion

Binarization* – converting a grayscale image to black and white [also called thresholding]

- Niblack’s method has 0.67% OCR error rate

- Ancestry.com method has 0.04%
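For readers unfamiliar with thresholding, here is a rough sketch of Niblack's method, the baseline named above. Each pixel is compared against a local threshold, T = mean + k × standard deviation, computed over a small window around it. The window size, the value k = -0.2, and the function name are my own illustrative assumptions. Nothing here reflects Ancestry.com's patented method, which was not described in the presentation.

```python
from statistics import mean, pstdev

def niblack_binarize(image, window=3, k=-0.2):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Returns rows of 0 (ink) and 1 (background).
    The window and k values are assumed defaults, not tuned settings.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gather the neighborhood, clipped at the image borders.
            neighbors = [
                image[ny][nx]
                for ny in range(max(0, y - half), min(height, y + half + 1))
                for nx in range(max(0, x - half), min(width, x + half + 1))
            ]
            # Niblack's local threshold: mean + k * standard deviation.
            threshold = mean(neighbors) + k * pstdev(neighbors)
            # Pixels at or above the threshold count as background.
            row.append(1 if image[y][x] >= threshold else 0)
        result.append(row)
    return result
```

On a tiny 3×3 image with one dark pixel in the middle of a light background, only the center pixel comes out as ink.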

Auto Classification

- Automatically classifies document types per page

- Routes desired page to next stage

- Wolfgramm mentioned an example from the British Army WWI Pension Records 1914-1920. Only the first page—the Attestation page—of a soldier’s pension file required indexing. Automatic classification distinguished these pages (below left) from other pages (below right).
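The routing idea can be sketched in a few lines. This is only a toy: the keyword test stands in for whatever classifier Ancestry.com actually uses, which was not described, and the function names and sample strings are made up for illustration.

```python
def classify_page(ocr_text):
    """Toy classifier: label a page by a keyword in its OCR text.

    Assumption: a real system would likely use image and layout
    features; this keyword rule is only a placeholder.
    """
    if "attestation" in ocr_text.lower():
        return "attestation"
    return "other"

def route_pages(pages):
    """Return only the pages that should go on to the indexing stage."""
    return [page for page in pages if classify_page(page) == "attestation"]

# Made-up sample pages from one hypothetical pension file.
sample_file = ["Attestation of John Doe", "Medical history", "Statement of services"]
pages_to_index = route_pages(sample_file)
```

With the sample file above, only the first page is routed to indexing.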

Auto name fielding* is necessary because computers are extremely stupid and must be taught to recognize names. This works best for consistently formatted pages. Yearbooks are an example. The computer must be taught that the words to the left of the pictures are names, family name first, and then given names (below).

Even though the computer is taught how to look for names, it still has a hard time. Because OCR is used instead of indexers, not all names are recognized. I tested the example above and found Eileen Draper and Joan Clark were recognized, but not John Earl or Mary Jo Ellis.
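To give a feel for what "teaching" the computer might involve, here is a minimal sketch of one fielding rule for the yearbook layout described above. It assumes the OCR has already produced a line of text with the family name first. The function name, the optional comma, and the letters-only check are my assumptions, not Ancestry.com's algorithm.

```python
def field_name(ocr_line):
    """Split 'Family Given [Given ...]' into labeled name fields.

    Returns None when the line does not look like a name.
    Assumption: names contain only letters, so a name with an
    apostrophe or hyphen would be rejected by this simple rule.
    """
    words = ocr_line.replace(",", " ").split()
    if len(words) < 2 or not all(word.isalpha() for word in words):
        return None
    return {"family": words[0], "given": " ".join(words[1:])}
```

A rule this simple also shows why names get missed: one misread character from the OCR is enough to make the line fail the check.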

Keying tool*

Fact extraction

Inference engine – advanced algorithms to create inferences from existing data. Examples: infer birth year from age, infer mother from wife, infer father from son-in-law.
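The first of those inferences is simple arithmetic, sketched below. The function name is hypothetical, and returning a two-year range is my own choice to account for whether the person's birthday had already passed when the record was made.

```python
def infer_birth_year(record_year, age):
    """Return (earliest, latest) possible birth year for a stated age.

    Someone aged `age` on a day in `record_year` was born in
    record_year - age, or one year earlier if the birthday had not
    yet occurred that year.
    """
    latest = record_year - age
    return (latest - 1, latest)
```

For example, a person listed as 30 years old in a 1910 record was born in 1879 or 1880.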

Field normalization - [I think this is the same as “Named entity extraction,” below.]

And more…

Dexter tools, above, that are followed by an asterisk * are technologies with patents or patent applications.
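The "apply only the appropriate steps from a plug-in toolset" idea behind Dexter can be sketched as a registry of named steps plus a per-document-type recipe. Everything below is an assumption for illustration: the step names are borrowed from the list above, but the recipes, the document dictionary, and the stub functions (which only record that they ran) are invented.

```python
# Hypothetical plug-in registry: step name -> function on a document dict.
STEPS = {}

def step(name):
    """Decorator that registers a function as a named pipeline step."""
    def register(func):
        STEPS[name] = func
        return func
    return register

# Each stub only appends its name to the document's history.
@step("deskew")
def deskew(doc):
    return {**doc, "history": doc["history"] + ["deskew"]}

@step("crop")
def crop(doc):
    return {**doc, "history": doc["history"] + ["crop"]}

@step("binarize")
def binarize(doc):
    return {**doc, "history": doc["history"] + ["binarize"]}

# Assumed recipes: which steps suit which kind of document.
RECIPES = {
    "census": ["deskew", "crop"],
    "newspaper": ["deskew", "crop", "binarize"],
}

def run_pipeline(doc):
    """Apply only the steps listed in the recipe for this document type."""
    for name in RECIPES[doc["type"]]:
        doc = STEPS[name](doc)
    return doc
```

Adding a new tool then means registering one more function and listing it in the recipes that need it, without touching the other steps.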

Tomorrow we’ll finish up the technology presentation.

Mike Wolfgramm serves as senior vice president of development and is responsible for the development and delivery of Ancestry.com (the website). Wolfgramm has over fifteen years of experience in technology and product development. Prior to Ancestry.com, he served as senior architect and senior director of development at Open Market, Inc., where he managed the overall development of the Infobase technology application which had more than 35 million customers worldwide. Wolfgramm began his career at WordPerfect in Orem, Utah. He is a graduate of Brigham Young University where he received a bachelor's degree in computer science.

Jonathan Young is vice president of development for Ancestry.com. He joined Ancestry.com from Earthlink, where he served as vice president of development and was responsible for development, testing, subscription, and billing platforms across multiple sites. Prior to Earthlink, Young spent ten years at Turner Internet Technologies, where he served as vice president of product development for Turner’s internet properties. Young earned his bachelor’s degree in astrophysics and Asian studies from Williams College.

1 comment:

cactus7, October 29, 2010 at 11:02 AM
What will Ancestry have for smart phones?

Biography

The Ancestry Insider was a readers’ choice for the top four genealogy news and resources blogs, part of Family Tree Magazine’s “40 Best Genealogy Blogs” for 2010. He reports on the two big genealogy organizations, Ancestry.com and FamilySearch. He was named a “Most Popular Genealogy Blogs” by ProGenealogists, and has received Family Tree Magazine’s “101 Best Web Sites” award every year since 2008. A genealogical technologist, the Insider has a post-graduate technology degree and holds a dozen technology patents in the United States and abroad. He has done genealogy since 1972 and has worked in the computer industry since 1978. He was Time Magazine Man of the Year in both 1966 and 2006. And he really is descended from an Indian princess.
