Last year I intended to do stupendously rich articles about Ancestry.com Bloggers Day presentations. Since I never got around to it, this year you’re getting my stupidously poor notes.
Mike Wolfgramm, senior vice president of development, and Jonathan Young, vice president of development, gave us the last presentation prior to lunch. Their opening slide listed the six key agenda items below. In this article I’ll cover the first item. Tomorrow I’ll finish the remaining five items.
Key technologies we will talk about:
* Flexible content digitization pipeline
* Named entity extraction
* Vertical [unique to Ancestry.com] search engine
* Record linking
* Hint engine – technology behind the shaky leaf
* PersonRank – Search engine that powers Mundia (pronounced, “Moon-dia”)
It is a challenge to handle different document characteristics:
* Documents are very different: census, Chinese family history, manuscript, newspaper, German directories
Flexible Content Pipeline “Dexter” – applies only the appropriate steps from a plug-in toolset:
Scan manager* – handles alternate multispectral scans.
- Showed the examples that I included last Monday.
- Kimberly Powell suggested including the spectrum type in the source metadata. I was not able to locate any alternate spectrum images online. I suppose these are yet to be published.
Watermarking* – intelligent algorithms for placement of watermark
- Ancestry.com method has 0.04%
- Automatically classifies document types per page
- Routes desired page to next stage
- Wolfgramm mentioned an example from the British Army WWI Pension Records 1914-1920. Only the first page—the Attestation page—of a soldier’s pension file required indexing. Automatic classification distinguished these pages (below left) from other pages (below right).
Auto name fielding* is necessary because computers are extremely stupid and must be taught to recognize names. This works best for consistently formatted pages. Yearbooks are an example. The computer must be taught that the words to the left of the pictures are names, family name first, and then given names (below).
Even though the computer is taught how to look for names, it still has a hard time. Because OCR is used instead of indexers, not all names are recognized. I tested the example above and found Eileen Draper and Joan Clark were recognized, but not John Earl or Mary Jo Ellis.
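To make the "family name first, then given names" rule concrete, here is a minimal sketch of what name fielding on a yearbook caption might look like. This is my own illustration, not Ancestry.com's patented method. The function name `field_name`, the `FieldedName` type, and the assumption that the OCR text arrives as a single caption such as "Ellis Mary Jo" or "Draper, Eileen" are all hypothetical.

```python
from typing import NamedTuple, Optional


class FieldedName(NamedTuple):
    """Hypothetical container for a name split into fields."""
    family: str
    given: str


def field_name(caption: str) -> Optional[FieldedName]:
    """Split an OCR'd caption into family and given names.

    Assumes the yearbook layout described above: family name first,
    then given names. A comma after the family name is optional.
    Returns None when the caption does not contain both parts.
    """
    # Collapse runs of whitespace that OCR often introduces.
    text = " ".join(caption.split())
    if "," in text:
        family, _, given = text.partition(",")
    else:
        family, _, given = text.partition(" ")
    family, given = family.strip(), given.strip()
    if not family or not given:
        return None
    return FieldedName(family=family, given=given)


print(field_name("Ellis  Mary Jo"))
print(field_name("Draper, Eileen"))
print(field_name("Clark"))
```

A rule this simple only works because the page is consistently formatted. A multi-word family name, or a caption the OCR misreads, would defeat it, which fits with some names on the sample page not being recognized.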
Inference engine – advanced algorithms to create inferences from existing data. Examples: infer birth year from age, infer mother from wife, infer father from son-in-law.
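Two of these inferences are easy to sketch. The code below is only my guess at the general idea, not the presenters' algorithm. The function names, the household record layout (a list of dicts with `name` and `relation` keys), and the rule of skipping households with more than one wife are all assumptions for illustration.

```python
from typing import Dict, List, Tuple


def infer_birth_year_range(record_year: int, age: int) -> Tuple[int, int]:
    """Infer the possible birth years from an age stated on a record.

    Someone aged `age` on the record date was born in `record_year - age`
    if their birthday had already passed that year, otherwise one year
    earlier, so the inference is a two-year range.
    """
    return (record_year - age - 1, record_year - age)


def infer_mothers(household: List[Dict[str, str]]) -> Dict[str, str]:
    """Infer a likely mother for each child from the head's wife.

    Relations are assumed to be stated relative to the head of household.
    The result is only a hint: the wife could be a stepmother.
    """
    wives = [m["name"] for m in household if m["relation"] == "wife"]
    if len(wives) != 1:
        return {}  # ambiguous or no wife listed: infer nothing
    return {
        m["name"]: wives[0]
        for m in household
        if m["relation"] in ("son", "daughter")
    }


household = [
    {"name": "John Earl", "relation": "head"},
    {"name": "Mary Earl", "relation": "wife"},
    {"name": "Joan Earl", "relation": "daughter"},
]
print(infer_birth_year_range(1910, 30))
print(infer_mothers(household))
```

Both results are best treated as hints rather than facts. The birth year is a range, and the mother inference fails for stepfamilies.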
Field normalization - [I think this is the same as “Named entity extraction,” below.]
Dexter tools, above, that are followed by an asterisk * are technologies with patents or patent applications.
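The idea that Dexter "applies only the appropriate steps from a plug-in toolset" can be pictured as a list of steps, each of which declares which documents it applies to. The sketch below is my own illustration and assumes nothing about how Dexter is really built. The `Document` and `Step` classes, the step names, and the rules for which document kinds get which steps are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    kind: str  # e.g. "census", "yearbook", "newspaper"
    log: List[str] = field(default_factory=list)  # steps that ran


@dataclass
class Step:
    name: str
    applies_to: Callable[[Document], bool]  # should this step run?
    run: Callable[[Document], None]         # the work itself


def make_step(name: str, kinds: List[str]) -> Step:
    """Build a placeholder step that just records that it ran.

    An empty `kinds` list means the step applies to every document.
    """
    return Step(
        name=name,
        applies_to=lambda doc: not kinds or doc.kind in kinds,
        run=lambda doc: doc.log.append(name),
    )


# Hypothetical toolset; the real tools and their order are not known to me.
TOOLSET = [
    make_step("watermark", []),
    make_step("name_fielding", ["yearbook"]),
    make_step("inference", ["census"]),
]


def run_pipeline(doc: Document, steps: List[Step]) -> Document:
    """Run only the steps that declare themselves applicable."""
    for step in steps:
        if step.applies_to(doc):
            step.run(doc)
    return doc


print(run_pipeline(Document("yearbook"), TOOLSET).log)
print(run_pipeline(Document("census"), TOOLSET).log)
```

With this shape, supporting a new document type means adding or adjusting steps rather than writing a new pipeline, which is presumably the point of calling it flexible.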
Tomorrow we’ll finish up the technology presentation.
Mike Wolfgramm serves as senior vice president of development and is responsible for the development and delivery of Ancestry.com (the website). Wolfgramm has over fifteen years of experience in technology and product development. Prior to Ancestry.com, he served as senior architect and senior director of development at Open Market, Inc., where he managed the overall development of the Infobase technology application which had more than 35 million customers worldwide. Wolfgramm began his career at WordPerfect in Orem, Utah. He is a graduate of Brigham Young University where he received a bachelor's degree in computer science.
Jonathan Young is vice president of development for Ancestry.com. He joined Ancestry.com from Earthlink, where he served as vice president of development and was responsible for development, testing, subscription, and billing platforms across multiple sites. Prior to Earthlink, Young spent ten years at Turner Internet Technologies, where he served as vice president of product development for Turner’s internet properties. Young earned his bachelor’s degree in astrophysics and Asian studies from Williams College.