Wednesday, May 25, 2016

Ancestry.com Preparing Large German Record Collection

[Image: Fraktur font city directory sample]

And I’m guessing it is German city directories.

A number of years ago Ancestry.com developed technology allowing its computers to “read” (technically called OCR, or optical character recognition) and interpret U.S. city directories. (See “Data Extraction Technology at Ancestry.com.”) While the technology sometimes produces silly results, overall, it allowed Ancestry to publish over two billion records in record time and at minimal cost. The tradeoff seems reasonable.

Ancestry has nearly 700 German city directories but has yet to apply the technology to them. “The main challenge with German is the common use of the Gothic or Fraktur font when printing,” says Laryn Brown, Ancestry product manager. “This special script-like font is particularly difficult to recognize using the best OCR [optical character recognition] tools available today. Words and especially names can be misread as many of the characters in this special font are extremely similar.”

Ancestry’s solution is a quality-assurance check that compares the names the computer thinks it’s reading against a list of known names. If the computer reads something that is not in the name list, a reviewer is alerted. If the flagged text is a genuine name that is simply missing from the list, the reviewer adds it. Otherwise, the reviewer maps it to the correct name or deletes it. The results of these reviews are fed back to the computer so it can learn from its mistakes. It then re-reads the books and the process is repeated.
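For readers who like to see the idea in code, here is a minimal sketch of that review loop. It is only an illustration under my own assumptions, not Ancestry's actual system: the `NameChecker` class, its method names, the `("add" | "map" | "delete")` decision format, and the sample names are all hypothetical, and the human reviewer is simulated by a plain dictionary of decisions.

```python
from dataclasses import dataclass, field


@dataclass
class NameChecker:
    """Hypothetical name-list check with reviewer feedback (illustration only)."""
    known_names: set = field(default_factory=set)
    corrections: dict = field(default_factory=dict)  # misreading -> correct name
    rejected: set = field(default_factory=set)       # readings a reviewer deleted

    def check(self, ocr_names):
        """Split OCR readings into accepted names and readings needing review."""
        accepted, flagged = [], []
        for name in ocr_names:
            if name in self.known_names:
                accepted.append(name)
            elif name in self.corrections:
                # A reviewer already mapped this misreading to a real name.
                accepted.append(self.corrections[name])
            elif name in self.rejected:
                # A reviewer already decided this reading is not a name.
                continue
            else:
                flagged.append(name)
        return accepted, flagged

    def apply_review(self, decisions):
        """Feed reviewer decisions back so the next pass can use them.

        Each decision is ("add", None), ("map", correct_name) or ("delete", None).
        """
        for name, (action, target) in decisions.items():
            if action == "add":
                self.known_names.add(name)
            elif action == "map":
                self.corrections[name] = target
            elif action == "delete":
                self.rejected.add(name)
            else:
                raise ValueError(f"unknown action: {action}")


# First pass: three readings are not in the name list and get flagged.
checker = NameChecker(known_names={"Müller", "Schmidt"})
ocr_names = ["Müller", "Mütler", "Schneider", "§§§"]
accepted, flagged = checker.check(ocr_names)

# Simulated reviewer: fix one misreading, add one real name, delete the junk.
checker.apply_review({
    "Mütler": ("map", "Müller"),
    "Schneider": ("add", None),
    "§§§": ("delete", None),
})

# Second pass over the same readings: nothing is left to review.
accepted_again, flagged_again = checker.check(ocr_names)
```

In this sketch the "learning" is nothing more than remembering reviewer decisions between passes. The article does not say how the real system uses the feedback, so treat that part as a stand-in.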

“When these records are finished, a random list of German words in a very difficult-to-read font will have been turned into a set of records about people that looks a lot like an annual census,” said Laryn.

This new collection will become available over several years and will include millions of pages of new content.

3 comments:

  1. You have a kind and generous heart. I have found hundreds of completely absurd errors--whole pages of them, at times--in the city directories. It is one of the things that makes me see red. It's not a few silly mistakes here and there--it's reams of misinformation! Names, places, occupations--there was clearly no attempt to ensure that the readings were correct by checking them against humans. I don't know why the Germans get better treatment.

  2. And Ancestry will command a huge fee for using these databases, just like they always do.

  3. Here is an example from last night, from the Canada Voter's list from 1935-1980--and I've seen far worse, as at least the names were spelled correctly:

    The actual directory: [Name] Ivan Proud, [Occupation] soldier, [Location] 542 Boundary Road, Renfrew, Nipissing, Ontario, Canada

    Ancestry transcription: [Name] Ivan Soldier Proud, [Occupation] Boundary Road, [Location] Renfrew, Nipissing, Ontario, Canada

    His wife was transcribed as [Name] Mrs Christine Boundary Proud, [Occupation] Road [Location] Renfrew, Nipissing, Ontario, Canada

    However, at least they got transcribed, which is more than one can say for their son, George Proud, [Occupation] student, who simply got left off completely. It can't even be corrected, since he's not on the transcription list. Which happens a LOT.
