A number of years ago Ancestry.com developed technology that lets its computers “read” and interpret U.S. city directories, a process technically called optical character recognition, or OCR. (See “Data Extraction Technology at Ancestry.com.”) While the technology sometimes produces silly results, overall it allowed Ancestry to publish over two billion records in record time and at minimal cost. The tradeoff seems reasonable.
Ancestry has nearly 700 German city directories but has yet to apply the technology to them. “The main challenge with German is the common use of the Gothic or Fraktur font when printing,” says Laryn Brown, Ancestry product manager. “This special script-like font is particularly difficult to recognize using the best OCR [optical character recognition] tools available today. Words and especially names can be misread as many of the characters in this special font are extremely similar.”
Ancestry’s solution is a quality-assurance check that compares the names the computer thinks it’s reading against a list of known names. If the computer reads something that is not in the name list, a reviewer is alerted. If the flagged text is a genuine name that is simply missing from the list, the reviewer adds it. Otherwise, the reviewer maps the misread text to the correct name or deletes it. The results of these reviews are fed back to the computer so it can learn from its mistakes. It then re-reads the books and the process is repeated.
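For readers who think in code, the review loop described above can be sketched roughly as follows. This is only an illustration, not Ancestry’s actual system: the class name `NameChecker`, its methods, the three review actions, and the sample names are all hypothetical, and real OCR feedback would involve retraining the recognizer rather than a simple lookup table.

```python
from dataclasses import dataclass, field


@dataclass
class NameChecker:
    """Hypothetical sketch of a name-list quality check with reviewer feedback."""
    known_names: set = field(default_factory=set)    # names accepted as valid
    corrections: dict = field(default_factory=dict)  # misread text -> correct name
    rejected: set = field(default_factory=set)       # text a reviewer deleted

    def check(self, ocr_names):
        """Split OCR output into accepted names and text flagged for review."""
        accepted, flagged = [], []
        for raw in ocr_names:
            if raw in self.rejected:
                continue  # a reviewer already decided this is not a name
            name = self.corrections.get(raw, raw)  # apply earlier corrections
            if name in self.known_names:
                accepted.append(name)
            else:
                flagged.append(raw)  # not in the name list: alert a reviewer
        return accepted, flagged

    def apply_review(self, raw, action, correct_name=None):
        """Feed one reviewer decision back so the next pass benefits from it."""
        if action == "add":        # a real name that was missing from the list
            self.known_names.add(raw)
        elif action == "map":      # a misreading of some other name
            if correct_name is None:
                raise ValueError("a 'map' decision needs the correct name")
            self.corrections[raw] = correct_name
            self.known_names.add(correct_name)
        elif action == "delete":   # not a name at all
            self.rejected.add(raw)
        else:
            raise ValueError(f"unknown review action: {action}")


# First pass over one (made-up) page of OCR output.
checker = NameChecker(known_names={"Müller", "Schmidt"})
page = ["Müller", "Schmibt", "Brandt", "Nr."]
accepted, flagged = checker.check(page)

# A reviewer handles each flagged item, then the page is read again.
checker.apply_review("Schmibt", "map", "Schmidt")
checker.apply_review("Brandt", "add")
checker.apply_review("Nr.", "delete")
accepted_again, flagged_again = checker.check(page)
```

In this toy run the first pass accepts only “Müller” and flags the other three items; after the reviewer’s decisions are fed back, the second pass accepts three names and flags nothing, which mirrors the re-read-and-repeat cycle described above.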
“When these records are finished, a random list of German words in a very difficult-to-read font will have been turned into a set of records about people that looks a lot like an annual census,” said Brown.
This new collection will become available over several years and will include millions of pages of new content.