Thursday, August 27, 2015

The Future Will Bring Automated Indexing Tools – #BYUFHGC

Jake Gehring presenting at the 2015 BYU Conference on Family History and Genealogy

“It’s not that we don’t like our [indexing] volunteers,” said Jake Gehring. “We would just rather have them work on things that only [humans] can do.” Jake is director of content development for FamilySearch and presented at the BYU Conference on Family History and Genealogy last month. This article is the third and last article about his presentation. In the first article I reported on Jake’s premise that FamilySearch Indexing is not keeping up with the number of records FamilySearch is acquiring and additional means are needed. In the second article I reported about two of those means: increasing the efficiency of human indexers and working with commercial partners. In today’s article I will report on the third means: increased automation via computers.

In the third part of his presentation, Jake spoke about “the really far-out stuff, HAL9000 kind of stuff.”

Jake showed a screen shot that we saw in Robert Kehrer’s keynote. (See “Kehrer Talks FamilySearch Transformations” on my blog.) The screen showed a color-coded obituary.

Obituary with parts of speech color coded by FamilySearch automated obituary indexing system

FamilySearch trained a computer to identify the different parts of speech. They trained the computer how to discern meaning out of a bunch of words. Notice in the example above that names of people are identified in dark green, places in brown, dates in dark blue, relationships in salmon, events in pale green, clock times in a steel blue (or would you call that a dark sky blue?), organizations in red, and buildings in goldenrod (or would you call that a mustard?).

They basically teach the computer to read. The computer is willing to extract a lot more detail from an obituary than a volunteer can easily do. And it can work really, really fast. For obituaries, computers can do in about a week and a half what it takes all of FamilySearch’s volunteers three and a half years to do. This is why in a few weeks FamilySearch is going to stop having volunteers index the current obituary project. In fact, FamilySearch has already published about 37 million obituaries this way. You may already have found and used an obituary that was indexed by a smart computer.
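
For a sense of scale, here is a quick back-of-the-envelope calculation of the speedup those two figures imply. It is only a sketch: the function and variable names are mine, and it assumes 52 weeks per year and treats "a week and a half" and "three and a half years" as exact numbers, which Jake presented only as approximations.

```python
# Rough speedup implied by the figures quoted in the presentation.
# Assumption: 52 weeks per year; both durations are approximate.
WEEKS_PER_YEAR = 52


def speedup(volunteer_years: float, computer_weeks: float) -> float:
    """Return how many times faster the computer finishes the same work."""
    volunteer_weeks = volunteer_years * WEEKS_PER_YEAR
    return volunteer_weeks / computer_weeks


# Three and a half years of volunteer effort vs. a week and a half of computer time.
ratio = speedup(volunteer_years=3.5, computer_weeks=1.5)
print(f"Computers are roughly {ratio:.0f} times faster")
```

In other words, on obituaries the computers work on the order of a hundred times faster than the entire volunteer force.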

This applies to obituaries published since about 1977. Since that time, most obituaries have been published and stored digitally. Before 1977, things look a lot different. Because the obituaries are not already digital, it is a pretty nasty OCR problem. [OCR converts the printed page to text so that the computer can subsequently try to make sense of it.] The problem is so severe that computers can recognize only about half of the words in pre-1900 newspapers.

If you were at RootsTech you may have seen the last thing Jake showed. A company named Planet entered its ArgusSearch into the Innovator Challenge. ArgusSearch is a system that reads the handwriting of documents that have not been indexed. You type in something like “Steinberg” and the program shows some records that might match that name. It won’t find all the matches. And it may return some results that aren’t matches. But this is still useful. This technology is still young, but an application like this is likely to hit real life in the next ten years.

Planet's ArgusSearch automatically read handwritten names in census records without an index.

Jake summarized by saying that while indexing is going really well (never better), it is unfortunately just not good enough to give us all the records we need. [FamilySearch does not index all the records they acquire.] “We need to do much better. It’s not that we are not quite there; we are way behind and getting further behind every year,” he said. There are three areas that FamilySearch needs to utilize. FamilySearch needs to increase the efficiency of its indexing volunteers. FamilySearch needs more help from for-profit publishers who can bring more resources to the table. And FamilySearch needs to use computer technology to make images searchable with little or no human intervention.

“It’s an exciting time to be alive. Can you imagine the explosion of document availability once we make a bit more headway in a few of these areas?”

Jake took a couple of questions:

Q. How easy is it to use tools like Google Translate to translate Spanish records?

A. Google Translate is better at modern, generic words. If you type in the text of a letter, you would be able to get the gist of it, but it may not handle archaic words or words specific to a vital record. As long as you know a small set of terms, you can usually get by without a computerized translator. There is no magic tool currently available.

Q. Why do we sometimes key so very little from a record? While we have someone looking at the document, shouldn’t they be extracting more?

A. Because we publish both indexes and images, we index the minimal amount necessary to find the image. Why index something that no one will ever use in a search? Cook County, Illinois death certificates are an example where we indexed something that didn’t need to be. We indexed the deceased’s address, but who will ever search using the address? Sometimes we don’t get it quite right, but that’s the general principle.

Q. When will we be able to correct published indexes?

A. After ten years of this being in the top three requested features, we’re starting to implement the feature that will allow you to contribute corrections. We are rapidly approaching the point when this will be available. I’m not really authorized to say “soon,” but we have our eyes on that feature.

3 comments:

  1. I took a Family History course at BYU in about 2001 or so and the teacher didn't like it or me very much when I insisted that OCR would eventually be able to read handwritten documents.

    It'll never be perfect, but it sure is exciting to see it coming!

  2. I disagree with the comment regarding addresses. I have used searching by address productively searching the newspapers at Fultonhistory.com. I would do so on ancestry or familysearch if it were available.

  3. I recently started work on a friend's Finnish family history and must say I find Google Translate and every other language translator currently online to be 98% inadequate for translating Finnish used just 3 generations ago. More puzzling is the fact they return significantly different "translations" of exactly the same word or phrase - for instance, what Google calls "chickens" another translator calls "people". Have to think the same applies to other languages pre-WW II, or any language that has modernized/standardized into what's considered the current or modern version.

    I especially feel for anyone trying to research in Mexico and Ireland, where little that's survived violence and natural disasters is centralized and most material is still in handwritten parish or village records.
