“It’s not that we don’t like our [indexing] volunteers,” said Jake Gehring. “We would just rather have them work on things that only [humans] can do.” Jake is director of content development for FamilySearch and presented at the BYU Conference on Family History and Genealogy last month. This article is the third and last article about his presentation. In the first article I reported on Jake’s premise that FamilySearch Indexing is not keeping up with the number of records FamilySearch is acquiring and additional means are needed. In the second article I reported about two of those means: increasing the efficiency of human indexers and working with commercial partners. In today’s article I will report on the third means: increased automation via computers.
In the third part of his presentation, Jake spoke about “the really far-out stuff, HAL9000 kind of stuff.”
Jake showed a screen shot that we saw in Robert Kehrer’s keynote. (See “Kehrer Talks FamilySearch Transformations” on my blog.) The screen showed a color-coded obituary.
FamilySearch trained a computer to identify the different parts of speech. They trained the computer to discern meaning from a bunch of words. Notice in the example above that names of people are identified in dark green, places in brown, dates in dark blue, relationships in salmon, events in pale green, clock times in a steel blue (or would you call that a dark sky blue?), organizations in red, and buildings in goldenrod (or would you call that a mustard?).
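To give a feel for what labeling pieces of an obituary means, here is a minimal sketch in Python. It is not FamilySearch's system, which was trained rather than hand-written; it is a toy that uses hand-made patterns and covers only three of the categories in the legend (dates, clock times, and relationships). The function name `tag_obituary`, the word list, and the sample sentence are all made up for illustration.

```python
import re

# Hypothetical, tiny vocabulary of relationship words (an assumption for this sketch).
RELATIONSHIP_WORDS = {"wife", "husband", "son", "daughter",
                      "mother", "father", "brother", "sister"}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Matches dates written like "March 1, 2014".
DATE_RE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")
# Matches clock times written like "2:00 p.m." or "10:30 a.m.".
TIME_RE = re.compile(r"\b\d{1,2}:\d{2} ?(?:a\.m\.|p\.m\.)")


def tag_obituary(text):
    """Return a list of (label, matched_text) pairs found in text."""
    tags = []
    for match in DATE_RE.finditer(text):
        tags.append(("date", match.group()))
    for match in TIME_RE.finditer(text):
        tags.append(("clock time", match.group()))
    for match in re.finditer(r"[A-Za-z]+", text):
        if match.group().lower() in RELATIONSHIP_WORDS:
            tags.append(("relationship", match.group()))
    return tags


sample = ("John Smith died March 1, 2014. He is survived by his wife, Mary. "
          "Services begin at 2:00 p.m.")
for label, value in tag_obituary(sample):
    print(f"{label}: {value}")
```

Names, places, organizations, and buildings are deliberately left out. Those are the hard part, and they are the reason a trained system is needed instead of a handful of patterns.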
They basically teach the computer to read. The computer is willing to extract a lot more detail from an obituary than a volunteer can easily do. And it can work really, really fast. For obituaries, computers can do in about a week and a half what it takes all of FamilySearch’s volunteers three and a half years to do. This is why in a few weeks FamilySearch is going to stop having volunteers index the current obituary project. In fact, FamilySearch has already published about 37 million obituaries this way. You may already have found and used an obituary that was indexed by a smart computer.
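To put the week-and-a-half versus three-and-a-half-years comparison in perspective, here is the back-of-the-envelope arithmetic in Python. The only inputs are the two durations Jake quoted; the 365.25-day year is my assumption.

```python
DAYS_PER_WEEK = 7
DAYS_PER_YEAR = 365.25  # assumption: average year length

computer_days = 1.5 * DAYS_PER_WEEK    # "about a week and a half"
volunteer_days = 3.5 * DAYS_PER_YEAR   # "three and a half years"

# Ratio of the two durations for the same amount of obituary indexing.
speedup = volunteer_days / computer_days
print(f"Computers are roughly {speedup:.0f} times faster")
```

Since both figures are approximate, the takeaway is the order of magnitude: somewhere around 120 times faster.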
This applies to obituaries published since about 1977. Since that time, most obituaries have been published and stored digitally. Before 1977 the picture looks very different. Because the obituaries are not already digital, it is a pretty nasty OCR problem. [OCR converts the printed page to text so that the computer can subsequently try to make sense of it.] The problem is so severe that computers can recognize only about half of the words in pre-1900 newspapers.
If you were at RootsTech you may have seen the last thing Jake showed. A company named Planet entered its ArgusSearch into the Innovator Challenge. ArgusSearch is a system that reads the handwriting of documents that have not been indexed. You type in something like “Steinberg” and the program shows some records that might match that name. It won’t find all the matches. And it may return some results that aren’t matches. But this is still useful. This technology is still young, but an application like this is likely to hit real life in the next ten years.
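The idea of a search that returns "might match" results can be illustrated with a small Python sketch. This is only an analogy: ArgusSearch works on images of handwriting, and I do not know how it works internally. The sketch assumes the words have already been read by a machine (imperfectly) and uses the standard library's `difflib` to find approximate matches. The function name `search_records`, the record ids, and the sample words are hypothetical.

```python
import difflib


def search_records(query, records, cutoff=0.75):
    """Return ids of records containing a word that approximately matches query."""
    hits = []
    for record_id, words in records.items():
        lowered = [word.lower() for word in words]
        # get_close_matches returns the words whose similarity to the query
        # is at least `cutoff` (0.0 = anything, 1.0 = exact match only).
        if difflib.get_close_matches(query.lower(), lowered, n=1, cutoff=cutoff):
            hits.append(record_id)
    return hits


# Hypothetical machine-read words for four unindexed records.
records = {
    "r1": ["Johann", "Steinberg"],
    "r2": ["Anna", "Steinburg"],   # misread or variant spelling
    "r3": ["Karl", "Steiner"],     # a different but similar-looking surname
    "r4": ["Maria", "Schmidt"],
}

print(search_records("Steinberg", records))
```

A looser cutoff finds more misread spellings but also lets in look-alike names that are not matches, and a stricter one does the reverse. That is the same trade-off described above: some matches missed, some non-matches returned, and still useful.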
Jake summarized by saying that while indexing is going really well (never better), it is unfortunately just not good enough to give you all the records you need. [FamilySearch does not index all the records they acquire.] “We need to do much better. It’s not that we are not quite there; we are way behind and getting further behind every year,” he said. FamilySearch needs to make progress in three areas. It needs to increase the efficiency of its indexing volunteers. It needs more help from for-profit publishers who can bring more resources to the table. And it needs to use computer technology to make images searchable with little or no human intervention.
“It’s an exciting time to be alive. Can you imagine the explosion of document availability once we make a bit more headway in a few of these areas?”
Jake took a couple of questions:
Q. How easy is it to use tools like Google Translate to translate Spanish records?
A. Google Translate is better at modern, generic words. If you type in the text of a letter, you would be able to get the gist of it, but it may not handle archaic words or words specific to a vital record. As long as you know a small set of terms, you can usually get by without a computerized translator. There is no magic tool currently available.
Q. Why do we sometimes key so very little from a record? While we have someone looking at the document, shouldn’t they be extracting more?
A. Because we publish both indexes and images, we index the minimal amount necessary to find the image. Why index something that no one will ever use in a search? Cook County, Illinois death certificates are an example where we indexed something that didn’t need to be indexed. We indexed the deceased’s address, but who will ever search using the address? Sometimes we don’t get it quite right, but that’s the general principle.
Q. When will we be able to correct published indexes?
A. After ten years of this being one of the top three requested features, we’re starting to implement the feature that will allow you to contribute corrections. We are rapidly approaching the point when this will be available. I’m not really authorized to say “soon,” but we have our eyes on that feature.