Thursday, August 27, 2015

The Future Will Bring Automated Indexing Tools – #BYUFHGC

Jake Gehring presenting at the 2015 BYU Conference on Family History and Genealogy

“It’s not that we don’t like our [indexing] volunteers,” said Jake Gehring. “We would just rather have them work on things that only [humans] can do.” Jake is director of content development for FamilySearch and presented at the BYU Conference on Family History and Genealogy last month. This article is the third and last article about his presentation. In the first article I reported on Jake’s premise that FamilySearch Indexing is not keeping up with the number of records FamilySearch is acquiring and additional means are needed. In the second article I reported about two of those means: increasing the efficiency of human indexers and working with commercial partners. In today’s article I will report on the third means: increased automation via computers.

In the third part of his presentation, Jake spoke about “the really far-out stuff, HAL9000 kind of stuff.”

Jake showed a screen shot that we saw in Robert Kehrer’s keynote. (See “Kehrer Talks FamilySearch Transformations” on my blog.) The screen showed a color-coded obituary.

Obituary with parts of speech color coded by FamilySearch automated obituary indexing system

FamilySearch trained a computer to identify the different parts of speech. They trained the computer how to discern meaning out of a bunch of words. Notice in the example above that names of people are identified in dark green, places in brown, dates in dark blue, relationships in salmon, events in pale green, clock times in a steel blue (or would you call that a dark sky blue?), organizations in red, and buildings in goldenrod (or would you call that a mustard?).

They basically teach the computer to read. The computer is willing to extract a lot more detail from an obituary than a volunteer can easily do. And it can work really, really fast. For obituaries, computers can do in about a week and a half what it takes all of FamilySearch’s volunteers three and a half years to do. This is why in a few weeks FamilySearch is going to stop having volunteers index the current obituary project. In fact, FamilySearch has already published about 37 million obituaries this way. You may already have found and used an obituary that was indexed by a smart computer.

This applies to obituaries published since about 1977. Since that time, most obituaries have been published and stored digitally. Before 1977, things look a lot different. Because the obituaries are not already digital, it is a pretty nasty OCR problem. [OCR converts the printed page to text so that the computer can subsequently try to make sense of it.] The problem is so severe that computers can recognize only about half of the words in pre-1900 newspapers.

If you were at RootsTech you may have seen the last thing Jake showed. A company named Planet entered its ArgusSearch into the Innovator Challenge. ArgusSearch is a system that reads the handwriting of documents that have not been indexed. You type in something like “Steinberg” and the program shows some records that might match that name. It won’t find all the matches. And it may return some results that aren’t matches. But this is still useful. This technology is still young, but an application like this is likely to hit real life in the next ten years.

Planet's ArgusSearch automatically read handwritten names in census records without an index.

Jake summarized by saying that while indexing is going really well—never better—unfortunately, it is just not good enough to give us all the records you need. [FamilySearch does not index all the records they acquire.] “We need to do much better. It’s not that we are not quite there; we are way behind and getting further behind every year,” he said. There are three areas that FamilySearch needs to utilize. FamilySearch needs to increase the efficiency of its indexing volunteers. FamilySearch needs more help from for-profit publishers who can bring more resources to the table. And FamilySearch needs to use computer technology to make images searchable with little or no human intervention.

“It’s an exciting time to be alive. Can you imagine the explosion of document availability once we make a bit more headway in a few of these areas?”

Jake took a couple of questions:

Q. How easy is it to use tools like Google Translate to translate Spanish records?

A. Google Translate is better at modern, generic words. If you type in the text of a letter, you would be able to get the gist of it, but it may not handle archaic words or words specific to a vital record. As long as you know a small set of terms, you can usually get by without a computerized translator. There is no magic tool currently available.

Q. Why do we sometimes key so very little from a record? While we have someone looking at the document, shouldn’t they be extracting more?

A. Because we publish both indexes and images, we index the minimal amount necessary to find the image. Why index something that no one will ever use in a search? Cook County, Illinois death certificates are an example where we indexed something that didn’t need to be. We indexed the deceased’s address, but who will ever search using the address? Sometimes we don’t get it quite right, but that’s the general principle.

Q. When will we be able to correct published indexes?

A. After ten years of being in the top three requested features, we’re starting to implement the feature to allow you to contribute corrections. We are rapidly approaching the point when this will be available. I’m not really authorized to say “soon,” but we have our eyes on that feature.

Wednesday, August 26, 2015

FamilySearch Should Increase Indexing Efficiency and Utilize Partnerships

Jake Gehring presenting at the 2015 BYU Conference on Family History and Genealogy

FamilySearch is not keeping up with indexing the records it digitizes and improvements in three ways could help fix this, according to FamilySearch director of content development, Jake Gehring. Yesterday I presented the first part of my remarks about his presentation at the 2015 BYU Conference on Family History and Genealogy (#BYUFHGC). Today I’ll present the second part, covering the first two of the three ways, increasing efficiency and partnering. Tomorrow I’ll present the third way, increased use of computerization.

Today’s FamilySearch Indexing (FSI) system is somewhat inefficient. FSI primarily utilizes a double-blind indexing methodology, sometimes described as A+B+arbitrate. Two indexers independently index a batch of records. If there are any differences, even one letter in one record, the entire batch is sent to a third person to arbitrate between the two values, or supply a value of their own. It turns out that 97% of all batches have at least one difference, even though what is keyed is the same for 70% of the fields. As a result, almost all records are looked at by three people. There’s a good argument that that is wasteful. For certain kinds of records and certain kinds of people [and certain kinds of fields, I might add], only one keyer is sufficient. The accuracy doesn’t get any better when involving two more people. In the last year, FamilySearch switched to single keying for newspapers, since reading typeset material can usually be done without error. You wouldn’t want to do this for certain types of records or for beginning indexers.
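For readers who like to see the rule spelled out, here is a minimal sketch of the batch-level comparison described above. It is only an illustration, not FamilySearch’s actual code; the function names `needs_arbitration` and `field_agreement`, the record structure, and the sample names are all my own assumptions.

```python
from typing import Dict, List

# Hypothetical structure: one record maps a field name to the keyed value.
Record = Dict[str, str]


def needs_arbitration(batch_a: List[Record], batch_b: List[Record]) -> bool:
    """Return True if the two independent keyings differ anywhere in the batch."""
    if len(batch_a) != len(batch_b):
        return True
    return any(rec_a != rec_b for rec_a, rec_b in zip(batch_a, batch_b))


def field_agreement(batch_a: List[Record], batch_b: List[Record]) -> float:
    """Return the fraction of fields that both keyers entered identically."""
    total = 0
    same = 0
    for rec_a, rec_b in zip(batch_a, batch_b):
        for field in rec_a.keys() | rec_b.keys():
            total += 1
            if rec_a.get(field) == rec_b.get(field):
                same += 1
    return same / total if total else 1.0


# Made-up sample batch: the keyers disagree on a single given name.
keyer_a = [{"given": "Henry", "surname": "Flynn"}, {"given": "Grace", "surname": "Flynn"}]
keyer_b = [{"given": "Kerry", "surname": "Flynn"}, {"given": "Grace", "surname": "Flynn"}]

print(needs_arbitration(keyer_a, keyer_b))  # one differing field flags the whole batch
print(field_agreement(keyer_a, keyer_b))    # yet most fields still agree
```

In this toy batch three of four fields match, yet the whole batch still goes to an arbitrator. That is the same pattern as the 97% and 70% figures Jake cited.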

A more efficient methodology is referred to as A+review. One person keys the information and a second person reviews what is keyed. All the reviewer does is indicate whether the information is correct or not. This could easily be done, even on a cell phone. This method is about 40% more efficient than the double-blind methodology because FamilySearch knows when a record needs to be keyed a second time. FamilySearch is actively working on this kind of methodology to increase the efficiency of indexing.

Jake showed three entirely new, experimental types of indexing: keyboardless indexing, free-form indexing, and casual “micro-indexing.” Some do not even have working prototypes.

Jake showed an indexing system that allows productive use of devices without keyboards, such as smart phones. If you’ve used photo recognition in Photoshop, you have seen the paradigm before. He showed a slide showing 12 snippets of a name, such as “Henry.” (See my version, below.) These had been read from documents by a computerized handwriting recognition system. But since computers aren’t too good at reading handwriting, it presents its results to a person for verification. The person marks any that the computer got wrong. Where the computer had a good second guess, it could present that as well, allowing the person to select an alternate name, such as “Kerry.” For pre-printed forms, this works great and allows easy indexing on devices without keyboards, such as cell phones.
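The verification step could be modeled roughly as follows. This is a sketch under my own assumptions, since Jake showed only a slide; the `Snippet` class, the `apply_review` function, and the idea of passing the reviewer’s taps as two sets of ids are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Snippet:
    """One handwriting snippet with the computer's guesses (hypothetical structure)."""
    snippet_id: int
    best_guess: str
    second_guess: Optional[str] = None


def apply_review(snippets: List[Snippet], rejected: Set[int],
                 use_second: Set[int]) -> Dict[int, Optional[str]]:
    """Resolve each snippet from taps alone, with no typing.

    rejected: ids the reviewer marked as wrong.
    use_second: ids where the reviewer chose the alternate name.
    Returns id -> accepted name, or None if the snippet still needs keying.
    """
    results: Dict[int, Optional[str]] = {}
    for s in snippets:
        if s.snippet_id in use_second and s.second_guess is not None:
            results[s.snippet_id] = s.second_guess
        elif s.snippet_id in rejected:
            results[s.snippet_id] = None
        else:
            results[s.snippet_id] = s.best_guess
    return results


# Made-up page of three snippets the computer read as "Henry".
page = [
    Snippet(1, "Henry"),
    Snippet(2, "Henry", second_guess="Kerry"),
    Snippet(3, "Henry"),
]
resolved = apply_review(page, rejected={3}, use_second={2})
print(resolved)
```

Snippets the reviewer leaves alone keep the computer’s first guess, so the common case costs no input at all.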

Twelve handwritten name snippets, each indexed as Henry, as Kerry, or as Henry or Kerry

Jake showed the FamilySearch Pilot Tool, another indexing system for free-form indexing. It is currently live, as a pilot. A large portion of the screen is a browser showing a record on FamilySearch.org. Along the right side is a pane where an indexer can enter names, dates, and places extracted from the document. (See the screen shot, below.) A person would use the tool to index any record that they care about and a short time later the record would be searchable. You wouldn’t have to ask for anyone’s permission. You wouldn’t have to index all the names. Anyone could take any collection desired and do some indexing. This tool is in pilot right now. FamilySearch is very interested in tools that let you index as you go. To join the pilot, send Jake an email. (I see someone has also posted the link online. See “FamilySearch Pilots Web-Based Indexing Extension” on the Tennessee GenWeb website.) There is no arbitration. If you care enough to index the image, you probably care enough to be accurate. But that supposition is something yet to be validated.

The FamilySearch Pilot Tool for indexing

“Micro-indexing” could be used to make images more usable. It would be nice to be able to browse unindexed images more easily. FamilySearch is very interested in an upgrade to the current browse experience. Jake showed an animated artist’s rendition of a tool, reminding us that this is just a research and development idea.

FamilySearch is interested in making it easier to find records in images that have not yet been indexed.

In micro-indexing the system might ask you really simple questions, like, “What kind of record is this?” and have you click the record type. By asking volunteers to do tiny tasks, FamilySearch might be able to gather enough information to make images easy to browse by record type, locality, and time. Just because FamilySearch doesn’t have the time to index the images doesn’t mean they can’t be made easy to browse.

This is a mock-up of what a micro-indexing tool might look like.

In addition to talking about increasing the efficiency of indexing, Jake talked about partnering. FamilySearch is fine with the concept of trading data with other companies. FamilySearch provides images and the partner creates indexes. They may even get exclusive use of the indexes for a while. For example, a lot of Mexico church and civil records are being indexed right now by Ancestry.com. We all get the value of it eventually. FamilySearch has similar projects going on with Findmypast (I didn’t catch the project names) and MyHeritage (Danish census and church records, and Swedish household names). This increases the rate of indexing by bringing more indexers to the table.

Tuesday, August 25, 2015

FamilySearch Indexing Not Keeping Up – #BYUFHGC

Jake Gehring presenting at the 2015 BYU Conference on Family History and Genealogy

“FamilySearch just isn’t indexing records fast enough,” said Jake Gehring. “If that is the case,…then what do we do about it?” Jake is director of content development for FamilySearch and presented at the BYU Conference on Family History and Genealogy last month. Jake’s presentation was titled “FamilySearch Indexing, Robo-keying, and Partnering, Oh My!”

In the last little while Jake has been involved in some research and development, which is really rewarding. It’s fun to work on some things that may become real someday. He emphasized to me that these things may never become real, so keep that in mind as you read.

Back in the old days an index was that thing in the back of the book, not some multi-billion name index you can search from your home. “We index records so that they get used more,” he said. We gather records for the same purpose and have been doing so since 1938, he said. FamilySearch has about 280 cameras, roughly 40 in the United States and the rest abroad.

There have been huge improvements in the technology for capturing records and making them available. Jake showed an example, a Weber County, Utah marriage license. It is one of the rare collections that FamilySearch has captured twice. A scan from microfilm looks like this:

Weber County, Utah marriage license scanned from microfilm

FamilySearch went back recently and captured the records digitally, in color.

A Weber County, Utah marriage license that was digitized in color

Granted, viewing a record scanned from microfilm is often less clear than viewing it on a microfilm reader, but you can see the huge improvement.

FamilySearch does things so the captured images are easier to use. One of the things they have done from early on with books and microfilm was catalog them. A catalog entry can specify locations, authors, subjects, and so forth. For family histories, they might put in a list of surnames, but that was about it. You had to know what you were looking for to find the records you needed.

When you think about what we do now, things are quite a bit different. Indexes contain full names and direct you to individual images. We index (as FamilySearch calls extracting) more than names. We also capture dates and places and relationships. By doing this, not only can you search for them, but FamilySearch can recommend records to you. FamilySearch calls these hints.

There is a range of things that FamilySearch can do to make records more accessible. Some can be done with less cost than others. Jake showed a diagram showing treatments that can be made to a collection. With added accessibility comes added cost. Here is my version of his diagram, including my own definitions:

Increasing the usability of a digitized genealogy record increases the cost of publishing it.

Definitions:

  • catalog entry: a single entry for an entire collection
  • film notes: individual notes for each film
  • light waypointing: dividing the images of an entire collection into a few groups containing a large number of images
  • heavy waypointing: dividing the images into more specific groups with fewer images
  • light indexing: extracting a few, basic pieces of information from an image, perhaps just a name or a name and date
  • heavy indexing: extracting most genealogically significant information
  • lineage-linkage: using the record extracts to reconstruct families with links between parents, spouses, and children

Resources are limited. The more work invested in collections, the easier it is to use them, but the number of collections that can be published decreases. “The truth is that it is quite expensive to make collections very, conveniently searchable,” Jake said. “But it is still worth doing. In fact, we want to do it faster.”

The Church of Jesus Christ of Latter-day Saints, which owns FamilySearch, has been indexing for a long time in one way or another.

  • 1922 – Church employees started extracting information for the TIB, an early predecessor of the IGI. [I added this bullet point.]
  • 1961 – Church employees started extracting names from historical records at Church headquarters.
  • 1977 – Church members started extracting records at Church buildings via the stake records extraction program.
  • 1986 – Church members began the family records extraction program (FREP) which used data entry by members using home computers.
  • 1994 – The stake and family record extraction programs were consolidated.
  • 2006 – FamilySearch began using the current FamilySearch Indexing tool, utilizing Church members and the general public.

Jake showed the current application, FamilySearch Indexing. He then showed the new, browser-based tool that is now being rolled out. The tool allows the data entry pane to be positioned in various places, such as the left of the screen or the top. One data entry mode, when field positions are well defined, allows data entry overtop of the image itself.

Jake showed statistics of the number of indexing volunteers since 2006.

FamilySearch Indexing Volunteers, by Year

This graph should be encouraging to everyone. Compare this to the number of people—probably 15,000—indexing the 1880 census some 20 to 25 years ago. The big explosion in indexers in 2012 was because of the 1940 census. This year FamilySearch is on track to have more volunteers than for the 1940 census project.

“It is a wonderful, exciting program to be a part of. You have the satisfaction of knowing that you and 350,000 of your closest friends are all working together to make documents more usable,” Jake said.

He compared the indexing project of the 1880 census to that of the 1940.

1880 US Census Index  | 1940 US Census Index
Index only            | Index and images
56 CD-ROMs            | Web
50 million records    | 132 million records
17 years to complete  | 150 days

But this amount of indexing is not good enough.

“Do the math,” Jake said. FamilySearch captures about 150 million images in the field each year. FamilySearch is also scanning the microfilm out of the Granite Mountain Record Vault. This year FamilySearch expects to scan about 300 million images. On average there are four to five records per image. That amounts to about 2 billion records digitized just this year. And Jake expects to have the same amount next year. But we are only indexing about 250 million per year. That’s only 12% of the records “brought in the door.”
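Taking Jake at his word, here is the math as a short script. The numbers come straight from his remarks; the variable names are mine, and I use his rounded figure of 2 billion for the final percentage.

```python
field_images = 150_000_000       # images captured in the field each year
vault_images = 300_000_000       # microfilm images scanned from the vault this year
records_per_image = (4, 5)       # "four to five records per image"
indexed_per_year = 250_000_000   # records indexed per year

total_images = field_images + vault_images
low, high = (total_images * r for r in records_per_image)

# Jake rounded the range above to "about 2 billion" records.
rounded_total = 2_000_000_000
share_indexed = indexed_per_year / rounded_total

print(f"{total_images:,} images digitized")
print(f"{low:,} to {high:,} records")
print(f"{share_indexed:.1%} of incoming records indexed")
```

The range works out to 1.8 to 2.25 billion records, and the indexed share to 12.5%, which matches the roughly 12% Jake quoted.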

“Now do you see why I say, we’re not going fast enough?”

Additionally, FamilySearch is trying to increase the number of cameras. Then they will be even more in the hole. Since 90% of indexing is English, the situation is far worse for non-English records.

“If we want genealogy records to be more helpful more quickly to more people, we need to look at other ways of indexing,” Jake said. He spoke about three ways that might accelerate the number of indexed records: efficiency, collaboration, and computerized assistance.

Tomorrow I’ll report on what Jake said about increasing efficiency and using collaboration. Thursday I’ll finish reporting about his presentation.

Thursday, August 20, 2015

Lisa Elzey #BYUFHGC Presentation, Part 2

Yesterday I wrote the first part of my report about Lisa Elzey’s presentation at the 2015 BYU Conference on Family History and Genealogy. She titled her presentation, “Ancestry.com: How the Records Tell the Story.” Today, I continue with part 2.

3. Analyze the Details

Timelines, dates, and historical events can be used to analyze the details.

Lisa used an Excel spreadsheet for one example timeline. She had columns for dates, places, comments, and sources. It looked like some of her sources hyperlinked right to the sources. The new Ancestry has a built-in timeline which can be helpful.

Analyze why dates are important. Compare to calendar events, major holidays, community events, and seasons. Compare dates to those of historical events. In the new Ancestry, Life Story includes historical events.

Always ask yourself why your family members did what they did. Look for changes, such as disappearance, immigration, change in economics, and first-time occurrences such as literacy and property ownership. Look for differences, such as religion, ethnicity, language, economic standing, race, and age. As an example, Lisa showed the family of John and Tersa Flynn in the 1900 census in Seattle. John was a master mariner. His oldest son, George, was born at sea. The next two children, Maud and Marguerita, were born off the coast of Peru. Next, Evelin was born at sea, Edeth in Calcutta, and Henry at sea. The last child, Grace, was born back in John and Tersa’s native England. Do you see what probably happened? Apparently, he took his entire family on ship. Eventually they went back to England before Grace was born. From there they retired in Seattle at a time when many from England were going there.

Jno Flynn and family in the 1900 census courtesy Ancestry.com

4. Tell the story

Lisa showed us a case study about Leland Wright who appears in the 1930 U.S. census in Miami, Dade, Florida with his family, Leath, Juanita, Cora, George, and Roy. If I recall correctly, the case study arose out of what initially appeared to be a simple question from a user. And if memory serves, the question was: What ever happened to Leland? What appeared to be a straightforward question evolved into a tale so fascinating, Lisa is writing it up. Watch for the story coming soon to the Ancestry Blog.

Leland Wright and family in the 1930 U.S. census, courtesy Ancestry.com

5. Purpose + Audience = Project

Lisa quoted from the results of an Emory University study. “Children understand who they are in the world not only through their individual experience, but through the filters of family stories that provide a sense of identity through historical time,” says the study. (See “Children Benefit if They Know About Their Relatives, Study Finds,” Emory University [http://www.emory.edu : accessed 15 August 2015], path: News & Events > News Releases > 2010 > Archives > March. The link to the paper is no longer functional. See a PDF copy at “History Relevance Campaign,” Public History Commons [http://publichistorycommons.org/history-relevance-campaign : accessed 15 August 2015], hotlink titled “‘Do You Know…’ The Power of Family History in Adolescent Identity and Well-being.”)

6. Share the Story

Lisa had a video conference with the great-grandson of James Wright and shared the documents and what she had learned about the Wright family. The results were pretty touching. Watch for Lisa’s blog article to hear the rest of the story.

To ask for a copy of the flyer from Lisa’s class, write conferences@ancestry.com. Some Who Do You Think You Are? episodes are available for free on the WDYTYA website. Seasons four, five, and six are available for purchase on YouTube or iTunes.

Wednesday, August 19, 2015

Lisa Elzey Talks WDYTYA and Story Telling - #BYUFHGC

“A lot of the questions I get about my job are about how we do the research for WDYTYA,” said Lisa Elzey at the 2015 BYU Conference on Family History and Genealogy. In her presentation, “Ancestry.com: How the Records Tell the Story,” she not only shared some of the details, she shared how we could apply the principles in our own research. Lisa explained the process which Ancestry employees lovingly call the “Who Do You Machine.”

Ancestry.com uses a "Who Do You Machine" to crank out an episode of WDYTYA.

Lisa Elzey teaches a session at the 2015 BYU Conference on Family History and Genealogy.

Casting is not done by Ancestry, but by Shed Media and The Learning Channel. Some stars come through referrals. You may have noticed that sometimes when a star is featured, you’ll see a costar or a friend in a later episode. For example, Kelly Clarkson is the daughter-in-law of Reba McEntire.

Some celebrities know a lot about their ancestors and some know very little. Once stars are selected we start building their tree, Lisa said. They use all of the basic records that can frame a story. Notice on the machine diagram, below, that some stars fall out. Sometimes it is because of scheduling. Sometimes they’ll come back later. Sometimes the research doesn’t get past a certain point.

After we get a solid foundation of a tree, we start exploring it, Lisa said. We look for compelling stories, such as Christina Applegate’s story about her father. “Beautiful episode,” Lisa said. Once we’ve found what we think is a compelling story, we start crafting the story together. To fill 42 minutes we need about 17 documents.

Once the story is done, then we film, she said. This can be tricky if the star is in a current project.

When that is done, you get the awesome show, Who Do You Think You Are?

You can use the same model as the Who-Do-You Machine to tell your own story.

1. Do the research.

  • Use primary source material whenever possible to authenticate your story. It’s like the difference between fresh peas and green pea soup.
  • Use census records. They create an arc of an individual’s life. They give you potential story clues. Plus, they are easy to find and use.
  • Use birth, marriage, and death records. They help establish relationships and give you even more potential story clues.
  • Then take a deeper dive. Use records such as pension files, newspapers, city directories, grave stones, deeds, probate (Ancestry has a huge collection coming out quite soon), histories, [and many more that I didn’t write down quickly enough].
  • Research complete families. It is like having trees versus poles. Everything that happened in that family affected your tree. I have found amazing stories about my ancestors by researching their entire families, Lisa said.

2. Gather and organize your information.

You will hurt your ability to find stories if you aren’t building a family tree. You also need to keep a research journal and keep a simple documents folder. Adopt a naming convention for images, such as: surname-first name-birth year-document year-document type. If you’ve inherited a messy stack of research, start over, realizing you’re not starting from scratch. Learn about and use the Genealogical Proof Standard. I recommend Tom Jones’s book, Mastering Genealogical Proof, she said.
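If you keep your images on a computer, the naming convention above is easy to automate. Here is a minimal sketch; the function name `image_file_name`, the lowercasing, the stripping of spaces and punctuation, and the default `jpg` extension are my own assumptions, not something Lisa specified. The sample person is made up.

```python
import re


def image_file_name(surname: str, first_name: str, birth_year: int,
                    document_year: int, document_type: str, ext: str = "jpg") -> str:
    """Build a name in the order surname-first name-birth year-document year-document type."""
    def clean(text: str) -> str:
        # Lowercase and keep only letters and digits, so the hyphen
        # appears only as the separator between the five parts.
        return re.sub(r"[^a-z0-9]", "", text.lower())

    parts = [clean(surname), clean(first_name), str(birth_year),
             str(document_year), clean(document_type)]
    return "-".join(parts) + "." + ext


# Hypothetical example: Jane Doe, born 1875, in a 1900 census image.
print(image_file_name("Doe", "Jane", 1875, 1900, "census"))
```

Because every name starts with the surname, given name, and birth year, sorting a folder alphabetically groups each person’s documents together in date order.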

Lisa uses the Ancestry Shoebox app. When she visits a relative, if she sees a photograph she doesn’t have, she takes a photograph. It’s easy to add notes and attach it to your tree.

The new Ancestry website has a new Sources column. It makes sources easy to see. Clicking a fact shows visually what sources are attached to that fact. It also has a new Notes tool panel. It’s helpful for abstracts, journaling, and many other functions.

As an aside at the beginning of her presentation, Lisa mentioned the What’s New or Updated collection list on Ancestry.com. At the top of the list was “U.S., Social Security Applications and Claims Index, 1936-2007.” A nice thing about the page is that along the side it lists what collections are coming up soon. Keep going back, because Ancestry is constantly adding new records.

Tomorrow I’ll continue my article about Lisa’s presentation.