Wednesday, November 18, 2015

Bad Merges in Family Tree

Two cars merge badly in an intersection.

I was affected again by a bad merge in FamilySearch Family Tree. Someone had merged a Tilford in Virginia with a Telford in Ireland to produce a monstrosity of a person with a gaggle of children. Mysteriously, this man bounced back and forth between Virginia and Ireland, using the Tilford spelling for all children born in Virginia and the Telford spelling for all kids born in Ireland. Hmmm. What an odd fellow.

I can’t watch all my relatives so I didn’t know the merge had occurred until it was too late. Rather than undo the merge, people had come in and made various repairs. It was no longer possible to undo.

I have often spoken of the indiscretions of FamilySearch’s past, accusing them of bad automated merges. Perhaps because of this I’ve received a message on the topic from one of FamilySearch’s engineers, Randy Wilson.

“Machine merging was done very carefully [when the New FamilySearch tree was created],” he said. “[There is] a lot of empirical evidence showing that errors were less than 0.5%. There have been a lot of bad merges done in the system, to be sure. But almost all of the bad merges I have seen (and I've seen a lot of them) have been caused by users, not the machine.”

I hope he’s right. He’s one of the brightest fellows I know. (And he’s a relative and I’d like to believe intelligence runs in the family.) But there’s something that gives me pause: Ancestral File.

My perception was that Ancestral File was a mess, and I thought the first release was especially messy. Users couldn’t make any changes, good or bad, so only FamilySearch can be blamed. I think my perception was shared by many genealogists.

“Yes, AF had some bad merges, too,” Randy said. “The actual number wasn't as bad as the user perception, because common ancestors happened more and got merged more and common ancestors also get seen by users more, because, well, they're common. So you get a ‘squaring effect’ of the bad merges being especially visible there.”

Are the software engineers of today somehow smarter than the engineers of the 1990s? Randy Wilson said:

There was, in fact, a big difference between the merging algorithms used for Ancestral File…and those used in New FamilySearch. The former used "probabilistic record linkage" with simple field-based features, using parameters derived from statistics on a few hundred labeled pairs of records. The latter used far more advanced neural networks with 80,000 labeled pairs of records, with 20,000 more used to verify accuracy and select thresholds. I don't know if the engineers are "smarter" than in the 1990s, but the algorithms used sure are, and the engineers did have more extensive training in machine learning than before. In particular, we had a few Ph.Ds with a background in machine learning and neural networks working on it this time (me, Dallan Quass and Spencer Kohler).
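For readers curious what "probabilistic record linkage" with simple field-based features can look like, here is a minimal sketch in the classic Fellegi-Sunter style. It is not FamilySearch's code: the field list, the m/u probabilities, the threshold, and the given name and birth year in the sample records are all invented for illustration. Each field adds a positive weight when two records agree and a negative weight when they disagree, and the pair is merged only if the total clears the threshold.

```python
import math

# Hypothetical per-field parameters (NOT FamilySearch's actual values):
# m = P(field agrees | records are the same person)
# u = P(field agrees | records are different people)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "birth_year": (0.85, 0.02),
    "birth_place": (0.80, 0.10),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum the log-likelihood-ratio weights over fields present in both records."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value adds no evidence either way
        if a == b:
            total += math.log2(m / u)  # agreement: positive weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement: negative weight
    return total


def should_merge(rec_a, rec_b, threshold=8.0):
    """Merge only when the total weight reaches the (assumed) threshold."""
    return match_weight(rec_a, rec_b) >= threshold


# Illustrative records; the given name and birth year are made up.
virginia = {"surname": "Tilford", "given_name": "John",
            "birth_year": 1750, "birth_place": "Virginia"}
ireland = {"surname": "Telford", "given_name": "John",
           "birth_year": 1750, "birth_place": "Ireland"}

print(should_merge(virginia, ireland))
```

With these made-up numbers, the disagreeing surname and birthplace pull the Tilford/Telford pair below the threshold, so the two men would not be merged. A real system would also need fuzzy name comparison and parameters estimated from labeled pairs, which is where the few hundred (or 80,000) labeled record pairs Randy mentions come in.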

Frankenstein genealogy results from incorrectly combining records

Is there something about genealogical data that inherently leads to incorrect merges? Perhaps the ramifications of a merge depend on how clean the data is to begin with.

I can imagine two identical pedigrees, each sprinkled with Frankenstein monsters or extra generations or random pedigree errors. (See my article, “Frankenstein Genealogy.”) A machine algorithm is let loose on the two pedigrees. I can imagine the machine could become confused or react in unexpected ways.

At the very best, the machine cocks the gun. Then some naïve genealogist comes along and pulls the trigger.


Image credits:
       shuets udono, photograph of auto accident of two cars in an intersection in Japan, Flickr (https://www.flickr.com/photos/63522147@N00/408633225 : accessed 18 October 2015). Used under CC BY-SA 2.0 license.
       “Frankenstein Genealogy,” The Ancestry Insider (http://ancestryinsider.org : 23 March 2011).

16 comments:

  1. I'm going with "C" as the answer. Computers are not perfect. Humans really are not perfect. When humans do not pay attention to details it becomes a mess. When humans try to "fix" without understanding "how" the mess becomes bigger.

  2. Sometimes it takes hours to undo a bad merge. One thing I would like to add, since many hands have stirred the pot, don't be too quick to blame the last one in line. I just got chewed on by someone because another person merged my New York Ancestor UNKNOWN with someone in Georgia. Like you, I don't watch all my ancestors and this came as a surprise to me. As a peaceable person, I only growled a little and spent some time sorting it all out. (They sure could make the changes list a little more user friendly!) I'm back to Ancestor UNKNOWN and Georgia is back in Georgia.

  3. Ancestry makes the same bad merges and I don't understand how to undo this mess? Carl

  4. At one time I was incorrectly linked to Mary Jones (LHC6-STW) but I have since put a watch on her to witness the insane activity of this poor departed soul. This is a mess that no one will ever fix.

  5. Interesting article. I have not had to deal with bad merges (thankfully), but have seen them when working with members. I do believe that algorithms are built by engineers. Therefore, the issues that we had do not seem to be going anywhere.

    I believe that if members understand that it is not a race to the temple, and to stop calling those they are taking 'names', we would have less of an issue with bad merges.

    Very much enjoy your articles.

    Replies
    1. You are an enlightened individual. FINALLY someone else is willing to say, "THE EMPEROR has NO CLOTHES ON!" When the LDS church takes on a serious stance about teaching their members HOW to search for 'names' not 'just do it' there will be much better results. I speak as a volunteer at a nearby Family History Center who has watched this happen repeatedly with patrons who we have the responsibility of 'training the children' for the PARENT. Good luck until then...

    2. Start a club and I'll join it :) I do wish more people had the realization that Family History is more and requires more than a handful of hours, the click of a few buttons, and a single record.

  6. I have seen some completely insane trees in my time--I am especially fond of whomever decided that it was reasonable to put "Canada, Indiana, USA" as the birthplace--still unknown to me--of my six-times great grandfather who was born in the 1740's or so--long before either Indiana OR the USA even existed, and I am still scratching my head about the "Canada". So the thought of being on a communal tree with other people making the decisions about what's reasonable is simply chilling! "Naive" is one word for it. "Ignorant" works sometimes. So does "completely dopey". Though that is two words.

  7. Neural networks need a correct set of data to train on. If they train on a dataset with embedded errors, they will end up producing matches like the incorrect ones they trained with.

    It is an art, not a science, to create a good neural net. Trained people are still better at matching than the best neural networks.

  8. This is why I do not use family search tree. Much prefer Wiki tree because no one can add or change anything without my permission.

  9. I won't make a family search tree. I keep mine at ancestry.com. Nobody can make edits to it unless I give that specific person editing rights. I tried familysearch tree. I got that "person" you pictured. And then it morphed into children of that person. I took my tree down. Never again. It's hard enough having a clean tree without people who think they know what they're doing "fixing" it.

  10. This is why I do not put trees online. People make a mess of them. I only share with someone who contacts me with a serious interest.

  11. I've had to clean up several bad merges, all done by users (not automated), and I've finally given up on FamilySearch Family Tree. As several commenters pointed out, many users seem to be poorly educated on genealogical methods. I wonder if some of the merges weren't done using a program like RootsMagic, and the users just didn't bother to review the existing sources, notes, and discussions. If I use a "Borg" family tree at all, it will be one like Geni.com, which is at least curated on the older, more common profiles.

  12. This comment has been removed by the author.

  13. FS-FT now has a lot of folks plucking stuff from other trees and installing nonsensical vitals. I just finished correcting a cluster where a person was marrying in Germany while having children born/baptized in Delaware. Some creative individual had changed dates and places, added errors to a daughter's marriage and muddled the supposed father's life-path almost beyond recognition. The individual said all this was a result of "further research," none of which was specified. Alas.

  14. I just received the book Unofficial Guide to familysearch.org and noticed in there that ANYONE can change your tree there...I really find that hard to believe...why would they let just anyone make changes to your database...that does not make sense to me...with ancestry.com getting ready to make us ALL use the NEW version on 15 Dec even though there are still many features that don't work, it makes me ask "what are all these people thinking"?

    Yesterday I tried to merge people in MY tree in NEW and the feature did not work, I had to go back to OLD to do it...I tried to do something else in NEW (I forget what it was) again I had to go back to OLD to do it.

    Every time you add a Census or any other document on ancestry their system automatically puts in the name as an alternate name even though it 100% matches the name already in there.

    These folks are messing with OUR data!!!

    It makes me wonder if they care if anything is CORRECT...

    I care if my tree is correct and I am tired of various sites allowing all kinds of nonsense. With a tree of 10,000 there is no way I can keep everything offline (I got back to the Pilgrims and usually do about 3 generations of every line...and I document the heck out of people...)

    There is no way I can move my tree off ancestry.com--just too much info to move--but we are paying GOOD money to them only to have a system that is still not working correctly--and they don't give a hoot.

    I LOVE the fact that ancestry has tons of databases...I LOVE the fact that folks are easy to find on familysearch.org.

    BUT I cannot seem to trust either one with my info...
