Thursday, October 23, 2008

Ancestry's Ranked Search

-- Updated 9-November-2008 with my apologies --

GNW writes,

I don't like any of the searches at Ancestry.com.  It takes too much time to weed through all the results that have nothing that connects to your search.  If you put in a name, dates, family members and they lived in that same county and state all of their lives, married there, and then died there, why should they start out with people who lived 1,000 miles from that location and was born 30 years after that person died?  That is unforgiveable [sic] and simply put, STUPID.

Let me put together the reasons why this happens and tell you if something is being done about it.

Everyone needs a good search strategy

Ancestry's Relevance Ranked search works pretty much under the same assumptions as television's Dr. House:

  • Everybody lies
  • Everybody screws up

(Before I get into my discussion of ranked searching, let me say that if checking the Exact search box in the new search interface doesn't work as expected, you need to inform Ancestry.com. Find a current discussion on New Search on the official Ancestry.com Blog and leave a comment.)

The faulty world

Put in the context of genealogical research, Dr. House's philosophy translates to, "take nothing for granted." Take, for example, a census record. Somewhere on any given page of the census you can find, with 95% certainty, at least one of the following faults:

  • The census forms, questions or process gathered imprecise or ambiguous information.
  • The respondent gave the enumerator incorrect information or avoided him altogether. Concepts of exactness in spelling and dating have not always been as strict as today, so the spelling of names could vary wildly. Neighbors were sometimes called upon to give information for those not at home. Respondents sometimes gave information for far away relatives they feared might not be counted.
  • The enumerator wrote down incorrect information or didn't record everything and everyone that he was supposed to. Sometimes fraudulent names and data were added.
  • Often, a second copy of each census schedule was hand copied, introducing inadvertent errors. Sometimes, these copies are all that have survived for use today.
  • While using the census records for their original purposes, names and information were overwritten, making some information illegible, some inconsistent with other information on the page and some incorrect.
  • The census records were not always properly conserved and might no longer be legible or even extant. As ink fades, the lighter strokes of cursive handwriting can change the apparent spelling of names and places. Some were microfilmed out of focus and then the originals destroyed.
  • The information on the census was incorrectly abstracted (i.e., extracted or indexed). Or one or more names or pages were skipped. Sometimes information vital to the interpretation of a census entry was written outside the normal fields or the abstraction software was not capable of capturing it.
  • The electronic search index includes errors making some records impossible to find. It might exclude some names or groups of names. Sometimes information is incorrectly indexed because of faulty standardization or handling of abbreviations, names, dates and places.
  • Sometimes you, the user, make typographical errors when typing information into search forms. And sometimes the targets of our searches show up in unexpected times and places.

A similar list can be produced for other types of records. Simply put, people screw up. A good searcher takes each of these errors into account and devises a search strategy accordingly. Have you ever used a successive term-dropping round-robin search to find a misindexed name? (Drop the first name, then the middle name, then the last name.) Have you ever used the successive term-dropping technique to find a person when you only had a vague guess about their location? But strip away the romance of performing dozens or hundreds of searches for one target record and the search strategy is pretty consistent. And pretty repeatable. And pretty mundane.
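For readers who like to see such things spelled out, here is a rough Python sketch of the successive term-dropping idea. It is only an illustration under my own assumptions: the field names, the sample records and the exact-match `search` helper are all invented for the example and have nothing to do with how Ancestry.com actually stores or searches its indexes.

```python
from itertools import combinations

def term_dropping_queries(terms, max_dropped=1):
    """Yield the full query first, then every query with one term
    dropped, then two, and so on up to max_dropped terms."""
    keys = list(terms)
    for n_dropped in range(max_dropped + 1):
        for dropped in combinations(keys, n_dropped):
            yield {k: v for k, v in terms.items() if k not in dropped}

def search(records, query):
    """Hypothetical exact-match search over a list of dict records."""
    return [r for r in records
            if all(r.get(k) == v for k, v in query.items())]

def find_with_term_dropping(records, terms, max_dropped=1):
    """Return (query, hits) for the first, strictest query that finds anything."""
    for query in term_dropping_queries(terms, max_dropped):
        hits = search(records, query)
        if hits:
            return query, hits
    return None, []

# Invented sample data: the surname was misindexed as "Smyth".
records = [
    {"first": "John", "last": "Smyth", "county": "Kent"},
    {"first": "Mary", "last": "Jones", "county": "Kent"},
]
wanted = {"first": "John", "last": "Smith", "county": "Kent"}

query, hits = find_with_term_dropping(records, wanted)
print(query)  # the loosened query that finally found something
print(hits)
```

The full query finds nothing, dropping the first name finds nothing, and dropping the last name turns up the misindexed record. That is the same round-robin a patient human performs by hand, just without the patience.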

The ideal world

Wow! That's exactly what computers do better than humans. Lots and lots and lots of redundant tasks. So let's program the computer to do the ideal search strategy for us. I'm talking about the ideal world here, for a moment. Neither Ancestry.com nor anyone else has it right... yet.

Don't make me try all the nicknames, or even trust me to know or remember them all. Don't make me study out all the common name spellings. Don't make me study historical linguistics to find out how German pronunciation would affect phonetic name spellings. Let some expert somewhere do it once and let us all benefit from it. Don't make me explicitly search the census for family members to try and find my guy. The computer has my tree; do that search for me. Don't make me do successive term-dropping to account for the faults from the list above. Do it for me. Don't make me figure out every different name that a location was ever known by. Look them up and try them all for me. Hey, and while you're at it, can you account for common transliterations and other typos?

The real world

I'm happy to announce that Ancestry.com has been working on just such a feature for several years now. Some of the kinks are worked out. Some are not. It is called Relevance Ranked searching.

  • The reason you get results 30 years after the death date is because the death date you entered might be wrong or the death date on results listed might be wrong.
  • The reason you get results 1,000 miles away is because a location might be wrong.
  • The reason you get results with different names is... well you get the picture.

So it is entirely normal to get results that don't match all of your criteria. That is by design. It is entirely normal to get way too many results. They are sorted from best to worst. Look through the results until your superior brain says, "I've reached the point where the quality of the results is less than what I am willing to wade through." Then let your superior brain zero in on a particular record collection or database. Or change the search criteria. Click the exact box on selected items. Then try another search. Gradually release the autopilot and take greater control of the search. But do it after you've let the ranked search take its best crack at it.

Ancestry.com has stated that they think their current algorithm has a big problem: it ranks results by how many search terms match but doesn't penalize non-matches. Kendall Hulet discussed that here and Anne Mitchell brought it up again in this comment. Will they be able to fix this problem?
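To make the difference concrete, here is a small Python sketch contrasting the two scoring ideas. It is a toy under my own assumptions, not Ancestry.com's algorithm: the field names, the sample records and the penalty of one point per contradiction are all invented for illustration.

```python
def match_count_score(query, record):
    """Count the fields that agree; disagreements cost nothing."""
    return sum(1 for k, v in query.items() if record.get(k) == v)

def penalized_score(query, record, penalty=1.0):
    """Reward agreements and subtract a penalty for fields that are
    present in the record but contradict the query. Fields missing
    from the record are treated as neutral."""
    score = 0.0
    for k, v in query.items():
        if record.get(k) is None:
            continue  # nothing recorded, so neither a match nor a conflict
        score += 1.0 if record[k] == v else -penalty
    return score

def rank(query, records, scorer):
    """Sort records from best to worst according to the given scorer."""
    return sorted(records, key=lambda r: scorer(query, r), reverse=True)

# Invented example: what the searcher typed in.
query = {"first": "John", "last": "Smith", "spouse": "Mary",
         "state": "Delaware", "birth_year": 1760}

records = [
    # Three fields agree, but the place and time are far off.
    {"id": "A", "first": "John", "last": "Smith", "spouse": "Mary",
     "state": "Illinois", "birth_year": 1900},
    # Only two fields recorded, and both agree.
    {"id": "B", "first": "John", "last": "Smith"},
]

by_count = [r["id"] for r in rank(query, records, match_count_score)]
by_penalty = [r["id"] for r in rank(query, records, penalized_score)]
print(by_count, by_penalty)
```

Counting matches alone puts record A on top with three agreeing fields, even though it contradicts the query on state and birth year. Once contradictions cost something, the sparse but consistent record B moves ahead. How heavy the penalty should be, and whether a near miss on a date should cost less than a wild one, is exactly the hard part.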

What does your brain do differently when it says, "poppycock, that's not a match!" versus "There he is! In Kansas?" If they can figure that out, then they can fix this problem.

Dear GNW

I hope that explains why you get ranked results that don't match the input criteria. As you can see, that is sometimes good and sometimes bad, and, as I mentioned, Ancestry.com has plans to improve this.

Give me the "name, dates, family members" that you typed into the search form. You said they "lived in that same county and state all of their lives, married there, and then died there." If I understand you correctly, you say that the very first results "start out with people who lived 1,000 miles from that location and [were] born 30 years after that person died." Send me the example and I'll make certain it gets to the right people.

Oh, and please don't read through all 24,521 results of a ranked search. When you get that many results in Google you say, "Wow! Google's awesome." But you don't try every single result.

Lastly, I'd like to remind everyone that providing Ancestry.com with detailed, actionable examples is essential to communicating your complaints. Above all, avoid unfounded emotionalism as it distracts from the real problems in New Search.

Thanks,
-- The Ancestry Insider

4 comments:

  1. Dear Insider,

    Your posting of Oct 23rd, "Ancestry's Ranked Search" has my Irish up, so here is my reply to your challenge to "Put up or shut up." I'd be obliged if you post it in its entirety.

    You want details? Below (lightly edited & updated) is a summary of relevant New Search discussion threads that I put together on September 3, 2008 for one of product manager Anne Mitchell's posts on the Ancestry blog.

    Follow the hyperlinks, read through the comments, and you will find copious, detailed, explicit observations and examples of the inefficient, illogical and inept Ancestry New Search engine. For more recent discussions, see any of the more recent blog threads moderated by Ms. Mitchell.

    And by the way, we ARE paying attention, and it's not US that looks stupid. New Search has seriously flawed search concepts and search engine algorithms and even Ancestry—finally—admits it. On October 3, Anne Mitchell commented in her most recent thread http://blogs.ancestry.com/ancestry/2008/10/10/hot-keys-in-the-new-search-user-interface/#comments :

    "You can think of search as being two pieces: 1) the algorithms that find who you are looking for and 2) the interface you use to tell the search engine who you are looking for. Often, #2 looks like lipstick on a pig. But we continue to work on it. BUT, we are also currently working on #1 [the search algorigthms] as well. In fact, we just brought on a couple of new engineers to add to our current team just to work on our algorithms. Now this stuff tends to be unpredictable — sometimes the best ideas in practice are bad, and sometimes they hog too many resources. So I can’t tell you when you will start to see the fruits of these labors…but they are coming. And while all of this is being done, we are also continuing to work on #2, or how you tell the engine who are looking for. So look for some smaller changes over the short haul; and some bigger ones as time goes on."


    Some of us have been commenting—in detail—on this issue since March, 2008. If you are serious in discussing the issues involved—rather than insulting experienced, committed and frustrated Ancestry subscriber/researchers who have substantial, documented complaints about New Search—then read the 400+ comments on the links below and get back to me and your other readers. (Not interested in reading all the posts? Focus on those from Jerry Bryan, Tony C, Jade, Carol and myself, among others, for starters.)

    New Search is deeply flawed, and your readers deserve better than your Oct. 23rd snark.

    Disappointedly,
    —Reed

    P.S. For what it's worth, the New Search GUI has been made somewhat less annoying by the addition of several useful hot-keys. (Which actually work!) Follow the link above for details.

    ***********************************

    Anne,

    […] I thought it might be productive to step back and look at the fundamental concept: Is Ancestry’s New Search an improvement over its predecessor and does it work well?

    From time to time you (and other Ancestry representatives) have asserted that New Search is indeed an improvement and will eventually replace Old Search completely. Further, it has been said that Ancestry’s research shows most patrons are pleased with New Search and that it meets their needs better than Old Search.

    Since I don’t agree—and I don’t have access to your survey data—I decided to go through some of the relevant blog threads from the last several months and see if there are any trends. Now I admit that this is not a scientific survey, but it does represent an important Ancestry constituency, namely Ancestry.com users who (1) use the product often (2) care about the quality of their genealogical research and (3) take the time and energy to follow the relevant blog threads and post detailed, carefully considered comments.

    So here are the results. I give the subject, URL and date of the blog thread, and the total number of readers’ posts. I note the number of responses from Ancestry company representatives (not including any introductory essay or posts). Finally, I made a subjective judgment to qualify the customer/blogger responses to New Search. I didn’t want to spend all of my free time on this, so I used these simple criteria:

    •POSITIVE reader responses may have some suggestions about improving New Search, but the general tone is “I like New Search better than Old Search and I get better results.”
    •NON-POSITIVE reader responses indicate mild to passionate dislike for New Search and/or focus on problems with New Search that need correcting. May include passing positive remarks.
    •MISC. reader responses include blog posts that
    (a) discuss the merits of various methods of search but not mention New or Old search specifically or
    (b) do not refer to the blog thread topic or that are
    (c) too ambiguous to call Positive or Non-positive.

    From May 29 to Sept. 3, 2008, Ancestry posted 9 blog threads related to New Search.
    •Total readers’ posts: 473
    •Total POSITIVE readers’ responses: 12
    •Total NON-POSITIVE readers’ responses: 381
    •Total MISCELLANEOUS readers’ responses: 64*
    •Total Ancestry responses: 23

    *re MISC. readers’ responses: a substantial number REALLY didn’t like New Home Page, FYI.

    Details follow below.

    Statistically yours,
    —Reed

    P.S. I worked very hard to avoid math and statistics classes in school (and largely succeeded). Please feel free to check the math and tally the original blog responses.
    —R.

    P. P. S. It seems all my formatting indents and underlinings have disappeared when pasted into this comment box. My apologies if you find the layout difficult to read.
    —R.

    ****************************

    “Finding Levi S. Baker…”
    Aug. 25–4 Sept. 2008 (open thread)
    http://blogs.ancestry.com/ancestry/2008/08/25/finding-levi-s-baker-or-how-to-use-new-search-interface-to-find-him/
    Total readers’ posts: 69
    Ancestry responses: 4
    Positive: 0
    Non-positive: 69*
    Misc: *some of the 69 posts counted as “non-positive” are not necessarily critical of New Search, and may focus more on the technical/philosophical questions of what an “Exact Search” should represent; they are certainly not “Positive,” however.

    “Specific Database Search: Old Search UI vs. New Search UI”
    http://blogs.ancestry.com/ancestry/2008/08/18/specific-database-search-old-search-ui-vs-new-search-ui/
    Aug 18–Aug 25, 2008
    Total readers’ posts: 77
    Ancestry responses: 9
    Positive: 1
    Non-positive: 60
    Misc.: 7

    “The New Search Interface”
    http://blogs.ancestry.com/ancestry/2008/08/04/the-new-search-interface/
    Aug 4–Aug. 25, 2008
    Total readers’ posts: 78
    Ancestry responses: 5
    Positive: 0
    Non-positive: 67
    Misc.: 11

    “Let’s Talk About Search!”
    http://blogs.ancestry.com/ancestry/2008/08/01/lets-talk-about-search/
    Aug. 1–Aug. 6, 2008
    Total readers’ posts: 47
    Ancestry responses: 2
    Positive: 2
    Non-positive: 40
    Misc.: 5

    “Announcing New Search Webinar”
    Jul. 18–Aug. 2, 2008
    http://blogs.ancestry.com/ancestry/2008/07/18/announcing-new-search-webinar/
    Total readers’ posts: 46
    Ancestry responses: 0
    Positive: 2
    Non-positive: 32
    Misc.: 12

    “Continuing the Dialogue about the New Search Experience”
    http://blogs.ancestry.com/ancestry/2008/07/11/continuing-the-dialogue-about-the-new-search-experience/
    July 11-Aug. 9, 2008
    Total readers’ posts: 59
    Ancestry responses: 0
    Positive: 4
    Non-positive: 39
    Misc.: 11

    “The Ancestry Insider Discusses New Search”
    http://blogs.ancestry.com/ancestry/2008/07/08/the-ancestry-insider-discusses-the-new-search/
    Jul. 8–25, 2008
    Total readers’ posts: 12
    Ancestry responses: 0
    Positive: 1
    Non-positive: 10
    Misc.: 1

    “More on Ancestry.com’s New Search”
    http://blogs.ancestry.com/ancestry/page/4/
    Jun. 29-Aug. 2, 2008
    Total readers’ posts: 48
    Ancestry responses: 3
    Positive: 2
    Non-positive: 30
    Misc.: 14 (including many complaints about New Home Page)

    “New Search is Available for Everyone”
    http://blogs.ancestry.com/ancestry/2008/05/29/new-search-is-available-for-everyone/
    May 29–Jun. 30, 2008
    Total readers’ posts: 37
    Ancestry responses: 0
    Positive: 0
    Non-positive: 34
    Misc.: 3

  2. Thanks to Reed for emphasizing the scope of user views of New Fuzzy search as on Anne's blog.

    google/blogger still will not allow posting through its id/interface. . . grrr (engineers working on this for days now)
    Insider, you don't want to try to find the test cases on the various blogs about New Fuzzy search interface?

    Try these and see what you get.

    1) William Poynter or Pointer
    born about 1725 plus or minus 5 (eldest known child for whom a rough birth-year is retrievable was born probably ca. 1748)
    Estate probated 1781, Sussex Co, DE Court of Common Pleas

    As ranked search in New Fuzzy interface, surname 'Poynter' yields technically correct but useless Tree entries (where information is accurate it is lifted from my research and posts), as match, with note that following myriad listings are less likely to be this man but could be helpful.

    As ranked search in New Fuzzy interface, surname changed to 'Pointer' yields as best match the Cook Co, IL death index for 1908-1988.

    Neither surname spelling retrieves the sole pertinent item in Ancestry's database for this person in de Valenger's Calendar of Sussex County, DE Wills and Estates.

    2) Ranked search in New Fuzzy interface, Daniel Davis
    b. 1760 plus or minus 5, place Sussex County, Delaware (I want it to retrieve one of the four actually relevant items in the Ancestry database, a garbled-by-contractor-and-ancestry version of a Marriage Bond)
    d. 1816 plus or minus 2, place Monongalia County, West Virginia

    With results sorted for relevance there are these initial "Matching" items:
    a) Family Tree lifted from data I have researched and posted or that others have posted from my research shared with them
    b) Photo of Davis family crest (of course people b. in America do not have any legitimate heraldic crest, and Ancestry programmers believe same surname = same family)
    c) Entry from American Genealogical-Biographical Index. This is relevant to this person, although the actual entry (which Ancestry does not have) is factually incorrect - it is an erroneous interpretation of an 1804 muster roll that was published in DAR Magazine.
    d) Public Member Stories, entitled "Will" attached to one Caleb Davis (1746-1820). This Caleb Davis did have a son Daniel named in the will, but not this Daniel Davis. Actually relevant to this person in the opinion of those who make this association, albeit factually incorrect.
    e) Private Member Stories entitled "Daughters of the American Revolution". Not relevant to this person
    f) Public Member Photos, another Davis Family Crest

    The rest of the entries 'sorted by relevance' as far as I am willing to look include none of the other three items in the Ancestry database that are actually relevant to this person by first-name, surname, time and place.

    Sorted by Category of Record, after tedious paging through all the entries ordered by number of hits, in the 'Court, Land Wills, Financial' category I can find at the end the entry for this person in Johnston's mis-titled _West Virginia Estate Settlements . . . to 1850_, which has all of the 'Davis' entries highlighted. This database appears to have been indexed by keyword instead of by first-name/surname. Possibly the locations were not indexed. At least New Fuzzy interface now retrieves this item. It didn't used to.

    New Fuzzy interface does not find these relevant items by name, dates and places:

    I) 1789 marriage bond in Delaware Marriages database, extracted by contractor from a Delaware Public Archives card file and called Marriage Record (which it is not) and of course all the items from this database omit the source information given in the card file.
    II) 1810 US Census entry for a 'Dan. Davis' in Monongalia County, (West) Virginia

    It does, however, retrieve millions of items completely irrelevant as to full name, time and place, nearly all of which are listed *before* the relevant items found when the search is sorted by 'record group'.

    Exact Search for both people returns no results.

    I hate it that Trees are called Records and grouped with published books. County Histories can be valuable sources where there are first-hand accounts.

    I don't normally do global searches, and will not recount what happens with Old Search concerning the above two persons. The core problem here is that Ancestry intends to eliminate Old Search interfaces entirely, including specific access to databases via Card Catalog (for which the New Fuzzy version is useless and if you are not careful you get into a wormhole). It badly wants to reduce the number of search interfaces without fixing the chaotic array of poor and partial database indexing. It badly wants to eliminate the very useful items such as easy-to-scroll-through lists of Census databases and Military Records databases - and, yes, the easy to use key-word searchable Card Catalog format. Ancestry wants to do this without making fundamental changes to the creaky old search engine platform. It badly wants to go to the New Fuzzy interface page format so it can add more advertising.

    So the key problem is to make New Fuzzy interface at least as useful as Old Search. On the cheap. If it can't be fixed, Ancestry will do all the above anyway, and required search time for customers will increase by one to three decimal points. Perhaps it has occurred to someone to charge for access by the hour instead of by month or year.

    It is not a good sign that the New Guy In Charge comes from eBay, which has a very good Boolean-enabled search engine, yet is trying to force users to go to a new 'best match' results format instead of such options as 'newly listed', which is what frequent searchers for particular types of items want.

  3. A.I., you really do need to get back on your meds. Your condescending, I'm-smarter-than-everybody-else, "you shut up - no, YOU shut up" attitude doesn't play well with your readers.

    And if you're going to claim that it's all tongue-in-cheek, maybe you should consider one more job change. The folks at Saturday Night Live might appreciate your demeaning and degrading humor. It certainly doesn't wear well on a Church employee.

  4. Thanks so much for implying that everyone who doesn't like the new search is just being stubborn and couldn't possibly have real, relevant and meaningful reason to back up their opinion. Like a previous commenter stated - go look at the ancestry.com blog on the subject. There are a LOT of very detailed examples of what doesn't work.

    And no, I'm not giving a detailed example - because there are a lot more folks who've already done so, and done a very good job at it. Go read those and respond before you go off on everyone.

    Honestly, on a subconscious level your prior employment with ancestry might be influencing your view. You know how the company thinks and plans. You view things from their side - which is fine - that's what this blog is about. But, it doesn't seem like you know how to relate to the customer side anymore. You want to force something on "your" customers that they don't want, patting them on the head and saying "we know what's best for you, you silly little things." Well, you don't - so stop lecturing and start listening.
