Wednesday, February 8, 2012

What Does it Take to Get a Good Result from Ancestry.com

Ancestry.com vice president Tony Macklin once told me that genealogists like to have control of their searches. They liked exact search because they knew what the search engine was doing. Ancestry subsequently improved the so-called New Search to give users more control, letting them choose exact, phonetic, or geographic matching, and so forth, for individual search parameters.

But advanced users have still disliked Ancestry’s ranked search results. What was going on under the covers? What determined which matches were listed first? The lack of transparency bothers us. Inquiring minds want to know.

That’s why I enjoyed “What Does it Take to Get a Good Result: The Inner Workings of the Ancestry.com Search Engine” by Ancestry’s John Bacus. We learned more about how ranking is done. Bacus is the principal product manager for Search.

The Ancestry.com simple search form

The Simple Search form, shown above, starts with name, approximate birth date, and a location. If you know an age instead of a birth date, a calculator is available. The location doesn’t need to be the birth location. It can be birthplace, death place, or any place in between. For slightly more advanced searches, use one or more life event or family relative containers.

Relevance, Ranking, and Scoring

Results are ranked and sorted in an attempt to present the most relevant results first. This is done by calculating a score for each result, and listing the highest scoring records first. The more a record matches the information you enter in the search form, the higher its score. It’s like grading a test, where each record is a student’s test and the information you type in the search form is the answer key.

However, not all fields score the same. The last name field scores the most points, followed by first name, locations and dates. All other matches score the least. An exact match scores more than a looser one, such as a phonetic match. However, using the Exact search option doesn’t change the score; it merely excludes records that don’t match.

News to me was this: the more fields you enter, the higher the possible score. It’s not like a test where the score always adds to 100. Instead, the total possible points depend on the number of fields you supply. In mathematical terms, the result is not normalized.

This has an interesting effect. A mediocre match can score higher than a perfect match, if a lot of information is specified in the search form.

Different record collections contain different amounts of information, as if each student were handed a test of a different length. A census collection has a lot more questions than a birth record collection. It’s possible to enter so much information in the search form (which is like the answer key) that a mediocre-matching census record can score higher than a perfectly matching birth record, simply because it has more opportunities to score points.
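To make the test-grading analogy concrete, here is a toy sketch in Python. It is not Ancestry's algorithm. The talk gave only the ordering of the fields, so the point values, the field names, and the sample records are all made up, and partial credit for phonetic matches is left out to keep it short.

```python
# Hypothetical point values. The talk gave only the ordering
# (last name > first name > locations and dates > everything else).
WEIGHTS = {"last_name": 10, "first_name": 6, "location": 4, "date": 4}
OTHER_WEIGHT = 2  # assumed value for any other field, e.g. a sibling's name


def score(query, record):
    """Add up the points for every search-form field the record matches."""
    total = 0
    for field, value in query.items():
        if record.get(field) == value:
            total += WEIGHTS.get(field, OTHER_WEIGHT)
    return total


def max_score(query):
    """Highest score possible for this search form. It grows with every field."""
    return sum(WEIGHTS.get(field, OTHER_WEIGHT) for field in query)


# The search form (the answer key), including three siblings.
query = {
    "last_name": "Smith", "first_name": "John", "location": "Ohio", "date": 1880,
    "sibling_1": "Mary", "sibling_2": "Ann", "sibling_3": "Tom",
}

# A birth record has no sibling fields, but matches everything it does have.
birth_record = {
    "last_name": "Smith", "first_name": "John", "location": "Ohio", "date": 1880,
}

# A census record with the wrong date, but with siblings that match.
census_record = {
    "last_name": "Smith", "first_name": "John", "location": "Ohio", "date": 1882,
    "sibling_1": "Mary", "sibling_2": "Ann", "sibling_3": "Tom",
}

birth_score = score(query, birth_record)    # 10 + 6 + 4 + 4
census_score = score(query, census_record)  # 10 + 6 + 4 + 2 + 2 + 2
```

With these made-up numbers the perfect birth record earns 24 points and the imperfect census record earns 26 out of a possible 30. The birth record never had a chance at the sibling points.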

I’ve always advocated tree-based searching. And I’ve always advocated entering as much information as you can. I still recommend doing that as a first, shotgun search. Then you need to take a rifle shot at specific record types. A census record might list brothers and sisters, but a military draft record doesn’t. If you want to take a shot at a military draft record, don’t enter names of siblings. Does that make sense? You don’t want census records to score higher on the test than draft records.

Search Categories

The problem with entering less information is that you might not enter enough information to uniquely identify your ancestor. What do you do if you wish to find an ancestor in a city directory, a phone book, a voter list, or a yearbook? Entering a common name may return results from thousands of record types in which you have no interest. Some other method must be used to cut down the number of results. One possibility is categories.

Categories are accessible in several places. Among other ways, categories can be used to filter search results. The categories are listed along the left edge of the results. (Below, left.)

The Ancestry.com category filters  The Ancestry.com passenger list category search form

The search form is automatically created based upon which fields are present in the category’s record collections. Passenger lists might include fields such as arrival and departure. (Above, right.)

Sometimes indexed fields are not included in a search form, such as ship’s name in the passenger list form above. Use the keyword field when this occurs.

Advanced Form

“One of my objectives coming out of here,” said Bacus, “is that all of you will use the advanced form.” It has many advantages. “Additionally, it makes you cool.”

[Note to my readers: As I write this, I’m getting pretty sleepy, so don’t be surprised if something hereafter doesn’t make any sense.]

One choice available on the advanced form is Exact. Setting a field to exact excludes records that have different info for the respective field. And it excludes records that don’t have or index the respective field. For example, requiring family members will exclude US censuses before 1880.

A number of options are available for searching with names. Regardless of how many fields are included in the search, the only thing that is required to match is last name. Click on “use default settings…” to change options.

When using Exact name matching with multiple first names, they can match in any order, but must all be present.

Exact scores the most, then similar, then soundex/phonetic.

For dates, dates within a specified range are scored the same no matter where the date is in the range. However, if a date is outside the range, the score gets progressively lower the further it falls outside the range. To give a little extra bump to the score for a particular range, add an Any Event life event container, without exact. However, the more search fields you’ve used, the less the effect.
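Here is a rough sketch of that date behavior. The talk didn't say how fast the score drops outside the range, so the straight-line decay, the point values, and the function name `date_score` are all my assumptions.

```python
def date_score(record_year, target_year, tolerance=2,
               full_points=4.0, decay_per_year=0.5):
    """Score a record's year against a searched year plus or minus a tolerance.

    Inside the range every year earns the same full score. Outside it the
    score shrinks with distance. The linear decay and the numbers used are
    assumptions; the talk said only that the score gets progressively lower.
    """
    low, high = target_year - tolerance, target_year + tolerance
    if low <= record_year <= high:
        return full_points
    # Distance in years from the nearest edge of the range.
    distance = low - record_year if record_year < low else record_year - high
    return max(0.0, full_points - decay_per_year * distance)
```

Under these assumptions, a search for 1880 plus or minus 2 gives 1880 and 1882 the same 4.0 points, gives 1885 only 2.5, and gives 1900 nothing at all.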

Ancestry employs lifespan filtering to remove clearly extraneous results. Specifying just a birth year limits results to –5 and +102 years. Entering just a death year limits dates to –105 and +2 years. Entering both limits dates to birth-5 and death+2. Lifespan filtering doesn’t work for record collections with implicit dates, such as a census date.
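The lifespan rules are simple enough to write down. This sketch uses the offsets given in the talk; the function names are hypothetical, and treating a record with no explicit date as "not filtered" is my reading of the implicit-date remark.

```python
def lifespan_window(birth_year=None, death_year=None):
    """Return the (earliest, latest) record year allowed, or None if no limit."""
    if birth_year is not None and death_year is not None:
        return (birth_year - 5, death_year + 2)
    if birth_year is not None:
        return (birth_year - 5, birth_year + 102)
    if death_year is not None:
        return (death_year - 105, death_year + 2)
    return None  # neither year entered, so nothing to filter on


def passes_lifespan_filter(record_year, birth_year=None, death_year=None):
    """True if a record dated record_year survives lifespan filtering.

    Assumption: a record with no explicit date (record_year is None),
    such as one from a collection with an implicit date, is never filtered.
    """
    window = lifespan_window(birth_year, death_year)
    if window is None or record_year is None:
        return True
    earliest, latest = window
    return earliest <= record_year <= latest
```

For example, entering only a birth year of 1850 allows records dated 1845 through 1952. Adding a death year of 1920 narrows that to 1845 through 1922.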

Speaking of censuses, the 1930 census has a bunch of people who were (incorrectly) listed as age 120. When a birth date is specified, lifespan filtering inadvertently filters these out. To counter this bug, also specify a death date many years afterwards.

For locations, the highest scores are given to a city match, then county, then adjacent counties, then state, then adjacent state, and finally, entire country.

Like FamilySearch, the Ancestry list of places contains mostly modern place names. If a place doesn’t exist anymore, use the smallest matching location, then add the specific historic place name in the keyword field. [Note to self: There’s a lot of Washington, Iowas. Note to readers: still sleepier. I’m trying to hold it together to get this posted tonight. Note to self: FamilySearch has a long, long, long way to go to catch up to Ancestry.]

Miscellaneous

Specifying a gender excludes the other gender, but not those without gender.

Specifying a keyword searches all indexed fields, whether or not they are shown on the search form.

When Collection Priority is not specified, collections from the country of the Ancestry website you are using are favored. Specifying Collection Priority explicitly changes the scoring.

Some of these options are “sticky.” That is, they are remembered. Last name filters, collection priority, and Restrict To are sticky.

When using wildcards, the first or last letter can be a wildcard, but not both.
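The wildcard rule boils down to a one-line check. In this sketch I'm assuming the wildcard characters are `*` and `?`, and the function name is made up. Any other wildcard restrictions Ancestry enforces are not covered.

```python
WILDCARDS = {"*", "?"}  # assumed wildcard characters


def wildcard_placement_ok(term):
    """True unless both the first and the last character are wildcards."""
    if not term:
        return True
    return not (term[0] in WILDCARDS and term[-1] in WILDCARDS)
```

So `Meisberg*` and `*berger` pass the check, but `*berg*` does not.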

Regardless of the Restrict To setting, tree results are not mixed in with record results. However, if a really great tree result is found, it will be listed before all record results.

Note to readers: I made it! I’m done! I’m sorry I wasn’t able to take the time to make that last stuff intelligible. Please imagine that it was.

Oh, and play like I came up with a pithy conclusion.

Good night.

4 comments:

  1. Thanks for the information - it all helps. A question about location in old records: when an area was not yet a state, are the records listed under the place where the birth, for example, was originally recorded? Parts of Indiana and the Middle West were originally Virginia, and only later were they pieced down into what is known today as Indiana or Illinois. In those historic days, would the records be found in Virginia, or under the modern state designation, do you think?

  2. blgdiva, records usually remain in the repository of the entity which had (or claimed) jurisdiction when the records were made. For example, during the period when VA claimed part of southwestern present PA, it made some land grants there, and the warrant, survey and grant records are in the Library of Virginia's Land Office collection.

    In the past 75 years or so there have been some intergovernmental co-operation agreements, and the Library of Virginia has made some materials available to other States, such as KY and WV. Also some States have acquired copies of records microfilmed by LDS. So in some cases access to copies of such records may be found on websites (such as KY Secretary of State's) or in State Libraries or Archives (such as West Virginia's in Charleston).

    However, you will not find images of any such early records on Ancestry.com, although sometimes book-published abstracts might be found there -- such as volumes very briefly indexing KY land grants.

    Ancestry.com's emphasis is on obtaining material from 1880 and later, rather than from the 17th and 18th centuries.

  3. Thank you for this information. It confirms what I have been thinking--that less is more when it comes to searching Ancestry. I know it goes against the grain but when I put everything in, my ancestor appears on page 27 (50 results per page). I have even gone so far as to reduce my search to exact surname and exact lived in location (usually a state only or a county only). Then I slowly expand my search parameters. My dream is to be able to specify the search order that I receive my results in. For example, surname within location. When I allow the default and enter Meisberger, I get pages and pages of names beginning with Mc...

  4. Very good summary of the talk. I have a couple of comments. First, the primary identifier with the greatest variability is the name. The one with the least variability is the place, especially if one begins at the most specific government jurisdiction of which one is most certain and then broadens the filter. Traditional best practice in offline collections assumes that a researcher study the jurisdictions to determine place names against time periods. I suspect that in the time it takes to digitize the names of one large local source, Ancestry could imbed such historical jurisdictional data into their search engine. Second, an experienced researcher has some understanding of the types of sources that exist and the content that they provide. The genealogical value varies by source. Assuming that most people interested in their ancestors don't really want to take the time to become expert in sources relevant to their research problem, it is wonderful that Ancestry, Family Search and the other content providers simplify the craft into indexed searches for names. However, the complexities of outsmarting the "scoring" of the system becomes nearly as complex as gaining genealogical knowledge. The consequent high number of irrelevant "hits" creates a great deal of frustration, not unlike searching an unindexed phone book. The inexperienced researcher (the biggest market share) can become frustrated not knowing what he/she has actually accessed, or why the desired individual has not been found. Granted the vast majority of searchers will not become "best practice" genealogists in the traditional sense, why not provide real guidance as to the most likely reason for failure; the fact that the sources with the needed information are simply not digitized yet (provide access to where they can be found away from Ancestry). 
For me, if I don't get a hit with a reasonable amount of wiggling around name variation, I assume the needed parish register for Dingwall, Ross, Scotland in 1750 isn't in the database. Help the researcher by being straight up and specific about what sources by time and locality are not on the database, and how to find them.
