Ancestry.com vice president Tony Macklin once told me that genealogists like to have control of their searches. They liked exact search because they knew what the search engine was doing. Ancestry subsequently improved the so-called New Search to give users more control, adding per-field options for exact, phonetic, and geographic matching, and so forth.
But advanced users have still disliked Ancestry’s ranked search results. What was going on under the covers? What determined which matches were listed first? The lack of transparency bothers us. Inquiring minds want to know.
That’s why I enjoyed “What Does it Take to Get a Good Result: The Inner Workings of the Ancestry.com Search Engine” by Ancestry’s John Bacus. We learned more about how ranking is done. Bacus is the principal product manager for Search.
The Simple Search form, shown above, starts with name, approximate birth date, and a location. If you know an age instead of a birth date, a calculator is available. The location doesn’t need to be the birth location. It can be birthplace, death place, or any place in between. For slightly more advanced searches, use one or more life event or family relative containers.
Relevance, Ranking, and Scoring
Results are ranked and sorted in an attempt to present the most relevant results first. This is done by calculating a score for each result, and listing the highest scoring records first. The more a record matches the information you enter in the search form, the higher its score. It’s like grading a test, where each record is a student’s test and the information you type in the search form is the answer key.
However, not all fields score the same. The last name field scores the most points, followed by first name, locations and dates. All other matches score the least. An exact match scores more than an inexact one, such as a phonetic match. However, using the Exact search option doesn’t change the score; it merely excludes records that don’t match.
News to me was this: the more fields you enter, the higher the possible score. It’s not like a test where the score always adds to 100. Instead, the total possible points depends on the total number of fields you supply. In mathematical terms, the result is not normalized.
This has an interesting effect. A mediocre match can score higher than a perfect match, if a lot of information is specified in the search form.
Different record collections have different amounts of information, as if each student were handed a test of a different length. A census collection has a lot more questions than a birth record collection. It’s possible to enter so much information in the search form (which is like the answer key) that a mediocre-matching census record might score higher than a perfectly matching birth record, simply because it has more opportunities to score points.
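Here is a minimal sketch of that effect in Python. The point values are my own invention; the session gave only the ordering (last name, then first name, then locations and dates, then everything else). The sketch also counts exact matches only, and the field names are hypothetical.

```python
# Hypothetical point values. Only their ordering comes from the session.
WEIGHTS = {"last_name": 10, "first_name": 8, "birth_place": 5, "birth_year": 5}
OTHER_WEIGHT = 3  # any other field, such as a sibling's name


def score(query, record):
    """Add up the points for every query field the record matches exactly.

    The total is not divided by the number of fields supplied,
    so the score is not normalized.
    """
    total = 0
    for field, wanted in query.items():
        if record.get(field) == wanted:
            total += WEIGHTS.get(field, OTHER_WEIGHT)
    return total


query = {
    "last_name": "Smith", "first_name": "John",
    "birth_year": 1850, "birth_place": "Ohio",
    "sibling_1": "Mary", "sibling_2": "James", "sibling_3": "Ann",
    "sibling_4": "Ruth", "sibling_5": "Levi",
}

# A birth record that matches on every field it contains (no siblings listed).
birth_record = {
    "last_name": "Smith", "first_name": "John",
    "birth_year": 1850, "birth_place": "Ohio",
}

# A census record with the wrong first name and birth year,
# but with five sibling fields that happen to match.
census_record = {
    "last_name": "Smith", "first_name": "Jno",
    "birth_year": 1853, "birth_place": "Ohio",
    "sibling_1": "Mary", "sibling_2": "James", "sibling_3": "Ann",
    "sibling_4": "Ruth", "sibling_5": "Levi",
}

birth_score = score(query, birth_record)    # 10 + 8 + 5 + 5
census_score = score(query, census_record)  # 10 + 5 + 5 * 3
```

With these made-up numbers the birth record scores 28 and the census record scores 30, so the weaker match is listed first. Remove the sibling names from the query and the birth record wins again.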
I’ve always advocated tree-based searching. And I’ve always advocated entering as much information as you can. I still recommend doing that as a first, shotgun search. Then you need to take a rifle shot at specific record types. A census record might list brothers and sisters, but a military draft record doesn’t. If you want to take a shot at a military draft record, don’t enter names of siblings. Does that make sense? You don’t want census records to score higher on the test than draft records.
The problem with entering less information is that you might not enter enough information to uniquely identify your ancestor. What do you do if you wish to find an ancestor in a city directory, a phone book, a voter list, or a yearbook? Entering a common name may return results from thousands of record types in which you have no interest. Some other method must be used to cut down the number of results. One possibility is categories.
Categories are accessible in several places. Among other ways, categories can be used to filter search results. The categories are listed along the left edge of the results. (Below, left.)
The search form is automatically created based upon which fields are present in the category’s record collections. Passenger lists might include fields such as arrival and departure. (Above, right.)
Sometimes indexed fields are not included in a search form, such as ship’s name in the passenger list form above. Use the keyword field when this occurs.
“One of my objectives coming out of here,” said Bacus, “is that all of you will use the advanced form.” It has many advantages. “Additionally, it makes you cool.”
[Note to my readers: As I write this, I’m getting pretty sleepy, so don’t be surprised if something hereafter doesn’t make any sense.]
One choice available on the advanced form is Exact. Setting a field to exact excludes records that have different info for the respective field. And it excludes records that don’t have or don’t index the respective field. For example, requiring family members will exclude US censuses before 1880.
A number of options are available for searching with names. Regardless of how many fields are included in the search, the only thing that is required to match is last name. Click on “use default settings…” to change options.
When using Exact name matching with multiple first names, they can match in any order, but must all be present.
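A rough sketch of that rule follows. The function name is hypothetical, and two details are my assumptions rather than anything said in the session: comparison ignores case, and a record may carry extra first names beyond the ones you typed.

```python
def exact_first_names_match(query_names, record_names):
    """True if every first name in the query appears in the record, in any order.

    Assumptions: case is ignored, and extra names in the record are allowed.
    """
    wanted = set(query_names.lower().split())
    present = set(record_names.lower().split())
    return wanted <= present  # subset test: all wanted names are present
```

So "Mary Ann" matches a record indexed as "Ann Mary", but not one indexed as just "Mary".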
Exact scores the most, then similar, then soundex/phonetic.
For dates, dates within a specified range are scored the same no matter where the date is in the range. However, if a date is outside the range, the score gets progressively lower the further it falls outside the range. To give a little extra bump to the score for a particular range, add an Any Event life event container, without exact. However, the more search fields you’ve used, the less the effect.
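The shape of that date scoring might look something like the sketch below. The flat score inside the range and the falling score outside it come from the session; the straight-line falloff, the point values, and the function name are all assumptions of mine.

```python
def date_score(record_year, range_start, range_end, full_points=5.0, falloff=1.0):
    """Score a record's year against a searched year range.

    Inside the range every year earns the same full score. Outside it,
    the score drops by `falloff` points per year of distance (an assumed
    linear decay) and never goes below zero.
    """
    if range_start <= record_year <= range_end:
        return full_points
    if record_year < range_start:
        distance = range_start - record_year
    else:
        distance = record_year - range_end
    return max(0.0, full_points - falloff * distance)
```

For a search of 1850 plus or minus 2 years, 1848 and 1852 score the same as 1850, while 1854 scores less and 1840 scores nothing.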
Ancestry employs lifespan filtering to remove clearly extraneous results. Specifying just a birth year limits results to –5 and +102 years. Entering just a death year limits results to –105 and +2 years. Entering both limits results to birth–5 and death+2. Lifespan filtering doesn’t work for record collections with implicit dates, such as a census date.
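The window arithmetic is simple enough to write down. This is only a sketch using the offsets given in the session; the function names are hypothetical, and it says nothing about how Ancestry handles collections with implicit dates.

```python
def lifespan_window(birth_year=None, death_year=None):
    """Return the (earliest, latest) years allowed, or None if no year was given."""
    if birth_year is None and death_year is None:
        return None
    # Lower bound: 5 years before birth, else 105 years before death.
    earliest = birth_year - 5 if birth_year is not None else death_year - 105
    # Upper bound: 2 years after death, else 102 years after birth.
    latest = death_year + 2 if death_year is not None else birth_year + 102
    return earliest, latest


def within_lifespan(record_year, birth_year=None, death_year=None):
    """True if a record dated record_year survives the lifespan filter."""
    window = lifespan_window(birth_year, death_year)
    if window is None:
        return True  # nothing to filter on
    earliest, latest = window
    return earliest <= record_year <= latest
```

A birth year of 1850 alone allows 1845 through 1952. Adding a death year of 1920 narrows that to 1845 through 1922.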
Speaking of censuses, the 1930 census has a bunch of people who were (incorrectly) listed as age 120. When birth dates are specified, lifespan filtering inadvertently filters these out. To counter this bug, also specify a death date many years afterwards.
For locations, the highest scores are given to a city match, then county, then adjacent counties, then state, then adjacent states, and finally, the entire country.
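One way to picture the location tiers is as an ordered list, closest match first. The tier names and the evenly spaced point values here are placeholders of mine; the session gave only the order.

```python
# Closest match first. Only this ordering comes from the session.
LOCATION_TIERS = [
    "city", "county", "adjacent_county",
    "state", "adjacent_state", "country",
]


def location_points(tier):
    """Hypothetical points for a location match at the given tier.

    Earlier tiers earn more; an unrecognized tier earns nothing.
    """
    if tier not in LOCATION_TIERS:
        return 0
    return len(LOCATION_TIERS) - LOCATION_TIERS.index(tier)


# Rank some candidate matches, best first.
candidates = ["state", "city", "country", "adjacent_county"]
ranked = sorted(candidates, key=location_points, reverse=True)
```

Here a city match earns 6 made-up points and a country-only match earns 1, so the ranked list comes out as city, adjacent county, state, country.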
Like FamilySearch, the Ancestry list of places contains mostly modern place names. If a place doesn’t exist anymore, use the smallest matching location, then add the specific historic place name in the keyword field. [Note to self: There’s a lot of Washington, Iowas. Note to readers: still sleepier. I’m trying to hold it together to get this posted tonight. Note to self: FamilySearch has a long, long, long way to go to catch up to Ancestry.]
Specifying a gender excludes the other gender, but not those without gender.
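Notice that the gender filter treats missing data differently than the Exact option does. Exact throws out records that lack the field; gender keeps them. A small sketch of the two rules side by side, with hypothetical function and field names:

```python
def passes_exact(record, field, wanted):
    """Exact: the record must contain the field and its value must match."""
    return field in record and record[field] == wanted


def passes_gender(record, wanted):
    """Gender: exclude the other gender, but keep records with no gender."""
    value = record.get("gender")
    return value is None or value == wanted


no_gender = {"last_name": "Smith"}
female = {"last_name": "Smith", "gender": "F"}
```

Searching for a male keeps the record with no gender and drops the female one. Setting the same field to Exact would drop both.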
Specifying a keyword searches all indexed fields, whether or not they are shown on the search form.
When Collection Priority is not specified, records from the country matching the Ancestry website you are using are favored. Specifying a collection priority explicitly changes the scoring.
Some of these options are “sticky.” That is, they are remembered. Last name filters, collection priority, and Restrict To are sticky.
When using wildcards, the first or last letter can be a wildcard, but not both.
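The wildcard rule can be checked mechanically. In this sketch I assume the wildcard characters are * and ?, which the session notes above do not spell out, and the function name is hypothetical. Any other wildcard restrictions Ancestry may enforce are not modeled.

```python
WILDCARDS = ("*", "?")  # assumed wildcard characters


def wildcard_pattern_allowed(pattern):
    """True unless the pattern both starts and ends with a wildcard."""
    starts = pattern.startswith(WILDCARDS)
    ends = pattern.endswith(WILDCARDS)
    return not (starts and ends)
```

So "*son" and "Ander*" are fine, as is "Sm?th", but "*nders*" is rejected.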
Regardless of the Restrict To setting, tree results are not mixed in with record results. However, if a really great tree result is found, it will be listed before all record results.
Note to readers: I made it! I’m done! I’m sorry I wasn’t able to take the time to make that last stuff intelligible. Please imagine that it was.
Oh, and play like I came up with a pithy conclusion.