Last Friday I was excited to see Ancestry.com had added a whole lot of new newspapers to its collection. I took especial notice of the Bozeman Daily Chronicle of Bozeman, Montana and immediately checked it out.
You’ll be happy to know that if you need the Daily Chronicle for 14 October 1926, you’re in luck because that is the only issue in Ancestry.com’s collection. I previously complained about Ancestry.com’s Salt Lake Tribune database. Number of issues at the time? Two. (I’m pleased to find as I write this article that they’ve beefed up the Tribune considerably. However, according to the card catalog, the database has never been updated. That may be why the additions were never announced on the New and Updated Databases page. But I digress…)
The new card catalog can be used to identify the leanest newspapers in the Ancestry.com collection. Sort by record count and page through to the end of the list.
Interestingly, all the sizes are multiples of 60. If one investigates the last newspaper in the list, one finds it consists of a single page from a single issue. One can also verify that the papers with a size of 120 have but two pages. Continuing, one finds that the number of pages is equal to the reported size divided by 60.
Why does Ancestry.com claim newspaper sizes 60 times larger than the actual size?
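For readers who want to repeat the check, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name and the constant are my own, and the factor of 60 is the pattern observed in the card catalog, not a figure Ancestry.com has documented.

```python
# Multiplier inferred from the card catalog sizes (an observation,
# not a documented Ancestry.com constant).
SIZE_PER_PAGE = 60


def pages_from_size(size, per_page=SIZE_PER_PAGE):
    """Return the page count implied by a reported size.

    Returns None when the size is not a whole multiple of per_page,
    which would mean the pattern does not hold for that newspaper.
    """
    pages, remainder = divmod(size, per_page)
    return pages if remainder == 0 else None


# The two sizes discussed above: 60 and 120.
for size in (60, 120):
    print(size, pages_from_size(size))
```

A reported size of 60 yields one page and 120 yields two, matching what one finds by opening those newspapers. Any size that comes back as None would be an exception worth a closer look.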
You may recall we previously saw an anomaly in the new card catalog that led us to the assumption that the size displayed by the card catalog is a name count rather than a record count. (See “Extras in the Ancestry.com New Card Catalog.”) Ancestry.com needs to identify what they mean by “Size.”
So what is the probability that each page of these newspapers has exactly 60 names?
Name counts in OCR databases
Estimation is necessary for non-table-style databases such as books and newspapers because the index is an every-word index, obtained via OCR. OCR, optical character recognition, is a software process wherein a computer program attempts to read the images and create a matching document with all the words found on the image.
After the task of recognizing words, the computer still doesn’t know what the words mean. What you and I easily recognize as a person’s name is beyond the computer’s ability to identify with any degree of certainty. Thus, newspaper vendors must estimate the number of names present.
Obviously, some pages will have more and some fewer. To be sure, the subject of a news article will likely be named multiple times. Should a vendor attempt to count unique names? That will undercount the actual number of names. Should a vendor attempt to count unique people? Equating and differentiating people requires intelligence well beyond that of machines.
It would seem from our table above that Ancestry.com assumes an average of 60 names per page across all the pages of a newspaper. In my career, I’ve seen samples justifying higher numbers, when repeated names are included. Overall, I think 60 is very reasonable.
My recommendations to Ancestry.com and FamilySearch (and other vendors) are to be completely transparent in your size claims:
- Don’t report a size number without identifying if it is a record count or a name count
- Report record counts, name counts, and image counts
- For table-style databases, report the actual number of names present
- When an estimated count is published, designate it as such; I favor the approximation symbol (≈) to identify estimated numbers
- For estimated counts, document how the estimate was obtained
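To show what such labeling might look like, here is a minimal sketch in Python. The function name, its parameters, and the output layout are hypothetical choices of mine, not anything Ancestry.com or FamilySearch publishes. The example figures are those of the single-page newspaper discussed earlier, using the 60-names-per-page assumption.

```python
def format_size(label, count, estimated=False, method=None):
    """Format one size figure, marking estimates with the ≈ symbol.

    label:     what is being counted, e.g. "Images" or "Names"
    count:     the number to report
    estimated: True if the count is an estimate rather than an actual tally
    method:    optional short note on how the estimate was obtained
    """
    prefix = "≈" if estimated else ""
    text = f"{label}: {prefix}{count:,}"
    if estimated and method:
        text += f" (estimated: {method})"
    return text


# A one-page newspaper: the image count is exact, the name count is estimated.
print(format_size("Images", 1))
print(format_size("Names", 60, estimated=True, method="60 names per page"))
```

The point of the design is that a reader can tell at a glance which figure was counted and which was estimated, and where the estimate came from.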
Ancestry.com’s New Newspapers
Here’s the entire list of new newspapers posted Friday: