Peter Drinkwater of Newspapers.com gave a session titled “Getting the Most Out of Historical Newspapers” at the 2013 annual conference of the Federation of Genealogical Societies. Drinkwater is director of product management for Newspapers.com, which is owned by Ancestry.com.
Newspapers are a good source of basic genealogical information such as vitals, names, and relationships. But they also contain rich genealogical and biographical information that is so important for family history research.
Newspaper search results rank very low in Ancestry.com search results, leaving newspapers an underutilized resource. So Ancestry.com approached Fold3 and asked if they could take their technology and build a newspaper site. (Fold3 still operates in separate offices in Lindon, Utah, even though they are owned by Ancestry.com.) The new site, Newspapers.com, launched in November 2012.
Both Ancestry.com and Fold3 have existing newspaper collections, which they will retain. However, neither will be adding to their collections, other than the possibility that Fold3 might add military-specific publications. Ancestry.com has given up on newspapers. The indexes are quite large and they don’t show up very well in search results. (I think it is interesting that FamilySearch wants to get into newspapers just when Ancestry.com is getting out. One must know something the other doesn’t. The question is, which one is it?)
All new additions will be made to Newspapers.com. To acquire new newspapers, they digitize microfilms of newspapers at libraries, they go directly to newspaper publishers, and they do deals with suppliers of microfilmed newspapers.
To publish a newspaper, websites photograph it, enter basic citation information about each page, and OCR the rest. In the OCR (optical character recognition) process, the computer uses algorithms to try to recognize the shapes of letters, creating searchable text from the photographed image. Modern OCR engines are nearly 100% accurate on clean, new newspapers. But on dark, streaked, or poorly printed old newspaper pages, they make lots of mistakes.
Using OCR instead of humans for indexing is a tradeoff between accuracy and the amount of material that can be published. Using OCR, Newspapers.com is now adding two to three million new pages each month.
When you search content that is OCRed, you must take that into account. Names may be misspelled in new and unusual ways. I’ve seen lowercase d indexed as “cl” or even “c1.” For some fonts or for barely legible images, it is common to see letters misindexed as punctuation marks. Lowercase d might be indexed as “i:]”. You have to think visually instead of phonetically. Fortunately, an article often repeats a surname several times, increasing the chances that it was indexed correctly at least once.
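To make "thinking visually" concrete, here is a minimal Python sketch that lists the spellings an OCR engine might have produced for a name, so each can be tried as a separate search. The function name `ocr_variants` and the confusion table are my own illustration, not anything Newspapers.com provides. Only the entries for lowercase d come from the examples above; the entries for m and l are assumptions about look-alike shapes.

```python
from itertools import product

# Hypothetical confusion table: each letter maps to the strings an OCR engine
# might emit in its place. The "d" entries come from the examples in this post;
# the "m" and "l" entries are illustrative assumptions.
OCR_CONFUSIONS = {
    "d": ["d", "cl", "c1", "i:]"],
    "m": ["m", "rn"],
    "l": ["l", "1"],
}


def ocr_variants(name, limit=50):
    """Return plausible OCR misreadings of a name, correct spelling first."""
    # For every character, list the strings it could have been read as.
    options = [OCR_CONFUSIONS.get(ch, [ch]) for ch in name.lower()]
    variants = []
    # Combine one choice per character; stop once the limit is reached.
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= limit:
            break
    return variants


if __name__ == "__main__":
    for spelling in ocr_variants("Reed"):
        print(spelling)
```

The list grows quickly for names with several confusable letters, which is why the sketch caps it with `limit`. In practice you would try the correct spelling first and fall back to the variants only when it finds nothing.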
I’ve heard Ancestry.com use the term “bag of words” to describe another problematic characteristic of OCR content. When the computer indexes the page, it doesn’t recognize the contextual semantics of the words. When it reads the word “park” it doesn’t know if it is a surname or a regular word. (I doubt it even knows whether it is a noun or a verb.) You must take this into account as well.
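A small Python sketch shows what a "bag of words" throws away. This is only an illustration of the general idea, not a description of how Ancestry.com or Newspapers.com actually index pages; the function name `bag_of_words` and the sample sentence are made up.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Reduce text to lowercase word counts, discarding order and context."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical page text: "Park" appears as a surname, a noun, and a verb.
page = "Mr. Park met friends at the park. They park there daily."
bag = bag_of_words(page)

if __name__ == "__main__":
    # All three uses collapse into a single entry with a count.
    print(bag["park"])
```

Once the page is reduced to counts like this, a search for the surname Park has no way to skip the other two uses of the word.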
“Newspaper search is more art than science,” said Drinkwater.
Sometimes it helps to search for first and last names in quotes, like “Emma Walk”. When you do this, remember that you must issue separate searches for the name with and without a middle name.
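The separate searches can be written out mechanically. The Python sketch below builds the quoted phrases for one person. It assumes the site treats text in straight double quotes as an exact phrase. The helper name `quoted_name_queries` and the middle name "Jane" are hypothetical, and the middle-initial variant is my own addition rather than something Drinkwater mentioned.

```python
def quoted_name_queries(first, last, middle=None):
    """Build exact-phrase queries for a name, with and without a middle name."""
    # The plain first-and-last form is always searched.
    queries = [f'"{first} {last}"']
    if middle:
        # Full middle name, then the middle initial (an added assumption).
        queries.append(f'"{first} {middle} {last}"')
        queries.append(f'"{first} {middle[0]}. {last}"')
    return queries


if __name__ == "__main__":
    for query in quoted_name_queries("Emma", "Walk", middle="Jane"):
        print(query)
```

Each string is meant to be pasted into the search box as its own search, since a quoted phrase matches only that exact sequence of words.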
Before performing a newspaper search, check that the website you are using has newspapers from the time and locale likely to contain the information you seek. Keep in mind that newspapers often carried news from surrounding communities. There are several ways to check coverage on Newspapers.com. One way is to browse the collection. Another is to search for the name of the place. A third is to use the map located at http://Newspapers.com/map, which has pins showing locations with papers.