Friday, August 23, 2013

#FGS2013 - Getting the Most Out of Historical Newspapers

Peter Drinkwater of Newspapers.com gave a session titled “Getting the Most Out of Historical Newspapers” at the 2013 annual conference of the Federation of Genealogical Societies. Drinkwater is director of product management for Newspapers.com, which is owned by Ancestry.com.

Newspapers are a good source of basic genealogical information such as vitals, names, and relationships. But they also contain rich genealogical and biographical information that is so important for family history research.

Newspaper records rank very low in Ancestry.com search results, leaving newspapers an underutilized resource. So Ancestry.com approached Fold3 and asked whether Fold3 could use its technology to build a newspaper site. (Fold3 still operates in separate offices in Lindon, Utah, even though it is owned by Ancestry.com.) The new site, Newspapers.com, launched in November 2012.

Both Ancestry.com and Fold3 have existing newspaper collections, which they will retain. However, neither will be adding to their collections, other than the possibility that Fold3 might add military-specific publications. Ancestry.com has given up on newspapers. The indexes are quite large and they don’t show up very well in search results. (I think it is interesting that FamilySearch wants to get into newspapers just when Ancestry.com is getting out. One must know something the other doesn’t. The question is, which one is it?)

All new additions will be made to Newspapers.com. To acquire new newspapers, they digitize microfilms of newspapers at libraries, they go directly to newspaper publishers, and they do deals with suppliers of microfilmed newspapers.

To publish a newspaper, websites take the newspaper, photograph it, enter basic citation information about each page, and OCR the rest. In the OCR (optical character recognition) process, the computer uses algorithms to try to recognize the shapes of letters, creating searchable text from the photographed image. Modern OCR engines are nearly 100% accurate on clean, new newspapers. But on dark, streaked, or poorly printed old newspaper pages, they make lots of mistakes.

Using OCR instead of humans for indexing is a tradeoff between accuracy and the amount of material that can be published. Using OCR, Newspapers.com is now adding two to three million new pages each month.

When you search content that is OCRed, you must take that into account. Names may be misspelled in new and unusual ways. I’ve seen lowercase d indexed as “cl” or even “c1.” For some fonts or for barely legible images, it is common to see letters misindexed as punctuation marks. Lowercase d might be indexed as “i:]”. You have to think visually instead of phonetically. Fortunately, an article often repeats a surname several times, increasing the chances that it was indexed correctly at least once.
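One way to "think visually" is to generate the misspellings an OCR engine might produce and search for each of them. The following is only a rough sketch. The `OCR_CONFUSIONS` table and the `ocr_variants` function are hypothetical names of my own, and only the substitutions for lowercase d come from the examples above. The other entries are assumptions added for illustration.

```python
from itertools import product

# Hypothetical confusion table: each letter maps to strings an OCR engine
# might produce in its place. Only the "d" entries come from the examples
# in the text; the "m" and "l" entries are illustrative assumptions.
OCR_CONFUSIONS = {
    "d": ["cl", "c1", "i:]"],
    "m": ["rn"],
    "l": ["1"],
}


def ocr_variants(name, max_variants=50):
    """Return the lowercased name plus spellings OCR might plausibly produce."""
    # For every character, list the character itself followed by its
    # possible misreadings.
    options = [[ch] + OCR_CONFUSIONS.get(ch, []) for ch in name.lower()]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        # Long names can explode combinatorially, so stop at a cap.
        if len(variants) >= max_variants:
            break
    return variants


if __name__ == "__main__":
    for variant in ocr_variants("wade"):
        print(variant)
```

Each variant can then be tried as a separate search term. The cap on the number of variants is a practical choice, since a long name with several confusable letters produces far more spellings than anyone would want to search by hand.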

I’ve heard Ancestry.com use the term “bag of words” to describe another problematic characteristic of OCR content. When the computer indexes the page, it doesn’t recognize the contextual semantics of the words. When it reads the word “park” it doesn’t know if it is a surname or a regular word. (I doubt it even knows whether it is a noun or a verb.) You must take this into account as well.
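To make the "bag of words" idea concrete, here is a minimal sketch of an index that keeps only word counts. It is not how Ancestry.com or Newspapers.com actually index pages. The function names and both sample sentences are made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Reduce a page to lowercase word counts, discarding order and context."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def matches(query, text):
    """Report whether the query word appears anywhere in the page's bag."""
    return bag_of_words(text)[query.lower()] > 0


# Two invented sample pages: one uses "Park" as a surname, the other as
# an ordinary noun.
page_a = "Mr. John Park of Lindon visited his sister on Friday."
page_b = "The children played in the city park on Friday."

if __name__ == "__main__":
    print(matches("Park", page_a))  # surname
    print(matches("Park", page_b))  # ordinary word
```

Both pages match a search for "Park" because the bag keeps no record of capitalization, word order, or part of speech. That is why a surname that doubles as a common word returns so many irrelevant hits.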

“Newspaper search is more art than science,” said Drinkwater.

Sometimes it helps to search for first and last names in quotes, like “Emma Walk”. When you do this, remember that you must issue separate searches for the name with and without a middle name.
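Because a quoted search matches only the exact phrase, it can help to write out the list of queries before you start. The sketch below builds that list. The `phrase_queries` function is a hypothetical helper, the middle name "Jane" is invented for the example, and the middle-initial form is my own assumption about a variant worth trying.

```python
def phrase_queries(first, last, middles=()):
    """Build quoted exact-phrase queries, with and without middle names."""
    # The plain first-and-last form always comes first.
    queries = [f'"{first} {last}"']
    for middle in middles:
        if not middle:
            continue  # skip empty strings so middle[0] is always safe
        queries.append(f'"{first} {middle} {last}"')
        # Assumption: newspapers may print a middle initial instead of
        # the full middle name.
        queries.append(f'"{first} {middle[0]}. {last}"')
    return queries


if __name__ == "__main__":
    for query in phrase_queries("Emma", "Walk", ["Jane"]):
        print(query)
```

Each query in the list is run as its own search, since the site will not treat “Emma Walk” as a match for “Emma Jane Walk”.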

Before performing a newspaper search, check that the website you are using has newspapers from the time and locale likely to contain the information you seek. Keep in mind that newspapers often carried news from surrounding communities. There are several ways to check coverage on Newspapers.com. One way is to browse the collection. Another is to search for the name of the place. Another is to use the map located at http://Newspapers.com/map. The map has pins showing locations with papers.

7 comments:

  1. Long time reader, first time commenter.

    I have access (and a copyright release!) to the microfilms of a county newspaper that has been published weekly for nearly 130 years. Unfortunately the cost to digitize these microfilms is in the $6000 range (not including storage, backups or a website).

    Do you or any of your readers have suggestions of ways to publish this resource?

    Thanks,
    Michael Moore
    stuporglue@gmail.com

    Replies
    1. Newspapers.com would probably be interested. It's too early to tell if FamilySearch will be interested.

      --The Ancestry Insider

  2. Michael--Chronicling America is digitizing newspapers throughout the country with the help of the Library of Congress. I'd look for the folks in your state and get in touch to see if they might have plans to digitize them, or if you could work together to make it happen. http://chroniclingamerica.loc.gov/ Hope that helps.

    Joyce Homan, Genealogical Society of PA, execdir@genpa.org

  3. Michael--I'm pretty sure that Mocavo still has an offer to digitize materials owned by individuals and non-profits. Check them out at http://www.mocavo.com/freescanning
    Ginny in Seattle
    [no connection to Mocavo]

  4. Another trick I use for finding people is searching for their home addresses, which often appeared in articles about them. A bonus is that searching for an address sometimes brings up articles about events that happened at the house (fires, for example) as well as classified real estate advertisements, which can tell you when a house was up for sale or for rent and are useful for tracking people's movements. You also have to bear in mind that public references to people's first names were less common in the early 20th century and that married women, in particular, would most likely be named in a newspaper as "Mrs. [Husband's initials or first name] [surname]".

  5. Look for archives at academic/university libraries in your area. Also for historical societies. Same for public libraries, and for consortiums of public libraries. Some states use their consortiums to make purchases, some to do big projects like digitizing. Don't give up! This is a wonderful resource which is probably decaying (film does decay)!

  6. Funny you should have this as a discussion right now. Might I add a suggestion? NEVER NEVER NEVER do business with NewspaperArchive.com. Their BBB rating is an F after blowing off over a hundred customer complaints, which accurately represents how they cheat their customers, change the billing terms and price, refuse refunds, and do not respond to customer issues.

    The interface is poor and doesn't work half the time, the OCR is the worst of any I've seen, the images are low dpi, and it's hard to navigate the site.

    Just a completely terrible experience. Do not use NewspaperArchive.com ever. Ever.
