Tuesday, June 2, 2009

Name counts for Newspapers

[Image: Bozeman Daily Chronicle on Ancestry.com]

Last Friday I was excited to see that Ancestry.com had added a whole lot of new newspapers to its collection. I took especial notice of the Bozeman Daily Chronicle of Bozeman, Montana, and immediately checked it out.

You’ll be happy to know that if you need the Daily Chronicle for 14 October 1926, you’re in luck because that is the only issue in Ancestry.com’s collection. I previously complained about Ancestry.com’s Salt Lake Tribune database. Number of issues at the time? Two. (I’m pleased to find as I write this article that they’ve beefed up the Tribune considerably. However, according to the card catalog, the database has never been updated. That may be why the additions were never announced on the New and Updated Databases page. But I digress…)

The new card catalog can be used to identify the leanest newspapers in the Ancestry.com collection. Sort by record count and page through to the end of the list.

Interestingly, all the sizes are multiples of 60. If one investigates the last newspaper in the list, one finds it consists of a single page from a single issue. One can also verify that the papers with a size of 120 have but two pages. Continuing, one finds that the number of pages is equal to the reported size divided by 60.

Title                                                                 Size  Pages
Chicago Daily News (Chicago, Illinois)                                 360  6
Weekly Evening Gazette (Reno, Nevada)                                  360  6
Decatur Daily Review (Decatur, Illinois)                               360  6
The Daily Mail (Charleroi, Pennsylvania)                               300  5
The Southern Immigrant (Cullman, Alabama)                              300  5
The Hancock News (Hancock, Wisconsin)                                  240  4
The State Sentinel (Decatur, Illinois)                                 240  4
East St Louis Journal (East Saint Louis, Illinois)                     240  4
Californian (Monterey, California)                                     240  4
Bulletin Sentinel (Decatur, Illinois)                                  120  2
Bridgeport Sunday Post (Bridgeport, Connecticut)                       120  2
Weekly Decatur Magnet (Decatur, Illinois)                               60  1
New York Times (April 15th, 1865)                                       60  1
The Mountain Democrat and Placerville Times (Placerville, California)   60  1

Why does Ancestry.com claim newspaper sizes 60 times larger than the actual size?

You may recall we previously saw an anomaly in the new card catalog that led us to the assumption that the size displayed by the card catalog is a name count rather than a record count. (See “Extras in the Ancestry.com New Card Catalog.”) Ancestry.com needs to identify what they mean by “Size.”

So what is the probability that each page of these newspapers has exactly 60 names?

Not very likely, is it? In response to my article “Unbelievable Name Count Claims,” Paul Allen of WorldVitalRecords.com commented that vendors sometimes estimate the number of names in a database.

Name counts in OCR databases

Estimation is necessary for non-table-style databases such as books and newspapers because the index is an every-word index, obtained via OCR. OCR, optical character recognition, is a software process wherein a computer program attempts to read the images and create a matching document with all the words found on the image.

After the task of recognizing words, the computer still doesn’t know what the words mean. What you and I easily recognize as a person’s name is beyond the computer’s ability to identify with any degree of certainty. Thus, newspaper vendors must estimate the number of names present.

Obviously, some pages will have more names and some fewer. To be sure, the subject of a news article will likely be named multiple times. Should a vendor attempt to count unique names? That will undercount the actual number of names. Should a vendor attempt to count unique people? Equating and differentiating people requires intelligence well beyond that of machines.

It would seem from our table above that Ancestry.com assumes an average of 60 names per page across all the pages of a newspaper. In my career, I’ve seen samples justifying higher numbers, when repeated names are included. Overall, I think 60 is very reasonable.
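
For the curious, the arithmetic behind that inference can be checked in a few lines of Python. This is only a sketch: the figures are copied from the table above, while the function name and the default of 60 are my own labels for illustration, not anything Ancestry.com has published.

```python
# (title, size reported by the card catalog, pages found by inspection),
# copied from the table above.
rows = [
    ("Chicago Daily News (Chicago, Illinois)", 360, 6),
    ("Weekly Evening Gazette (Reno, Nevada)", 360, 6),
    ("Decatur Daily Review (Decatur, Illinois)", 360, 6),
    ("The Daily Mail (Charleroi, Pennsylvania)", 300, 5),
    ("The Southern Immigrant (Cullman, Alabama)", 300, 5),
    ("The Hancock News (Hancock, Wisconsin)", 240, 4),
    ("The State Sentinel (Decatur, Illinois)", 240, 4),
    ("East St Louis Journal (East Saint Louis, Illinois)", 240, 4),
    ("Californian (Monterey, California)", 240, 4),
    ("Bulletin Sentinel (Decatur, Illinois)", 120, 2),
    ("Bridgeport Sunday Post (Bridgeport, Connecticut)", 120, 2),
    ("Weekly Decatur Magnet (Decatur, Illinois)", 60, 1),
    ("New York Times (April 15th, 1865)", 60, 1),
    ("The Mountain Democrat and Placerville Times (Placerville, California)", 60, 1),
]

# Reported size divided by page count, collected into a set.
# If every paper uses the same multiplier, the set has exactly one member.
ratios = {size / pages for _, size, pages in rows}
print(ratios)


def estimated_name_count(pages, names_per_page=60):
    """Hypothetical estimator: a flat per-page average times the page count.

    The default of 60 is inferred from the table, not documented by the vendor.
    """
    return pages * names_per_page


# Every reported size matches the flat estimate.
matches = all(estimated_name_count(pages) == size for _, size, pages in rows)
print(matches)
```

Every row reduces to the same ratio, which is what one would expect from a flat per-page estimate and not from an actual count.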

My recommendations to Ancestry.com and FamilySearch (and other vendors) are to be completely transparent in your size claims:

  • Don’t report a size number without identifying if it is a record count or a name count
  • Report record counts, name counts, and image counts
  • For table-style databases, report the actual number of names present
  • When an estimated count is published, designate it as such (I favor use of the approximation symbol (≈) to identify estimated numbers)
  • For estimated counts, document how the estimate was obtained
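
To make the recommendations concrete, here is a minimal sketch of what a transparent size report might carry. Everything in it is hypothetical: the class, its field names, and the example values (including treating one page as one record and one image) are my own assumptions for illustration, not any vendor's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SizeReport:
    """Hypothetical size report; the field names are illustrative only."""
    title: str
    record_count: int
    image_count: int
    name_count: int
    name_count_is_estimate: bool = False
    estimate_method: Optional[str] = None

    def __post_init__(self):
        # An estimated count must say how the estimate was obtained.
        if self.name_count_is_estimate and not self.estimate_method:
            raise ValueError("estimated counts must document the method used")

    def describe(self) -> str:
        # Prefix estimated name counts with the approximation symbol.
        prefix = "≈" if self.name_count_is_estimate else ""
        line = (
            f"{self.title} | records: {self.record_count:,} | "
            f"images: {self.image_count:,} | names: {prefix}{self.name_count:,}"
        )
        if self.name_count_is_estimate:
            line += f" (estimate: {self.estimate_method})"
        return line


# Example values are assumptions: one page, counted as one record and one image.
report = SizeReport(
    title="The Mountain Democrat and Placerville Times",
    record_count=1,
    image_count=1,
    name_count=60,
    name_count_is_estimate=True,
    estimate_method="assumed 60 names per page x 1 page",
)
print(report.describe())
```

A reader seeing a line like this knows at a glance which numbers were counted and which were estimated, and how.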

Ancestry.com’s New Newspapers

Here’s the entire list of new newspapers posted Friday:

Genealogy Database Title Posted
Deming Headlight, The (Deming, New Mexico) 5/29/2009
Star Herald (Scottsbluff, Nebraska) 5/29/2009
Bozeman Daily Chronicle (Bozeman, Montana) 5/29/2009
Dundee Record (Dundee, New York) 5/29/2009
Daily Journal (Herrin, Illinois) 5/29/2009
Boone Svenska Herald (Boone, Iowa) 5/29/2009
Cherokee Daily Times (Cherokee, Iowa) 5/29/2009
Dyersville Commerrial (Dyersville, Iowa) 5/29/2009
Evening Journal Farm Edition (Washington, Iowa) 5/29/2009
Gladbrook Tama Northern (Gladbrook, Iowa) 5/29/2009
Grinnell Herald Register (Grinnell, Iowa) 5/29/2009
Hopkinton Leader (Hopkinton, Iowa) 5/29/2009
Humboldt Republican (Humboldt, Iowa) 5/29/2009
Jewell Record (Jewell, Iowa) 5/29/2009
Lake City Graphic (Lake City, Iowa) 5/29/2009
Leon Journal Record (Leon, Iowa) 5/29/2009
Lone Tree Reporter (Johnson, Iowa) 5/29/2009
Manchester Democrat Radio (Manchester, Iowa) 5/29/2009
Manning Monitor (Manning, Iowa) 5/29/2009
Marshall County Time (Marshall, Iowa) 5/29/2009
North Iowa Times (McGregor, Iowa) 5/29/2009
Sac Sun (Sac County, Iowa) 5/29/2009
Schaller Herald (Schaller, Iowa) 5/29/2009
Story City Herald (Story City, Iowa) 5/29/2009
The Alabama Courier (Athens, Alabama) 5/29/2009
Bruce News-Letter (Bruce, Wisconsin) 5/29/2009
Green-Bay Intelligencer (Navarino, Wisconsin) 5/29/2009
The Fairfield Daily Ledger (Fairfield, Iowa) 5/29/2009
Indiana Progress (Abbeville, Alabama) 5/29/2009
The Western Telegraph (Rossville, Ohio) 5/29/2009
Chicago Heights Star Sports (Chicago Heights, Illinois) 5/29/2009
Belmont Gazette (Belmont, Wisconsin) 5/29/2009
Oconomowoc Democrat (Oconomowoc, Wisconsin) 5/29/2009
Yellow River Lumberman (Necedah, Wisconsin) 5/29/2009
Oconto County Enterprise (Oconto, Wisconsin) 5/29/2009
Du Buque Visitor (Du Buque, Wisconsin) 5/29/2009
Winnebago Anzeiger (Menasha, Wisconsin) 5/29/2009
The Maiden Rock Press (Maiden Rock, Wisconsin) 5/29/2009
The Rice Lake Times (Rice Lake, Wisconsin) 5/29/2009
The Benton Advocate (Benton, Wisconsin) 5/29/2009
The Revealer (Bloomington, Wisconsin) 5/29/2009
The People's Champion (Ellsworth, Wisconsin) 5/29/2009
Prison Press (Waupun, Wisconsin) 5/29/2009
The Fairchild Graphic (Fairchild, Wisconsin) 5/29/2009
Fox Lake Representative (Fox Lake, Wisconsin) 5/29/2009
The Wisconsin Standard (Geneva, Wisconsin) 5/29/2009
Voice-Herald (Hales Corners, Wisconsin) 5/29/2009
The Breeze (Pardeeville, Wisconsin) 5/29/2009
Beloit Journal, of Politics, Literature, and General Intelligence (Beloit, Wisconsin) 5/29/2009
Agitator (Wellsborough, Pennsylvania) 5/29/2009
Arizona Silver Belt (Miami, Arizona) 5/29/2009
Charleston Daily Mail (Charlestown, West Virginia) 5/29/2009
Fort Dodge Messenger And Chronicle (Fort Dodge, Iowa) 5/29/2009
Horicon Argus (Horicon, Wisconsin) 5/29/2009
Lockhart Post-Register (Lockhart, Texas) 5/29/2009
Marble Rock Journal (Marble Rock, Iowa) 5/29/2009
Monroe Sentinel (Monroe, Wisconsin) 5/29/2009
Mt Pleasant News (Mt Pleasant, Iowa) 5/29/2009
Natrona County Tribune (Casper, Wyoming) 5/29/2009
Neenah Bulletin (Neenah, Wisconsin) 5/29/2009
Newcastle News-Journal (Newcastle, Wyoming) 5/29/2009
Northwestern Record (Sheboygan Falls, Wisconsin) 5/29/2009
Reedsburg Herald (Reedsburg, Wisconsin) 5/29/2009
Richland County Observer (Richland Center, Wisconsin) 5/29/2009
Mt Ayr Journal (Mount Ayr, Iowa) 5/29/2009
Fayette Journal (Fayetteville, West Virginia) 5/29/2009
Lebanon Daily News (Lebanon, Pennsylvania) 5/29/2009
West Eau Claire Argus (West Eau Claire, Wisconsin) 5/29/2009
Milford Mail (Milford, Iowa) 5/29/2009
Terril Record (Terril, Iowa) 5/29/2009
Lake Park News (Lake Park, Iowa) 5/29/2009
Spirit Lake Beacon (Spirit Lake, Iowa) 5/29/2009
Adams County Free Press (Corning, Iowa) 5/29/2009
Desert Hot Springs Sentinel (Desert Hot Springs, California) 5/29/2009
Aiken Standard (Aiken, South Carolina) 5/29/2009
Tyrone Daily Herald (Tyrone, Pennsylvania) 5/29/2009
Daily News (Huntingdon, Pennsylvania) 5/29/2009
Tyrone Star (Tyrone City, Pennsylvania) 5/29/2009
Evening Chronicle (Marshall, Michigan) 5/29/2009
Sioux County Capital (Orange City, Iowa) 5/29/2009
Alton Democrat (Alton, Iowa) 5/29/2009
Boyden Reporter (Boyden, Iowa) 5/29/2009
Sioux Center News (Sioux Center, Iowa) 5/29/2009
Hospers Tribune (Hospers, Iowa) 5/29/2009
Ireton Ledger (Ireton, Iowa) 5/29/2009
Maurice Times (Maurice, Iowa) 5/29/2009
Sioux County Index (Hull, Iowa) 5/29/2009
Avalanche (Lubbock, Texas) 5/29/2009
Laredo Times (Laredo, Texas) 5/29/2009

4 comments:

  1. Insider,

    Very good general topic, i.e. newspaper databases. All of the major providers reek of lack of integrity in advertising. Ancestry and Newspaper Archive, from which it gets a lot (most?) of its newspaper material, are following a familiar pattern:

    1) Hyping numbers of names or titles added, when what is needed is broad coverage of a locality for decades, instead of scattered odd papers with fewer than 10 issues (or even only one full year). GenealogyBank (part of Newsbank) is the title count hyper sine qua non.

    2) Deceptive timespans claimed. On Newspaper Archive's site, if one clicks on a title, it is not rare to find a claimed run of decades and then, looking deeper, to find issues only at each end of that span. GenealogyBank does this too with their "America's Obituaries", claiming for example to have obits for a newspaper from 1977 to current when, based on multiple searches for known obits/surnames, they can only have sparse coverage. Ancestry's obit coverage isn't good, but they don't make the claims of obituary completeness that GB does.


    Newspapers potentially have a lot of promise, but one needs so much coverage to find a little. Kind of like panning for gold. Especially for early papers. On GB's blog a while back found here:

    http://blog.genealogybank.com/2009/01/genealogybank-adds-170-newspapers-from.html

    I commented and questioned TK on these issues, but without getting a straight answer for most of them. I asked why not really push to cover a state or two a year instead of a scattershot approach everywhere that the major sites seem to think is a best practice. And I asked about the costs of approaching say the Indiana State Library and seeking to digitize the state's newspaper microfilm collection (created through grants from the US Newspaper Project). Is it too expensive for the potential rewards? I don't know but am interested in the answer.

    In summary give me long continuous runs of newspapers in areas I research, not scattered issues here and there while claiming huge title and name counts.

    Mike

  2. Mike,

    Thanks for your thoughts.

    Obtaining newspapers from microfilm is much, much cheaper than scanning actual pages.

    The bug-a-boo comes in the form of intellectual property rights. Possession of the microfilm doesn't give the possessor the rights to permit copying it. The party that sold the microfilm to the library has contractual rights, since there is a purchase agreement, explicit or implied. The party that produced the microfilm might claim copyrights, to the extent that it can be argued that producing a good microfilm image is more art than science. And the party that possessed the originals has the contractual rights agreed to between itself and the filming company; that agreement probably places boundaries around what the microfilming company can do.

    The contracts involved were usually written early and don't address the question of digital imaging. Contracts have been lost, so companies don't always know what their working boundaries are. Libraries, the poorest of these organizations, are at the end of the food chain, so face the greatest risk of lawsuit.

    For all these reasons, holders of newspaper microfilm collections are sometimes reluctant to allow digitization of their collections.

    -- The Ancestry Insider

  3. Insider,

    Thanks for the response. I am sure what you said is valid, and the legal flip side of the cost side. But I specifically mentioned the IN State Library's holdings for a reason. They filmed papers through federal grants of the US Newspaper Project from what I understand (though I by no means know all the details). So the ISL produced the microfilm. Then for all such holdings not still under copyright, why can't Ancestry/GB/whoever, approach them and say they will digitize the collection for their commercial site, and in return, allow free access at all Indiana public libraries or something like that (but without home access through the libraries to retain potential for in-state customers). And if there are indeed any other dangling legal issues, then as part of the state government the ISL can invoke sovereign immunity (note that I am not talking about even coming close to breaching copyrights).

    My main point is the question of whether such a plan is feasible within a certain timeframe and then can be applied to other states. I don't live in IN but just am familiar with the newspaper holdings there from investigating same (which are mostly available through inter-library loan to other states).

    Mike

  4. Good sleuth work on figuring out the name count issue. Ancestry.com does have its issues, but it's the best thing we've got going for online research at the moment. I met some Ancestry.com executives a few years ago when I filmed a testimonial for them for an infomercial (something I got invited to do after submitting a written testimonial), and I was really surprised that most of them didn't seem to know exactly how their own website worked! My husband and I were teaching THEM about some of the functions of the site that they didn't even know about!

    Stephanie at the Irish Genealogical Research blog
