Monday, October 29, 2012

Monday Mailbox: Errors in Ancestry.com City Directories

Dear Ancestry Insider,

I have been a member of ancestry.com for many years. In the past year, especially, I've noticed many, maNY, MANY errors in the transcribing and indexing on ancestry.com.

I accessed the City Directories for Enid, OK and wow, what I found was astoundingly egregious. All you have to do to verify what I'm saying is to start with the 1905 Enid, OK City Directory and search for Joseph Hutcheson. You'll go mad trying to find a decent index for him.

Access the City Directory for Enid, OK and start going up the years. On the way, please do enjoy such genealogical tidbits as "Walnut Hyde" (taken from the street name and last name....don't ask me) and "Baking Soda Tea Bisquits..." taken from a store advertisement. Then, there's also "telephone 7 days a week" and "Ill" instead of an "H", or "II" (as in "The Second"), instead of an "H", even after women's names.

Some of the most egregious errors I've ever seen! It is quite apparent that whoever is doing the transcribing is not only unfamiliar with the English language and common names, but with the English alphabet, as well. Beyond that, there is apparently NO quality control - no checking for veracity. These butchered transcriptions and indexes are sent to Ancestry.com where they're uploaded immediately - if they do random checking, then they're using aliens or people who don't speak English ...or children who can't spell...or read, to do it.

The way the transcriber doesn't seem to understand some letters tells me that whoever is transcribing doesn't speak English well and is probably sitting in front of some kind of chart. But since these people are probably paid by the record, their managers are probably telling them to hurry, hurry, hurry. That's their incentive. And that, sadly, is the reality of what's happening to the records at Ancestry.com.

Signed,
Patriot Gal*

Dear Patriot Gal,

It will come as no surprise to you to learn that about a year ago Ancestry.com started employing a whole new set of indexers for their city directories. These indexers do not speak English. They have problems reading. As you surmised, they work from some kind of chart, as the English alphabet is foreign to them.

While they hurry quite a bit faster than you or I would, it is not greed or speed that induces them to count advertisements as directory entries. The problem there is pure stupidity. That’s right. They are not smart enough to always recognize what are obviously advertisements.

They aren’t children or space aliens. These new indexers are computers. And despite what some people think, computers are very stupid, especially when it comes to reading and comprehension. It is a great achievement that Ancestry has coerced the degree of accuracy that it has from these directories. Most computer-read content is far, far worse.

For more information, see my article, “Data Extraction Technology at Ancestry.com.”

Signed,
--The Insider

9 comments:

  1. I am not employed by ancestry.com, nor do I have any affiliation with them. The computers use something called OCR, which stands for Optical Character Recognition. This can be a great aid to adding records to a research database, but as you've seen, it does have its drawbacks. The OCR program probably looks for patterns to put names into its index. If any particular set of words falls into the pattern, it's indexed.

    I sometimes use OCR on my small research database. However, I manually check almost every record to make sure it looks right. I don't think ancestry.com could use this method with the large volume of records they load.

    Regards, Jim
    Hidden Genealogy Nuggets

  2. That same OCR process, when applied to the database "Oconee County, Georgia Probate Records, 1875-2010," has led to the appearance of an enormous military population. Many of those records date from an era when race was distinguished on the records, and "black" was marked by "Col." for colored, as was standard in that time and place.

    So, now we have hundreds - thousands? - of male and female colonels, in an era when women did not even serve in the military.

  3. So they apparently need to hire proofreaders to check on the computers. Maybe a few underemployed humans, such as myself, could be of help.

  4. As an example of how OCR can be used effectively, look at the National Library of Australia's newspaper digitisation program at http://trove.nla.gov.au/newspaper where users are encouraged to make corrections to the OCR results. It has become an immensely useful site which allows tagging (both public and private), displays the original text and translated text side by side, and provides a number of ways to save an article with examples of the proper citation.

  5. Quite a different problem with the City Directories is that the encoding for saving the extract (to a tree individual) in every instance omits the County identification from the place-name field.

  6. To all concerned with indexing errors: I suggest you simply don't use the index, and instead look at the actual Directories or Ship Manifests or Census Enumerations or whatever yourself.

    I agree the indexing is bad, so just don't use it.

    I am sorry to sound cynical but it should be fairly obvious to all what the best course of action is.

  7. I too have noticed ridiculous errors at Ancestry, and not just in the city directories. Many times the transcribers will write a name that couldn't possibly exist in Ireland (where my ancestors come from) or much of anywhere else on the planet. This cannot be computer error, though, since computers can't read handwriting yet as far as I know. The point is that at Ancestry quality control is unknown. And we pay $25 or $30 for this?

  8. The recent City Directories so obviously do not use OCR. The older ones did and had a better success rate. These are being indexed by humans who can't read English and users who can't figure out where to add corrections!

  9. IMO that's not as bad as the transcription of the 1870 census for Philadelphia - the first enumeration - which was probably transcribed way before they outsourced! I haven't checked every district, but lots and lots of people were born in the Philippines! I let them know when I saw it. I would never have found it if I hadn't been trying to drill down manually. Now I know why I can never find anyone in this census! :)

    Name: Richard Dennis
    Age in 1870: 45
    Birth Year: abt 1825
    Birthplace: Philippines
    Home in 1870: Philadelphia Ward 14 District 42, Philadelphia, Pennsylvania
