Tuesday, September 29, 2015

How to Navigate Around the Internet Archive Search Bug

There is a bug in Internet Archive’s “Search Inside” a book feature. Don’t trust it. Let me tell you what to do instead.

Let’s say you found your way to a book on Internet Archive (IA). It is A Complete History of Fairfield County, Ohio (at https://archive.org/details/cu31924028848483) by Hervey Scott. You want to see if Jonas Messerly is mentioned in it. You select the search magnifying glass up in the upper-right corner.

Internet Archive's title search icon

You search for “Messerly” and, oops, you just searched IA for titles rather than searching inside that single title.

Internet Archive's title search results

Wait, don’t cuss me out yet; that’s not the bug. That’s just user error and a user interface annoyance.

You find another search magnifying glass icon on the right-hand side, about halfway down the page. The context help popup says “search inside.” You select the icon.

Internet Archive's search inside icon

The page changes a bit and the search icon disappears.

The search inside icon is in a different place in the Internet Archive's full screen view.

Instead of instigating a search, what you’ve just done is switch from one book viewer to another. People in the know tell me that this failure to search is not a bug. Because the design is supposed to do this, it is a WAD, “working as designed.” Fine. Let’s compromise and call it a user interface flaw. But this is still not the bug of which I speak.

The search inside icon has disappeared. The search-all-of-IA box is still up in the upper-right corner of the screen. You fell for that one once before. “Fool me once…” After looking in vain for another search icon, you notice that the search box you previously dismissed, the one that searched for book titles, is now labeled “Search inside”.

The search inside box is at the top in the Internet Archive's full screen view.

Also not the bug of which I speak. It’s another user error and user interface annoyance.

Now comes the bug. You search for “Messerly” and IA erroneously states “No matches were found.”

The Internet Archive's full screen view with no matches found message 

Rather than depend on just the “Search Inside” results, check the raw text. To do this, select the italic I—the “About this book” icon. In the popup, select Plain Text. That brings you to a page containing the raw text from the book. Now use your browser search (^F) to search for Messerly.

Some raw text from an Internet Archive book

There he is on page 73. Now back up to the book viewer and advance to page 73.

Mention of Jonas Messerly in a history of Fairfield County, Ohio

One of the distinct advantages of Internet Archive over Google Books is that downloaded PDF files are searchable. I tested the above book and found that Adobe Reader is not affected by the search bug. You can download from IA with the confidence that your offline study will not be affected.

Mention of Jonas Messerly in a history of Fairfield County, Ohio

Be aware that OCR errors are unaffected by any of this. If a word was not recognized when scanned, then all of these methods will fail to find it.
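
If you would rather script the raw-text check than use your browser search, here is a minimal Python sketch. It assumes you have already saved the Plain Text page to your computer. The function name find_lines and the sample text are made up for illustration; the sample is not a quotation from the book.

```python
def find_lines(text, term):
    """Return (line_number, line) pairs for lines that contain term, ignoring case."""
    needle = term.casefold()
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.casefold()
    ]


# Made-up sample standing in for a saved copy of the Plain Text page.
# With a real file you would read its contents into a string first.
sample = """FAIRFIELD COUNTY
Jonas Messerly, and others, settled here.
No mention on this line."""

for number, line in find_lines(sample, "messerly"):
    print(number, line)
```

Like browser search, this looks for the characters anywhere in a line, so punctuation touching the name does not hide it. It is just as blind to OCR errors as every other method.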

Finally, the Internet Archive is a non-profit organization that accomplishes amazing things with very little money. No one should be surprised that there are flaws in their software. We are all in their debt. They accept contributions at https://archive.org/donate.

9 comments:

  1. At first I thought you were confused by differences in OCR results. At https://archive.org/stream/cu31924028848483#page/n95/mode/2up if you search (at "Search Inside") for "Jonas" on that same page image, you will find it. It seemed that the problem was likely differences in the OCR information stored in each document. But after more experimentation, I realized that virtually all words (including names) that were terminated with commas ("Messerly," in your example) were not being found, but words with no following non-letter glyph were found. In https://ia802604.us.archive.org/32/items/cu31924028848483/cu31924028848483_djvu.xml you can see that the token at a particular location includes the terminating punctuation ("Messerly,", "Kidwell,"). I expected this would be a trivial problem to fix, until I found a counterexample ("Paul,") where the token is found by a search. In any case, I expect this is a bug and that IA can probably fix it easily. (For a single book, it may be as easy as removing punctuation at the end of the tokens in the djvu xml file and rerunning the indexing. It would be better to remove the punctuation as the index is created, but I can imagine that could have other effects.)

    A quick search of the IA forums for "search bug" did not lead me to a posting about this problem. Have you brought the problem to their attention?

    Replies
    1. Brilliant! That's very insightful. I hadn't thought to see what could be learned by looking at the XML file. That explains why the PDF file still works, since Adobe Reader will match on partial words. And that tells me to pay more attention to punctuation. If I can't find what I know to be present, I'll include the punctuation I know follows the word.

  2. This is brilliant! Thank you so much.

  3. I've tried to read and understand this article 4 times and am still confused! Yes, I use IA and no I'm not particularly stupid...but some things need to be shown, not written.

    Replies
    1. Lynn,

      It is not you. I found it particularly difficult to explain clearly. That's why there are so many screen images. You're right, though, if it were shown in person, it would make perfect sense.

      ---tai

  4. I don't have a problem with this bug, but I do with the search bug in Open Library, described in German at http://archiv.twoday.net/stories/1022397050/. I am unable to find anything when I enter more than two words. Ideas?

  5. Thank you so much! I wanted to let you know that this post is included in my NoteWorthy Reads #22: http://jahcmft.blogspot.com/2015/10/noteworthy-reads-22.html Enjoy your weekend!

  6. Well placed A I, I am a regular user of this fabulous site. Why? Well, my uni search engine uses it as a first port of call.
    I had a little hassle at first, but found that if I clicked the 4 cornered/arrows it would bring me up to search the book contents instantly, and in searching xxxx it would line up across the bottom the page access points that relate to the xxxx search. Sometimes this would be literally full, so I started putting dots, commas, spaces, etc. before and after the search term, and that reduces the number of finds. Also, like you say, if you get it in raw text first (that picks up every mark when digitizing) and do your search like always with Ctrl+f, you will find what you are looking for.
    Once you have found the page or pages, then download it in the other formats in the right-hand column/box.
    When you get the hang of it, you most probably won't want to leave it, as it will most likely have 75/80% or more of the answers you need in your research.

  7. Hi - thanks so much for posting this. I love Internet Archive and much prefer them over Google Books - but the search thing has been driving me crazy. Like you, I realized if you switch to a text view, you'll find what you're looking for. Today, I was actually inspired to get on the forum at Internet Archive to complain about this search issue, and found that somebody has cited your blog post here and opened a bug report. Sure hope they fix it - lots of genealogy users will otherwise walk away thinking nothing is there, when there is!
