Friday, August 31, 2007

The Dark Side of Steve Morse's Wonderful Website

Who is Steve Morse?

Dr. Stephen P. Morse is a famous computer inventor, professor, author and researcher. He invented the great-grandfather of the Pentium chips that power today's desktop computers. In 2001 Morse brought his considerable talent, including web programming skills, to improving genealogical website searches. For this work, he has been honored with numerous awards (IAJGS 2003, IAJGS 2006, NGS 2007 and APG 2007).

What is the One Step Website?

When Morse first tried to use the Ellis Island website, he was frustrated by the inability to accomplish a powerful search in a single step. Some basic tasks took multiple, lengthy searches. Out of necessity he created a "One Step" Ellis Island search form that issued one or more searches to the Ellis Island website, combining and filtering the results. He soon began sharing his One Step Ellis Island web page with others, allowing researchers to search the Ellis Island website from his website in ways not possible on the Ellis Island website.

Soon, Morse had established an entire collection of One Step web pages that provided One Step searches of databases on EllisIsland.org, CastleGarden.org, Ancestry.com and more. He allows anyone to use these "One Step Webpages" at his website, http://www.stevemorse.org, without charge as a public service. And indeed, thousands use his website every day. Over the years hundreds of thousands have benefited from his wonderful site, including the Insider himself.

What's the Harm in That?

In July, users of Morse's One Step searches of New York Naturalization records on the websites of the Italian Genealogical Group (IGG) and the German Genealogy Group (GGG) found their searches blocked and began to complain about the IGG and the GGG.

"We are not the bad guys," responded June C. DeLalio, a founding member of the IGG. "Neither IGG nor GGG blocked people from accessing our databases through Steve Morse's site." DeLalio explained that the ISP they were paying to host their websites blocked Morse. Morse's traffic overwhelmed their server. "Our ISP provider crashed which affected not only our two organizations but all of his other clients. He therefore banned Morse's access."

What is a Denial of Service Attack?

The need to block someone is not unusual for an ISP. When a website is hit with a large number of requests in a very short time frame, it can overwhelm, disable and even crash the web server. Server operators must have the ability to block a stream of requests that could bring their servers down. When bad guys try to overwhelm a server, it is called a denial-of-service (DoS) attack. It can also happen accidentally, with no malevolence intended; Wikipedia calls this an unintentional attack. "This can happen when an extremely popular website posts a prominent link to a second, less well-prepared site."

If you were an early user of PAF Insight or if you are a frequent user of FamilySearch.org, there is a good chance you have experienced being blocked. Did PAF Insight ever give you this message?

Family Search access temporarily suspended
Access has been temporarily suspended
by the Family Search server due to overloading.

Or has FamilySearch.org ever given you this message?

Your access to FamilySearch.org has been temporarily suspended. Your computer or a computer at your location has sent a large volume of requests to FamilySearch.org and as a result access to FamilySearch.org from this location has been temporarily suspended. Usually the suspension occurs when a computer uses software to automate requesting information from FamilySearch.org from any computer in your location. If this message appears again, contact the vendor who sold you your software to obtain the latest release.

The first version of PAF Insight caused an unintentional attack on FamilySearch.org by sending a large volume of requests almost simultaneously to FamilySearch.org for each name searched. FamilySearch.org now blocks traffic coming from the early versions. In later versions Ohana software cut the number of requests per search to a level that FamilySearch.org could handle.
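
To make the blocking idea concrete, here is a minimal sketch of a per-client request limiter. It is an illustration only: we have no knowledge of how FamilySearch.org or any ISP actually decides whom to block, and the class name, the limits and the example addresses are all assumptions.

```python
from collections import defaultdict, deque


class RequestLimiter:
    """Block a client that sends more than max_requests within window seconds."""

    def __init__(self, max_requests, window):
        self.max_requests = max_requests
        self.window = window
        self.history = defaultdict(deque)  # client id -> recent request times

    def allow(self, client, now):
        times = self.history[client]
        # Forget requests that have aged out of the window.
        while times and now - times[0] >= self.window:
            times.popleft()
        if len(times) >= self.max_requests:
            return False  # over the limit: suspend access
        times.append(now)
        return True


limiter = RequestLimiter(max_requests=5, window=1.0)
# A burst of 8 requests within a tenth of a second: only the first 5 pass.
burst = [limiter.allow("192.0.2.1", now=0.01 * i) for i in range(8)]
# One request per second from another client is never blocked.
steady = [limiter.allow("192.0.2.2", now=float(i)) for i in range(8)]
print(burst)
print(steady)
```

A human clicking through search pages looks like the steady client. Software that fires many requests almost simultaneously looks like the burst.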

Did One Step Mis-step?

As for One Step harming IGG/GGG, DearMYRTLE suggests that one must consider another possibility. "The ISP would have been more overwhelmed if the searches had not gone through Steve Morse's more efficient search engine. Why? Without Steve Morse's search engines, people typically spend more frustrating hours..., thereby truly bogging down ISP servers."

There are a couple of ways Morse can implement a One Step search. One is non-intrusive and one can have dramatic effects on a website. To demonstrate, let's try some Social Security Death Index (SSDI) searches using Morse's One Step SSDI.

Method 1: Build the URL

For a Build-the-URL search, Morse builds a URL and calls the target website. To see an example of the Build-the-URL method, follow these instructions.

  • Go to http://stevemorse.org/ssdi/ssdi.html
  • Set "First Name" to "starts with" and enter "S"
  • Set "Last Name" to "is exactly" and enter "Morse"
  • Set "hits/page" to "100"
  • Set "Search Engine to use:" to "RootsWeb"
  • Click Search
  • When asked "Searches involving... Do you wish to continue?", click "OK"

The results look something like the image below. The first thing to notice is that the URL is for http://ssdi.genealogy.rootsweb.com/. The important part is "rootsweb.com." The One Step webpage built a URL for RootsWeb that accomplished what you specified, and then requested that URL from RootsWeb for you. Notice that RootsWeb returned 17 results and bothered us with two advertisements, for which RootsWeb got paid.

Screen Image of Build-the-URL Search

If the link on Morse's website brings no new searchers to the target website, then a Build-the-URL search might decrease the total number of searches as DearMYRTLE suggested. But if Morse's website is really popular and really well known (which it is) and brings lots and lots of new searchers to a small, previously unknown website, then the target website can be—quite unintentionally—brought down.
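
For the technically inclined, a Build-the-URL search can be sketched in a few lines. This is not Morse's code, and the host name and parameter names below are hypothetical stand-ins rather than RootsWeb's real ones. The point is only that the form fields get packed into a single URL, and the target website receives exactly one request.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "http://search.example.org/ssdi"


def build_search_url(first_name, last_name, hits_per_page=100):
    """Pack the search form's fields into one query string."""
    params = {
        "firstname": first_name,
        "lastname": last_name,
        "hits": hits_per_page,
    }
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("S", "Morse")
print(url)
```

The browser then requests that URL directly from the target site, which is why the address bar shows the target's domain rather than stevemorse.org.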

Method 2: Automated Agent

For an Automated-Agent search, Morse's server stands between the user's computer and the target server, requesting multiple pages that the user doesn't see, filtering and aggregating the results before returning them to the user.

To see an example of an Automated-Agent search, change the "Search Engine to use:" to "RootsWebPlus" and repeat the same search as before.

The results look very similar, as shown in the next image. Notice, however—and this is very important—that the URL is for http://stevemorse.org/. What's going on here? It certainly looks like RootsWeb. But in fact, it is not. We sent a request to Morse's website, which made an unknown number of requests to RootsWeb, aggregated the results and returned them to us.

The situation has some technical characteristics similar to Ancestry's botched Internet Biographical Collection. (Hopefully, Morse has RootsWeb's permission, making the two situations legally different.) Advertisements are the first casualties. As we shall see, hundreds of advertisements have been removed, cheating advertisers if they pay for them and cheating RootsWeb if they don't.

Screen Image of Automated-Agent Search

Scroll through the page and you will find that instead of 17 results, there are 100 followed by a "NEXT" link. For this page alone, Morse's website sent 41 search requests to RootsWeb. (See the sidebar, "Counting the Requests," to learn how to figure out the number of requests made behind the scenes.) The 41 requests hit RootsWeb much faster than a human could send them.
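
Here is a toy sketch of how an Automated-Agent search might work. It is our guess at the general shape, not Morse's actual code. The FakeTarget class stands in for the target website (a real agent would issue HTTP requests), and the five given names are made up so that the page fills at the prefix "SAM".

```python
from itertools import product

ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # space sorts before A


class FakeTarget:
    """Stand-in for the target site; a real agent would issue HTTP requests."""

    def __init__(self, given_names):
        self.given_names = given_names
        self.requests = 0  # how many searches the target has had to serve

    def search(self, prefix):
        # Match on the first three characters of the given name, space-padded.
        self.requests += 1
        return [g for g in self.given_names if g[:3].upper().ljust(3) == prefix]


def one_step_search(target, initial, page_size):
    """Query prefix after prefix, merging hits until one page is full."""
    page = []
    for second, third in product(ALPHABET, repeat=2):
        page.extend(target.search(initial + second + third))
        if len(page) >= page_size:
            break
    return page[:page_size]


target = FakeTarget(["S", "Sam", "Samuel E", "Sarah", "Seymour N"])
page = one_step_search(target, "S", page_size=3)
print(page, target.requests)
```

The user sees one page of three results. The target, in this toy setup, has served 41 searches to produce it.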

Counting the Requests

Skip this sidebar if you're not interested in the nitty-gritty details of calculating how many search requests are made to RootsWeb for each page of the RootsWebPlus example.

Since RootsWeb doesn't allow wildcards in the first three characters, to get all given names starting with S, stevemorse.org must ask for all the name combinations from "SAA" to "SZZ". That would be 676 different search requests.

There is also the single initial S followed by a space and a middle name or initial. (Using a dash in the middle to represent a space between two letters, we can represent this as "S–A" to "S–Z"). That is another 26 combinations.

And there's the possibility of two-letter first names (which we can represent as "SA–" to "SZ–"), another 26 combinations. Finally, there is the single initial "S" ("S––"). That brings the total to 729 combinations (676 + 26 + 26 + 1).

In practice, one can treat the space character as a letter of the alphabet coming before A. The series then becomes
"S––", "S–A", "S–B", "S–C", ..., "S–Z", (initial "S")
"SA–", "SAA", "SAB", "SAC", ..., "SAZ", (starts with SA)
"SB–", "SBA", "SBB", "SBC", ..., "SBZ", (starts with SB)
... (starts with SC through SY)
"SZ–", "SZA", "SZB", "SZC", ... "SZZ" (starts with SZ)

That makes 27 rows each with 27 combinations, which is, again, 729 total combinations.
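
If you'd rather let a computer do the counting, the series can be generated with a short script. This is only a sketch of the arithmetic, treating the space as a 27th letter that sorts before A. The function name is ours.

```python
from itertools import product

ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # space sorts before A


def prefixes(initial, length=3):
    """All prefixes of the given length that start with the initial, in order."""
    return [initial + "".join(rest) for rest in product(ALPHABET, repeat=length - 1)]


combos = prefixes("S")
letters_only = [c for c in combos if " " not in c]
print(len(combos))        # 27 rows of 27 combinations
print(len(letters_only))  # "SAA" through "SZZ"
```

The first entry is "S" followed by two spaces and the last is "SZZ", matching the rows listed above.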

The first page starts with "S Morse" and ends with "Samuel E Morse". That is all 27 combinations from "S––" to "S–Z" and another 14 combinations from "SA–" to "SAM". That's a total of 41 combinations, so it takes 41 searches.

The second page goes from "Samuel" to "Seymour N". That's 13 searches for the rest of the "SA" combinations, 27 for the "SB" combinations, 27 for the "SC"s, 27 for the "SD"s and 26 (all but "SEZ") from the "SE" combinations. That's 120 searches for page 2 (13+3*27+26)!!

The remaining pages can be analyzed in similar fashion. Page 3 goes from "Seymour" to "Sophie", which takes 262 searches (2+9*27+17). If you add all the searches from all the pages you'll get 728 plus the number of pages. That's more than the 729 predicted. Can you see why? If you look at the URL on each page, can you figure out how Morse knows where to start each page? On several pages look at the little table above the search results. With what you've learned, does the value for the First Name make sense now?
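
The per-page counts in this sidebar can be double-checked with a little position arithmetic. The sketch below assumes the page boundaries implied by the sums above: page 1 runs from the bare initial to "SAM", page 2 from "SAN" to "SEY", and page 3 from "SEY" to "SOP". The function names are ours.

```python
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # space sorts before A


def position(prefix):
    """Index of a three-character prefix in the series, counting from zero."""
    second, third = prefix[1], prefix[2]
    return ALPHABET.index(second) * 27 + ALPHABET.index(third)


def searches_between(first, last):
    """Number of prefixes from first to last, inclusive."""
    return position(last) - position(first) + 1


page1 = searches_between("S  ", "SAM")
page2 = searches_between("SAN", "SEY")
page3 = searches_between("SEY", "SOP")
print(page1, page2, page3)
```

This prints 41, 120 and 262, the same figures worked out by hand above.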

Click the NEXT link. Notice how much longer the second page takes to load? That's because stevemorse.org made 120 search requests to RootsWeb! The third page makes 262 requests to RootsWeb. Across all pages, more than 729 search requests will be made to service your one search.

Just like PAF Insight, an Automated-Agent search can send a significant number of searches at an incredibly fast rate to the website being searched and, if enough users do it, can bring down a site as large as FamilySearch.org.

Trouble With Ancestry?

Some of you may remember Morse used to provide Automated-Agent searches of various Ancestry databases. The Insider used stevemorse.org whenever he needed to bypass Ancestry's wildcarding rules. Sadly, this feature went away. We can only assume Ancestry asked him to eliminate this capability because:

  • It was a neat feature that we don't think Morse would have taken away unless it was requested.
  • The feature required the user to break the Ancestry Terms and Conditions by disclosing their username and password.
  • The feature's instructions disclosed information that allowed misappropriation of Ancestry's cookies.
  • A wildcard search such as "S*" would have hit Ancestry's servers with 729 near-simultaneous searches. (We don't remember for sure, but we don't think you could do a 3-character wildcard "???" that would issue 19,683 searches.)
  • Advertisements would have been suppressed. Either the ads would be "served" and the advertisers would be cheated because no one viewed ads they paid for, or the ads would not be served and Ancestry would have been cheated of the ad revenue. (Wouldn't it be ironic if Ancestry complained to Morse over this issue and then turned around and did it to others with the Internet Biographical Collection? No one has shorter memories than corporations.)
  • Having pages coming from the stevemorse.org domain that looked and felt like Ancestry pages would have been a copyright violation.
  • On the other hand, serving up results without acknowledging the owner is probably misappropriation of intellectual property. (This and the last item are a catch-22. You're damned if you do and damned if you don't.)

Conclusion

Steve Morse has been enhancing and optimizing his website for a number of years. He's had to adapt and comply with the needs of the websites for which he provides enhanced searches. He's proven himself ethical and accommodative. While it is entirely possible—even highly probable—that his website had a negative effect on the Italian Genealogical Group and the German Genealogy Group websites, we have every confidence that Morse will surely cooperate with the website owners and quite possibly collaborate with them in ways only he can.

2 comments:

  1. I would be happy to hear about any solution that could be acceptable for the parties involved. Data provided in XML format? Special defined protocols?

  2. It appears to me that Morse takes the simplest and most direct approach: communicate.

    The Italian Genealogical Group didn't like Morse bypassing their plea for contributions. Now his webpage includes the message, "If you'd like to support the Italian genealogical Group in this work, click here."

    Beyond that, both IGG and GGG give errors now when one does a One Step search of their data without first having seen certain pages of their websites. I assume Morse assisted them in adding this technology to their websites.

    We read somewhere (apologies to that blogger) that Morse had also helped them add "sounds like" searches to their website.

    Communication is the key.

    Contrast this with Ancestry's Internet Biographical Collection (IBC). Their standard MO of developing in secrecy gave them no protection against misjudging their market. As a result, the IBC went down in flames.

    Radix, thanks for your comment.

    --The Insider
