This past weekend a major failure in a Utah data center brought down hundreds of websites, including at least one genealogy site. With the St. George Family History Expo starting tomorrow, the Family History Expos website (www.fhexpos.com) has been down all week. A notice on their website assures attendees that the conference will go forward as planned, but if you wish to register at the door, bring check or cash. Their credit card software was on their website and they may not have it up again by the time the conference begins.
A test of the fire system at Consonus Technologies’ Utah data center went awry, flooding the room with fire suppressant, crashing and damaging hundreds or thousands of servers. Back in January I toured the Ancestry.com data center along with other attendees of Ancestry.com Bloggers Day. We were shown their fire suppression system and warned to evacuate immediately if the klaxon sounded.
The exact number and identities of the affected websites are unknown. One Consonus customer is Westhost, a web hosting company. A web hosting company provides various web-related services such as websites, Internet connectivity, email, web addresses, and e-commerce services. Another Consonus customer is vps.net, a new type of hosting company that provides a service called cloud computing. Each of these web hosting companies may have hundreds or even thousands of customers.
A FamilySearch Indexing user reported problems in the hours following the failure, but I don’t know if they were related. (Fortunately, I don’t know, otherwise I wouldn’t be able to speculate for you.) User donaldbuncie reported experiencing an identity verification problem followed by “error starting the application” messages. Later on Monday a moderator stated, “Those problems from today and this weekend should be resolved now, you can try those projects again. Thanks for your patience!”
Documents obtained exclusively by the Ancestry Insider give a detailed view of the failure and subsequent events. (All times are Mountain Standard Time.)
Saturday, 20 February 2010
- 8:50 AM – Fire suppression system disarmed. Fire system placed into maintenance mode.
- 9:10 AM – Pre-test checklist executed.
- 9:20 AM – Testing begins.
- 2:18 PM – Nine of ten actuators are disabled preparatory to rearming the fire suppression system. One is inadvertently overlooked.
- 2:21 PM – Fire suppression system is rearmed and immediately activates, flooding the data center with fire suppressant and damaging servers and disk drives.
- 2:45 PM – Power failure detected by customers.
- 3:05 PM – Consonus reported the failed fire test to customers.
- 3:08 PM – Power restored.
- 4:14 PM – vps.net confirmed disk drive failures, but redundant disks prevented data loss. Technicians began replacing failed disk drives.
Sunday, 21 February 2010
- 6:55 PM – Consonus’ official incident report relayed by a customer to its end users.
Monday, 22 February 2010
- 9:00 AM – Westhost reported replacement of damaged hardware. Majority of servers back online. Some data loss will require restoration from tape backup.
- 5:13 PM – Westhost reported all services restored except 6 dedicated servers and 12 shared servers.
Tuesday, 23 February 2010
- 1:25 PM – Westhost reported that restoration from tape backup has been hampered because the backup servers were also damaged. Ten shared servers and some dedicated servers are still down.
- 9:54 PM – In Westhost’s last report they stated that a few dedicated servers would be restored in the next 4-6 hours.
Wednesday, 24 February 2010
- 11:00 PM – Family History Expos updated the notice on their website. Their web hosting service reported that they are still having problems restoring backups. Family History Expos lamented that they “are probably last on their list” but will make the St. George Expo “a GREAT event and you should come and see for yourself.”