Thursday, February 25, 2010

Major Failure of Utah Computer Center

[Image: Fire suppression system in Ancestry.com data center]

This past weekend a major failure in a Utah data center brought down hundreds of websites, including at least one genealogy site. With the St. George Family History Expo starting tomorrow, the Family History Expos website (www.fhexpos.com) has been down all week. A notice on their website assures attendees that the conference will go forward as planned, but if you wish to register at the door, bring check or cash. Their credit card software was on their website and they may not have it up again by the time the conference begins.

A test of the fire system at Consonus Technologies’ Utah data center went awry, flooding the room with fire suppressant, crashing and damaging hundreds or thousands of servers. Back in January I toured the Ancestry.com data center along with other attendees of Ancestry.com Bloggers Day. We were shown their fire suppression system and warned to evacuate immediately if the klaxon sounded.

The exact number and identities of affected websites are unknown. One Consonus customer is Westhost, a web hosting company. A web hosting company provides various web-related services such as websites, Internet connectivity, email, web addresses, and e-commerce services. Another Consonus customer is vps.net, a new type of hosting company that provides a service called cloud computing. Each of these web hosting companies may have hundreds or even thousands of customers.

A FamilySearch Indexing user reported problems in the hours following the failure, but I don’t know if they were related. (Fortunately, I don’t know, otherwise I wouldn’t be able to speculate for you.) User donaldbuncie reported experiencing an identity verification problem followed by “error starting the application” messages. Later on Monday a moderator stated, “Those problems from today and this weekend should be resolved now, you can try those projects again. Thanks for your patience!”

Detailed Timeline

Documents obtained exclusively by the Ancestry Insider give a detailed view of the failure and subsequent events. (All times are Mountain Standard Time.)

Saturday, 20 February 2010

  • 8:50 AM – Fire suppression system disarmed. Fire system placed into maintenance mode.
  • 9:10 AM – Pre-test checklist executed.
  • 9:20 AM – Testing begins.
  • 2:18 PM – Nine of ten actuators are disabled preparatory to rearming the fire suppression system. One is inadvertently overlooked.
  • 2:21 PM – Fire suppression system is rearmed and immediately activated, flooding the data center with fire suppressant. The fire suppressant damaged servers and disk drives.
  • 2:45 PM – Power failure detected by customers.
  • 3:05 PM – Consonus reported the failed fire test to customers.
  • 3:08 PM – Power restored.
  • 4:14 PM – vps.net confirmed disk drive failures, but redundant disks prevented data loss. Technicians began replacing failed disk drives.

Sunday, 21 February 2010

  • 6:55 PM – Consonus’s official incident report relayed by a customer to its end users.

Monday, 22 February 2010

  • 9:00 AM – Westhost reported replacement of damaged hardware. Majority of servers back online. Some data loss will require restoration from tape backup.
  • 5:13 PM – Westhost reported all services restored except 6 dedicated servers and 12 shared.

Tuesday, 23 February 2010

  • 1:25 PM – Westhost reported that restoration from tape backup has been hampered because the backup servers were also damaged. Ten shared servers and some dedicated servers are still down.
  • 9:54 PM – In Westhost’s last report they stated that a few dedicated servers would be restored in the next 4-6 hours.

Wednesday, 24 February 2010

  • 11:00 PM – Family History Expos updated the notice on their website. Their web hosting service reported that they are still having problems restoring backups. Family History Expos lamented that they “are probably last on their list” but will make the St. George Expo “a GREAT event and you should come and see for yourself.”

2 comments:

  1. Fire suppression products called 'Clean Agents' are not mislabeled, or a fallacy; they are clean. They do not damage the products they are intended to protect. Major testing organizations verify this attribute before they are ever put on the market.

    The cause of this problem is fire equipment organizations performing maintenance and testing on systems that they are not certified to service. The service company failed, not the system. This sounds like a service company trying to blame a product for their errors.

    If you’re a professional service company, how can you not disconnect all of the cylinders? This is a violation of the most elementary aspects of Fire Service 101. There are many instances of successful discharges where damage was averted by the clean agent, not caused by it.

  2. I agree with Bob. We have never had a loss due to a discharge of agent. In all cases, any and all damage has been done by a technician or human error. There is a reason why there are distributors for different agent and system types. Distributors are required to go to schools to properly learn the equipment that they are working on. Fire Engineering Company is a leading Fike distributor, and would like the opportunity to help in any way possible.

    Troy Kinder
    Fire Engineering Co.
    801-703-1466
