Status Update: 9/11/2013
So, that stunk.
Yesterday, some of our customers lost access to their event sites during the afternoon. We had a database server fail, and fail such that it wouldn’t restart. We rebuilt the database on another server, and restored access to those sites as of 8:28pm Pacific last night.
We believe that the affected customers may have lost data entered in the three minutes before the crash; please check any work you may have been doing during that time.
What are we doing to prevent this from happening in the future?
Six months ago, we started a rolling upgrade of our servers and software; it has generally been going quite well. The database server that failed yesterday was scheduled to be replaced next week - the new server is already installed in the data center and “burned in.” We did not move the database to that new server during the recovery yesterday; we instead went to the server that was part of the backup plan already in place. But we plan to accelerate the move to the new server as soon as possible.
Consequently, we will be scheduling a maintenance window this week; we will let everyone know the date and time as soon as it is scheduled.