An Apology for Yesterday’s Outage at our Dallas, TX Data Center

Posted on 16 Jul 2015 by Adam

As some of you may know, we experienced a power outage at one of our US-based datacenters located in Dallas, TX. This affected those with dedicated servers at this location, along with clients on the following shared/reseller servers:

rs37.abstractdns.com
rs41.abstractdns.com
rs42.abstractdns.com
server41.abstractdns.com
server42.abstractdns.com

This type of incident is extremely rare for us, and although we have redundant infrastructure (power & network) in place at this location, the outage was more complex than most, as detailed below.

What happened?

A power outage, believed to have been caused by a short circuit, occurred at one of our US-based datacenters located in Dallas, TX at approx. 5:30AM CDT on Wednesday 15th July.

A backup power supply was in place (which is why we advertise this on our website); however, it failed, and we’re still awaiting further details on the cause.

This impacted 3 of our 5 core switches on-site. Two were down for approximately an hour, but the final switch required a manual boot command via KVM.

Most of the problems experienced by our clients related to this failed switch, as the DC responded slowly to our requests to physically inspect the switch, reboot it, attach the KVM etc. We usually receive quick responses from this specific datacenter; however, we believe the slow response in this case was due to the overwhelming number of requests from other clients having similar problems.

The issue with the final switch was resolved at 6PM CDT, when the datacenter staff finally attached the KVM device to our switch. This allowed us to reboot it immediately and restore network connectivity to this specific rack, so our clients could access their servers again.

Whilst this was going on, our senior technical team made preparations for a ‘plan B’, whereby we’d physically move the servers to another rack with a backup switch. However, waiting for the DC staff was expected to be much quicker than co-ordinating the physical move of servers at such a time.

We did all we could to resolve this incident as quickly as possible, but the slow response of the DC staff delayed the resolution.

What will you be doing to prevent similar incidents?

To ensure we don’t experience issues like this again, we’ll be working with the datacenter on a plan going forward, and will also consider installing additional switches for maximum possible redundancy.

We’re also looking at ways of better communicating with our clients regarding downtime incidents (scheduled and unscheduled). This will likely involve launching a new status page, making it easier to get notified.

Any clients affected by these issues should contact our support department by submitting a ticket.