[Resolved] Partial outage on Xen cloud in Amsterdam 1 datacenter

A power distribution unit (PDU) failure at our Amsterdam 1 datacenter this morning took out about a quarter of our Xen hosts there. Engineers are en route to replace it.

In the meantime we have rebooted all affected virtual servers using spare capacity on the remaining Xen hosts. No data has been lost, as our redundant centralised storage servers were not affected.

Our customer portal, the shared.anu.net Lasso/PHP hosting server and a handful of customer VMs were briefly affected by the outage but have all now been restored.

SpamTitan TLS negotiation issue

It came to our attention this morning that email from Gmail accounts was not getting through to our Hosted Email customers. After running every test we could think of, we discovered that Gmail’s servers were failing TLS negotiation with SpamTitan and not falling back to unencrypted SMTP. We have temporarily disabled TLS support on our SpamTitan server and mail is now coming through again. Any messages sent from Gmail since we restored SpamTitan yesterday morning will be delivered to your inboxes shortly.

This issue may also have affected some other email providers, though Gmail is the only major provider we are aware of that was behaving this way.
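For readers curious what "falling back to unencrypted SMTP" means in practice, here is a minimal sketch of the behaviour we would normally expect from a sending server: try STARTTLS, and if the handshake fails, reconnect and deliver without encryption. This is an illustration only, not Gmail's or SpamTitan's actual code. The function name `deliver`, its return labels and the `smtp_factory` parameter are hypothetical names chosen for this example.

```python
import smtplib
import ssl


def deliver(host, sender, recipient, message, port=25, smtp_factory=smtplib.SMTP):
    """Send one message, preferring STARTTLS but falling back to plaintext.

    Returns "tls", "plaintext" or "plaintext-fallback" (hypothetical labels)
    to show which path was taken.
    """
    # First attempt: upgrade the connection if the server advertises STARTTLS.
    smtp = smtp_factory(host, port, timeout=30)
    try:
        smtp.ehlo()
        if not smtp.has_extn("starttls"):
            # TLS is not offered at all, so plain SMTP is the only option.
            smtp.sendmail(sender, [recipient], message)
            return "plaintext"
        try:
            # Assumption: the default context verifies certificates; many
            # mail servers use a more lenient context for opportunistic TLS.
            smtp.starttls(context=ssl.create_default_context())
            smtp.ehlo()  # capabilities must be re-read after the upgrade
        except (ssl.SSLError, smtplib.SMTPException, OSError):
            pass  # negotiation failed; retry below on a fresh connection
        else:
            smtp.sendmail(sender, [recipient], message)
            return "tls"
    finally:
        smtp.close()

    # Second attempt: a new connection with no TLS. This is the step that
    # was not happening for the affected mail.
    smtp = smtp_factory(host, port, timeout=30)
    try:
        smtp.ehlo()
        smtp.sendmail(sender, [recipient], message)
        return "plaintext-fallback"
    finally:
        smtp.close()
```

A sender that skips the second attempt simply queues or bounces the message when the handshake fails, which matches what we observed.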

Review of backup & disaster recovery capabilities for key systems

To date we’ve had very little downtime on our SpamTitan and Hosted Email systems. We feel our solid hardware and software infrastructure has played a strong role here, but we also realise that failures do happen no matter how robust the systems are. So today as part of our post-disaster recovery analysis of yesterday’s SpamTitan server failure, we are taking a few minutes to review our backup and DR strategies for all our critical systems, in a bid to highlight areas where improvements can be made and to help us plan accordingly.

We feel this information deserves to be made public as our customers rely on us to provide reliable service and to have a plan for when the inevitable happens and something goes wrong.

Hosted Email: mail is stored on a CentOS Linux virtual machine on our Xen platform. The storage for the VM is on a hardware RAID array with redundant disks. The VM is backed up daily to our central on-site backup server.

Planned upgrades: install a 2nd mail server and replicate the mail store in near real time, for example using DRBD. This will provide faster disaster recovery capability and decreased risk of data loss due to hardware failure, such as the total failure that happened yesterday on our SpamTitan server. Email is high volume and very valuable data, which we feel warrants additional redundancy.

SpamTitan: hosted on a dedicated server with daily backup of all configuration data to central backup server.

Planned upgrades: we have a 2nd SpamTitan server on order which we will keep at the ready for disaster recovery.

DNS: We operate two distinct DNS hosting systems, our standalone BIND servers and the DNS service integrated into our DirectAdmin shared hosting system.

Our standalone BIND servers ns1.anu.net, ns2.anu.net and ns3.anu.net are geographically distributed and each server is backed up daily. The DirectAdmin DNS service has an off-site slave server and both servers are backed up daily.

Virtual Servers: as standard we provide a daily backup of all virtual servers to our backup servers in the Amsterdam 1, Amsterdam 2 and Chicago datacenters. The backups are designed for disaster recovery, not data retention, and are updated daily. Many customers have opted to purchase additional backup virtual servers, which provide disaster recovery capabilities and a 30-day backup history that allows recovery of accidentally lost or modified files and databases (see related blog post on cwik.ch). This service is configured individually depending on customer requirements.

Shared Hosting: our DirectAdmin shared hosting system operates from dedicated hardware. The physical server has a hardware RAID controller and data is stored on a 4-SSD RAID 10 array, which can withstand the failure of any single drive, and of two drives provided they are not in the same mirrored pair. All data is backed up on the hour to our central backup server, which keeps 4 snapshots per day plus a 30-day snapshot history. Our disaster recovery plan, should the hardware fail, is to restore service on a Xen virtual machine until the hardware can be repaired or replaced.
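To make the fault tolerance of a 4-drive RAID 10 array concrete, the short sketch below checks which combinations of failed drives the array survives. It assumes the usual layout of two mirrored pairs striped together. The drive numbering and the pairing (0, 1) and (2, 3) are illustrative assumptions, not the actual layout of our controller.

```python
from itertools import combinations

# Assumed layout: drives 0 and 1 mirror each other, as do drives 2 and 3.
MIRROR_PAIRS = ((0, 1), (2, 3))


def raid10_survives(failed, mirror_pairs=MIRROR_PAIRS):
    """Return True if every mirrored pair still has at least one working drive."""
    failed = set(failed)
    return all(not set(pair) <= failed for pair in mirror_pairs)


# Count how many of the possible two-drive failures the array survives.
two_drive_failures = list(combinations(range(4), 2))
survivable = [pair for pair in two_drive_failures if raid10_survives(pair)]
print(f"{len(survivable)} of {len(two_drive_failures)} two-drive failures are survivable")
```

Under these assumptions any single failure is survivable, and so are 4 of the 6 possible two-drive failures. The array is lost only when both drives of the same mirrored pair fail.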

VPN: each of our datacenters has a PPTP VPN server hosted on a physical server. The VPN account data is backed up daily. We do not have failover systems, but should the hardware fail we can manually restore VPN services from the backed-up data within an hour.

Web site & Customer Portal: our customer-facing Web systems are hosted on Xen virtual machines, which are backed up daily to our central backup servers. Should the Xen host’s hardware fail, the DR plan is the same as for any Xen VM: restore to another Xen host from the backup server.