Review of backup & disaster recovery capabilities for key systems

To date we’ve had very little downtime on our SpamTitan and Hosted Email systems. We feel our solid hardware and software infrastructure has played a strong role here, but we also realise that failures do happen no matter how robust the systems are. So today, as part of our post-disaster recovery analysis of yesterday’s SpamTitan server failure, we are taking a few minutes to review our backup and DR strategies for all our critical systems, to highlight areas where improvements can be made and to help us plan accordingly.

We feel this information deserves to be made public as our customers rely on us to provide reliable service and to have a plan for when the inevitable happens and something goes wrong.

Hosted Email: mail is stored on a CentOS Linux virtual machine on our Xen platform. The storage for the VM is on a hardware RAID array with redundant disks. The VM is backed up daily to our central on-site backup server.

Planned upgrades: install a 2nd mail server and replicate the mail store in near real time, for example using DRBD. This will provide faster disaster recovery capability and decreased risk of data loss due to hardware failure, such as the total failure that happened yesterday on our SpamTitan server. Email is high volume and very valuable data, which we feel warrants additional redundancy.

SpamTitan: hosted on a dedicated server with daily backup of all configuration data to central backup server.

Planned upgrades: we have a 2nd SpamTitan server on order which we will keep at the ready for disaster recovery.

DNS: We operate two distinct DNS hosting systems, our standalone BIND servers and the DNS service integrated into our DirectAdmin shared hosting system.

Our standalone BIND servers ns1.anu.net, ns2.anu.net and ns3.anu.net are geographically distributed and each server is backed up daily. The DirectAdmin DNS service has an off-site slave server and both servers are backed up daily.

Virtual Servers: as standard we provide a daily backup of all virtual servers to our backup servers in our Amsterdam 1, Amsterdam 2 and Chicago datacenters. The backups are designed for disaster recovery, not data retention, and are updated daily. Many customers have opted to purchase additional backup virtual servers, which provide disaster recovery capabilities and a 30-day backup history that enables recovery of accidentally lost or modified files/databases (see related blog post on cwik.ch). This service is configured individually depending on customer requirements.
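
To illustrate what a 30 day backup history means in practice, here is a minimal Python sketch. It is not our actual backup tooling: the function name, the daily schedule and the exact cutoff rule are assumptions made for the example. It splits a list of backup dates into those still inside the retention window and those old enough to be pruned.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # length of the backup history; the cutoff rule below is an assumption


def split_by_retention(backup_dates, today, retention_days=RETENTION_DAYS):
    """Return (keep, prune): backups inside the retention window and older ones."""
    cutoff = today - timedelta(days=retention_days)
    # A backup taken exactly retention_days ago is treated as expired (assumption).
    keep = sorted(d for d in backup_dates if d > cutoff)
    prune = sorted(d for d in backup_dates if d <= cutoff)
    return keep, prune


if __name__ == "__main__":
    today = date(2013, 10, 2)
    # Hypothetical data: one backup per day for the last 40 days.
    backups = [today - timedelta(days=i) for i in range(40)]
    keep, prune = split_by_retention(backups, today)
    print(f"keeping {len(keep)} backups, pruning {len(prune)}")
```

With one backup per day, a window like this always leaves 30 restore points, so a file deleted by mistake three weeks ago can still be recovered.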

Shared Hosting: our DirectAdmin shared hosting system operates from dedicated hardware. The physical server has a hardware RAID controller and data is stored on a 4-SSD RAID 10 array, which can withstand the failure of any single drive, and of a second drive provided it is in the other mirror pair. All data is backed up on the hour to our central backup server, which keeps 4 snapshots per day plus a 30 day snapshot history. Our disaster recovery plan, should the hardware fail, is to restore service on a Xen virtual machine until the hardware can be repaired or replaced.
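
To make the fault tolerance of a four-drive RAID 10 array concrete, here is a minimal Python sketch. It assumes the usual layout of two mirrored pairs striped together; the drive names are hypothetical and nothing here is taken from our actual controller configuration. For each number of failed drives, it counts how many failure combinations leave the array readable.

```python
from itertools import combinations

# Assumed layout: four drives in two mirrored pairs, striped together (RAID 10).
MIRROR_PAIRS = [("ssd0", "ssd1"), ("ssd2", "ssd3")]


def array_survives(failed):
    """The array survives while every mirror pair keeps at least one working drive."""
    failed = set(failed)
    return all(any(drive not in failed for drive in pair) for pair in MIRROR_PAIRS)


def survival_count(num_failures):
    """Return (surviving, total) over all combinations of num_failures failed drives."""
    drives = [drive for pair in MIRROR_PAIRS for drive in pair]
    combos = list(combinations(drives, num_failures))
    surviving = sum(1 for combo in combos if array_survives(combo))
    return surviving, len(combos)


if __name__ == "__main__":
    for n in range(1, 4):
        surviving, total = survival_count(n)
        print(f"{n} failed drive(s): array survives in {surviving} of {total} cases")
```

Under these assumptions any single drive failure is survivable, as are four of the six possible two-drive failures (those that hit different mirror pairs). Losing both drives of one pair, or any three drives, takes the array down, which is why the hourly backups matter as well.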

VPN: each of our datacenters has a PPTP VPN server hosted on a physical server. The VPN account data is backed up daily. We do not have failover systems but can manually restore VPN services from backed up data within an hour should the hardware fail.

Web site & Customer Portal: our customer-facing Web systems are hosted on Xen virtual machines, which are backed up daily to our central backup servers. Should the Xen host’s hardware fail, the DR plan is the same as for any Xen VM: restore to another Xen host from the backup server.

[Resolved] SpamTitan email filtering outage

Update 11:45 CET: full service restored, SpamTitan is back in action!

Update 11:00 CET: while SpamTitan continue the restore process, we have put in rudimentary spam and virus filtering on our backup system to cut out the most obvious spam and viruses.

Update 2013-10-01 10:10 CET: SpamTitan support are on the case, shouldn’t be much longer now.

Update 23:30 CET: replacement hardware is in place and SpamTitan is running. However, it appears our backup may have grown too large to restore via the Web GUI. Waiting for vendor support to proceed.

Update 20:55 CET: our technician is still working on the hardware, but it’s not looking promising. Time for plan B, a new install on alternate hardware. We do maintain a daily backup of the SpamTitan configuration so will be able to restore service, but any false positives stuck in quarantine may be lost.

11:00 CET: We are currently experiencing an outage on our spam filtering system SpamTitan. This affects incoming email for most of our Hosted Email customers, and customers with their own mail servers for whom we are providing spam filtering.

The root cause is a hardware failure. We have an engineer en route with replacement hardware, but the ETA for a fix is not before 18:30 CET. In the meantime we have implemented a bypass system for incoming email so messages will be delivered, but without being filtered. You may therefore notice a large increase in the amount of spam arriving in your inbox today.

We apologise for the inconvenience caused.

If you are experiencing bounced emails or any non-spam related email problems please do not hesitate to contact us.

Update to Customer Portal

We launched our new Customer Portal back in April and have had great customer feedback about it. There was, however, one issue we were aware of and which a thoughtful customer recently reminded us about: the interface didn’t work on iOS devices.

Actually it didn’t work well with mobile devices in general. Our initial plan was to introduce a mobile-friendly version, but we like to be on the cutting edge of technology and having a separate mobile version of a site is so last year!

Thankfully Bootstrap 3.0 was just released and is now fully responsive by default. This means sites built using the Bootstrap framework will automatically rearrange themselves to best fit the display in use. For example the main dashboard in our Customer Portal is a grid on a normal display, but turns into a list when viewed on a narrow display such as a mobile phone. You can see this in action by logging in to our customer portal on your iPad oriented vertically, then turning it horizontally.

Grids, tables, navigation menus and forms are all responsive now.

Upgrading was not a trivial task as a lot has changed between Bootstrap 2 and 3, but we think the result was worth it. Now our portal not only works but works beautifully on desktops, tablets and mobile devices!