[Complete] SpamTitan server upgrade

2016-02-19 17:55 GMT

As announced 2 weeks ago, we are performing an upgrade on our SpamTitan server. We are commencing work now and will update this blog when complete.

There will be a brief interruption to the filtering of inbound mail; we expect this to last no more than 10 minutes.

Update 18:20 GMT: the disruption was less than 5 minutes, and mail is now flowing through the new server. We’ve had to make a few manual config tweaks that weren’t carried over automatically by the configuration export/import tool, but overall the process went fairly smoothly.

Update 18:30 GMT: we’ve tested and monitored log files for the past 20 minutes and everything is running smoothly. The old SpamTitan server is available at https://oldspamtitan.anu.net/, we’ll leave it running for 14 days (after which all quarantined mail will have expired) in case any customers need to log in and retrieve any false positives from it.

A brief analysis of the ongoing anuhosting.net outage

While we wait and watch the painfully slow process of restoring our anuhosting.net shared hosting and reseller server, here is a brief analysis of what happened and what's going on.

Last year, we spent almost £20,000 updating our server and network hardware in our Amsterdam 1 datacenter, to accommodate continued growth in shared hosting and reseller hosting services. We put in 10Gbps Ethernet, new 16-core/256GB RAM servers and pure SSD RAID storage arrays that are over 10 times faster than spinning disks.

We planned (and still plan) to continue investing in our physical infrastructure this year, including adding high-speed, replicated, redundant network-attached storage arrays and new backup servers.

The server that failed this morning was just 8 months old. We can only assume at this point that we received two SSDs from a bad batch, and that they failed at almost exactly the same time, leaving the RAID no time to rebuild.

The storage for our anuhosting.net server was on the RAID array that failed, and was backed up daily to a separate backup server. Those incremental backups run overnight in the background, over 1Gbps Ethernet.
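For context on what those incremental backups look like: the exact tooling isn't named above, but a minimal sketch, assuming an rsync-style hard-linked snapshot scheme (all hostnames and paths here are hypothetical placeholders, not our real setup), would be something like this:

```python
import datetime
import subprocess

# Hypothetical placeholders -- the real hosts and paths are not named above.
SOURCE = "root@anuhosting:/var/www/"   # live data on the production server
BACKUP_ROOT = "/backup/anuhosting"     # spinning SATA array on the backup server

def nightly_incremental_backup():
    """Create a dated snapshot; files unchanged since yesterday are
    hard-linked rather than copied, so each night only transfers the delta."""
    today = datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    subprocess.run(
        [
            "rsync", "-a", "--delete",
            f"--link-dest={BACKUP_ROOT}/{yesterday.isoformat()}",
            SOURCE,
            f"{BACKUP_ROOT}/{today.isoformat()}",
        ],
        check=True,
    )

if __name__ == "__main__":
    nightly_incremental_backup()
```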

The problem we are facing today is that the sheer volume of data needed to restore the server is taking many, many hours to copy from the backup server's spinning SATA disks over 1Gbps Ethernet. Our (woefully inadequate) backup plan for a disaster like this was to take a fresh copy of all the data, put it on a new server and fire it up. We do this sort of thing regularly for routine jobs like cloning servers or testing, but what we failed to account for in this case was how long it would take to copy all the data.
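To put some rough numbers on that: the exact data volume isn't stated here, but assuming a couple of terabytes for illustration, the arithmetic for a full copy over 1Gbps looks roughly like this (and seeks on spinning disks only make it worse):

```python
# Rough restore-time arithmetic. The actual data volume isn't stated in the
# post; 2 TB is purely an assumed figure for illustration.
data_tb = 2.0
link_gbps = 1.0        # 1Gbps Ethernet between the backup and the new server
efficiency = 0.8       # allow for protocol overhead and disk seek time

effective_mb_per_s = link_gbps * 1000 / 8 * efficiency    # ~100 MB/s
hours = (data_tb * 1e6) / effective_mb_per_s / 3600
print(f"~{hours:.1f} hours to copy {data_tb} TB")         # ~5.6 hours, best case
```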

Plan B was that instead of copying all the data, we'd spin up a new virtual machine and connect it directly to the data stored on the backup server. The theory was that we could at least get sites back online, even if they ran somewhat slowly. Around midday today, we made the call to implement plan B. Again, though, we failed to account for the I/O bandwidth required to run anuhosting.net, and almost as soon as we booted it up, the server crashed due to insufficient I/O capacity.
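A back-of-the-envelope comparison shows why plan B was doomed. Every figure below is an illustrative assumption, not a measurement from our hardware:

```python
# Back-of-the-envelope I/O budget for plan B. All figures are assumptions.
iops_per_sata_disk = 150      # typical random IOPS for a 7200rpm SATA disk
backup_array_disks = 4        # assumed number of disks in the backup array
ssd_array_iops = 50_000       # rough figure for the failed pure-SSD array

available_iops = iops_per_sata_disk * backup_array_disks  # ~600 IOPS
shortfall = ssd_array_iops / available_iops
print(f"plan B offered ~{available_iops} random IOPS, roughly "
      f"{shortfall:.0f}x less than the workload had been running on")
```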

So we came up with plan C: continue copying the data from the backup server in the background, and start minimal services on anuhosting.net so we could at least restore some functionality.

That’s where we’re currently at: DNS and email are running, while the restore process continues in the background (albeit at a slower pace, due to increased I/O load).

Once the restore process completes, we will temporarily shut down DNS and email services again while we synchronise the latest changed data to the new server, and boot up from local storage. At this point we will be able to start up the Apache/PHP/MySQL servers again, as well as DNS and email.
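For the curious, that cutover step looks roughly like the sketch below, assuming an rsync-style final delta sync; the hostnames, paths and service names are hypothetical stand-ins, not our actual configuration:

```python
import subprocess

# Hypothetical names -- a sketch of the cutover, not our actual configuration.
BACKUP_LATEST = "backup1:/backup/anuhosting/latest/"  # assumed backup path
LOCAL_STORAGE = "/srv/anuhosting/"                    # local SSD on new server

def final_sync_and_cutover():
    # 1. Stop DNS and email so nothing changes mid-copy.
    subprocess.run(["systemctl", "stop", "named", "postfix"], check=True)
    # 2. Delta sync: rsync only transfers files changed since the bulk copy.
    subprocess.run(
        ["rsync", "-a", "--delete", BACKUP_LATEST, LOCAL_STORAGE], check=True
    )
    # 3. Bring all services up from fast local storage.
    subprocess.run(
        ["systemctl", "start", "named", "postfix", "mysql", "apache2"],
        check=True,
    )
```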

We don’t know exactly how long this will take; at a guess, around 4 hours.

We know where we went wrong, and we know what needs to be done to fix our infrastructure going forward. All we can ask for is continued patience and understanding from our customers while we keep working to restore service. Be assured we are working as fast as we possibly can.

[Resolved] Amsterdam 1 Server Failure

2016-02-15 06:00 GMT: 2 drives failed at the same time in a server at our Amsterdam 1 datacenter, causing the RAID array to fail. This has resulted in several clients and some internal infrastructure going offline (including our support helpdesk).

We are currently working hard to restore backups onto another server and will provide regular updates. If you want to contact us, you can tweet @anuinternet.

Update 08:55 GMT: our support desk is back online; some virtual machines are already back up, while others are still restoring from backup.

Update 10:15 GMT: all servers except the anuhosting.net shared/reseller server are now back online. anuhosting.net has been priority #1 since we started the restore procedure 4 hours ago; unfortunately, it is also by far the largest and is taking quite some time to restore. We estimate it may take another 2-3 hours to complete.

Update 13:40 GMT: we are working on recovering services on anuhosting.net; we aim to have mail services running very shortly, followed by MySQL and Apache/PHP. Apologies for the ongoing service interruption.

Update 14:00 GMT: DNS and mail on anuhosting.net are operational again. MySQL was unable to recover InnoDB to a functional state, so we are restoring the last consistent database snapshot available, which is from 04:00 on Sunday. We will make the recovered InnoDB databases from the latest backup available to anyone who wants to try to extract missing data from Sunday. Most of the data is there, but we were unable to recover it to a fully functional state, so we made the decision to roll back to a known good copy. We expect PHP/MySQL/Apache services to be back online within 30 minutes.
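If you do want to attempt that extraction yourself, one common approach (a sketch of the general technique, not our exact procedure) is to start MySQL with innodb_force_recovery and dump whatever is readable; the database name below is a hypothetical placeholder:

```python
import subprocess

# Sketch of the general innodb_force_recovery technique, not our exact
# procedure. Levels 1-6 let mysqld start past increasing amounts of
# corruption (higher levels skip more recovery steps and are riskier);
# with the server up, mysqldump extracts whatever is still readable.
#
# First, in my.cnf under [mysqld], set e.g.:
#   innodb_force_recovery = 4
# then restart MySQL and run:

DATABASES = ["example_db"]   # hypothetical database name

def dump_recoverable_data():
    for db in DATABASES:
        with open(f"{db}.sql", "w") as out:
            # If the dump aborts on an unreadable row, retry the whole
            # process with a higher innodb_force_recovery level.
            subprocess.run(["mysqldump", db], stdout=out, check=True)

if __name__ == "__main__":
    dump_recoverable_data()
```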

Update 15:30 GMT: we are running into problem after problem with restoring web functionality and do not currently have an ETA. We are working as fast as possible to recover MySQL, Apache and PHP services on anuhosting.net. All other services are currently operational. Our sincere apologies for the continued downtime on shared and reseller hosting.

Update 01:30 GMT: It’s been a very long day; thankfully, at this point we can say our anuhosting.net server is finally operational again. We have spent the past half hour testing as many sites as possible, and things seem to be running well. We are concerned there may be a handful of InnoDB tables with errors; if your site is not functioning 100%, this may be the cause. Please contact support@anu.net ASAP and we will do what we can to help get you back up and running. We will be on hand tomorrow to answer any questions and help with any remaining issues.

A big thank you to all our customers for their patience, understanding and encouragement throughout this difficult day.

We will of course follow up with a detailed review of our storage systems, redundancy measures, backup and disaster recovery plans.