The Cloudmin VM management system for our Amsterdam 2 datacenter crashed last night (11th October). The crash was due to a bug in the VM status collection system which caused it to use excessive resources and eventually run out of memory altogether.
We have resolved this issue by restarting the Cloudmin server. There is also an unofficial patch for this bug which we have now applied, pending the next maintenance release of Cloudmin which will fix this known issue.
Services affected: DNS resolution for the ams2-cloudmin.anu.net zone, Web GUI VM management for VMs in the Amsterdam 2 datacenter, and API/customer portal management of VMs in the Amsterdam 2 datacenter.
A power distribution unit failure at our Amsterdam 1 datacenter this morning took out about a quarter of our Xen hosts there. Engineers are en route to replace the failed unit.
In the meantime we have rebooted all affected virtual servers using spare capacity on the remaining Xen hosts. No data has been lost, as our redundant centralised storage servers were not affected.
Our customer portal, the shared.anu.net Lasso/PHP hosting server, and a handful of customer VMs were briefly affected by the outage, but all have now been restored.
Update 11:45 CET: full service restored, SpamTitan is back in action!
Update 11:00 CET: while SpamTitan continue the restore process, we have put in rudimentary spam and virus filtering on our backup system to cut out the most obvious spam and viruses.
Update 2013-10-01 10:10 CET: SpamTitan support are on the case, shouldn’t be much longer now.
Update 23:30 CET: replacement hardware is in place and SpamTitan is running; however, it appears our backup may have grown too large to restore via the Web GUI. Waiting for vendor support to proceed.
Update 20:55 CET: our technician is still working on the hardware, but it’s not looking promising. Time for plan B, a new install on alternate hardware. We do maintain a daily backup of the SpamTitan configuration, so we will be able to restore service, but any false positives stuck in quarantine may be lost.
11:00 CET: We are currently experiencing an outage on our spam filtering system SpamTitan. This affects incoming email for most of our Hosted Email customers, and customers with their own mail servers for whom we are providing spam filtering.
The root cause is a hardware failure. We have an engineer en route with replacement hardware; however, the ETA for a fix is not before 18:30 CET. In the meantime we have implemented a bypass system for incoming email so messages will be delivered, but without being filtered. You may therefore notice a large increase in the amount of spam arriving in your inbox today.
We apologise for the inconvenience caused.
If you are experiencing bounced emails or any email problems not related to spam, please do not hesitate to contact us.