Last week, our site sustained a prolonged outage during core business hours. While testing their backup power systems, our data centre provider tripped a breaker, setting off a cascade of failures that, among other things, produced a power surge that fried our hardware firewall’s power supply. The hardware firewall is one of those standard pieces of system hardware that is so simple it is assumed to be failure resistant – one of the pieces of an infrastructure least likely to fail. The reality is that these firewalls are antiquated, commodity hardware that the hosting provider paid off long ago and that has sustained the load of numerous sites before ours. The question is not if they’re going to fail, but when. And the implications of their failure are quite severe.
By design, the firewall serves as the single access point into and out of the site. Even though we had taken redundancy and failover measures in the web server, application server and database clusters behind the firewall, none of that matters much if the traffic can’t get through. Essentially, the hardware firewall is a big old SPOF (single point of failure).
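To put a rough number on why a single firewall undoes the redundancy behind it, here is a minimal back-of-the-envelope sketch. The availability figures are hypothetical, not measurements from our site, and the helper names `parallel` and `serial` are made up for illustration. It also assumes components fail independently, which a shared power event like ours plainly violates, so treat the result as an optimistic bound.

```python
from math import prod


def parallel(availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when any one suffices.

    Assumes failures are independent (an optimistic assumption).
    """
    return 1 - (1 - availability) ** copies


def serial(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


# Hypothetical figures, for illustration only.
NODE = 0.99      # a single web, application or database server
FIREWALL = 0.99  # a single hardware firewall

# Three redundant tiers behind the firewall, two nodes each.
clusters = serial(parallel(NODE, 2), parallel(NODE, 2), parallel(NODE, 2))

# All traffic passes through the firewall, so it sits in series with the tiers.
single_firewall = serial(FIREWALL, clusters)
paired_firewall = serial(parallel(FIREWALL, 2), clusters)

if __name__ == "__main__":
    print(f"clusters alone:      {clusters:.6f}")
    print(f"with one firewall:   {single_firewall:.6f}")
    print(f"with two firewalls:  {paired_firewall:.6f}")
```

Whatever figures you plug in, the site as a whole can never be more available than the single firewall in front of it, which is the whole problem with a SPOF.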
Normally a hardware power supply is one of those things that a data centre can replace very quickly. But when the data centre itself is in turmoil because of a significant outage, replacing a power supply for some small customer is the last thing on their mind. When it comes down to it, the only ones who care about your site are you and your customers. We, of course, knew about the failure immediately because of the monitoring we have in place. But when managed hardware fails, there’s not much you can do except log a ticket (assuming that the ticketing system is up – which in this case it wasn’t), sit back and wait. Of course there are SLAs in place (there’s a one-hour replacement window on these types of things), but they don’t keep your site up, and going through the negotiations to sort out the ramifications of a failure is a big waste of everyone’s time. The bottom line is that we need to eliminate this SPOF from our infrastructure by obtaining a secondary firewall that we can fail over to.