22nd October 2018

503 errors on Pressero

Pressero services in CHI experienced an 8-minute outage, from 11:48 AM to 11:56 AM CT. The outage was caused by an automated configuration update that failed to apply on Friday, Oct. 19th, on our two slave load balancers. When one of our load balancers experienced network packet loss, the missing configuration left no servers available. As soon as we were notified internally, we switched the master load balancer through a failover process, which stopped the network issue. We then forced the automated process to run again, and this time it successfully updated the configuration file on all of our load balancer instances. The packet loss problem affects only a particular OS version running on some of our servers, and those servers are being changed.