Topic: Unscheduled outage 6/29/12

We're currently experiencing an unscheduled outage of the main chumby.com service.

We're working to bring the system back online.

Re: Unscheduled outage 6/29/12

Service was restored to normal at approximately 3pm PDT.

Thanks to chumbians JoeFox and adh for help above and beyond.

Re: Unscheduled outage 6/29/12

As some of you know, we went down again today.

Here's what's going on - Amazon's us-east-1 datacenter has been having some issues with its Relational Database Service (RDS), which is the database system holding all of the chumby data.

What appears to be happening is frequent premature disconnects between the EC2 instances running the web servers and the main database.  MySQL has a built-in trigger (governed by its max_connect_errors setting): when too many premature disconnects from a server occur without a successful connection, it assumes it's being hacked and blocks incoming connections from that server until it is explicitly given a command to clear the error and resume accepting connections.
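
For anyone curious how that blocking behaves, here is a rough toy model in Python. It is only a sketch of the idea, not MySQL's actual code: the class name, the host names "web1" and "insignia", and the threshold of 3 are made up for illustration, and the exact point at which the real server blocks a host may differ by one from this model.

```python
class HostBlocker:
    """Toy model of per-host connection-error blocking (not MySQL's real code)."""

    def __init__(self, max_connect_errors=10):
        # Threshold of consecutive interrupted connections before a host is blocked.
        self.max_connect_errors = max_connect_errors
        self.errors = {}  # host -> consecutive interrupted connections

    def is_blocked(self, host):
        return self.errors.get(host, 0) >= self.max_connect_errors

    def connect(self, host, interrupted):
        """Return True if the connection succeeds, False otherwise."""
        if self.is_blocked(host):
            return False  # blocked hosts are refused outright
        if interrupted:
            self.errors[host] = self.errors.get(host, 0) + 1
            return False
        self.errors[host] = 0  # a successful connection resets the count
        return True

    def flush_hosts(self):
        """Clear all error counts, like the explicit 'clear the error' command."""
        self.errors.clear()


db = HostBlocker(max_connect_errors=3)
for _ in range(3):
    db.connect("web1", interrupted=True)          # three premature disconnects
print(db.connect("web1", interrupted=False))      # False: web1 is now blocked
print(db.connect("insignia", interrupted=False))  # True: other hosts unaffected
db.flush_hosts()
print(db.connect("web1", interrupted=False))      # True: block cleared
```

The point of the model is that the database itself stays healthy the whole time; only the hosts that hit the error threshold are locked out until someone clears the counts.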

The whole time the system appeared to be down, it really wasn't - the database was running and completely operational from a parallel web server hosted under "insignia.chumby.com", which we use to provide a branded experience for Infocast and Insignia TV users.  The database had just blocked the systems that are used most frequently.  The web servers, the forum, the wiki and the content servers were all up and running.

To compound the problem, a storm on Friday night greatly impaired RDS at that datacenter, and as RDS came back up it produced the same kind of disconnect errors, which set off the same trigger.

As of this writing, that issue is still ongoing and the RDS service in us-east-1 is still impaired.  Note that several other companies (Pinterest, Heroku, Instagram and others) are similarly impaired.

There is a way to change the threshold for the trigger, which we'll try to get to - however, as long as Amazon is having these issues, this may happen again without notice. We'll try to stay on top of it, but given that we all have other responsibilities now, there may be some delay.
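
As a rough sketch of what the recovery looks like on a stock MySQL server: a blocked host is reported with error 1129, the block is cleared with FLUSH HOSTS, and the threshold is the max_connect_errors variable. The helper below is hypothetical (the function name and the example threshold are made up for illustration) and only builds the statements rather than running them. Two caveats: the statements have to be issued from a host that is not itself blocked, and on a managed service such as RDS the threshold may have to be changed through the service's own parameter configuration rather than SET GLOBAL.

```python
# MySQL reports a blocked client host with error 1129 (ER_HOST_IS_BLOCKED).
ER_HOST_IS_BLOCKED = 1129


def recovery_statements(errno, new_threshold=None):
    """Return the SQL an operator might run in response to a connection error.

    Hypothetical helper: it only builds the statements, it does not run them.
    """
    if errno != ER_HOST_IS_BLOCKED:
        return []  # some other failure; flushing hosts would not help
    statements = ["FLUSH HOSTS"]  # clears the per-host error counts
    if new_threshold is not None:
        # Raising the threshold makes the block less likely to recur.
        statements.append("SET GLOBAL max_connect_errors = %d" % int(new_threshold))
    return statements


print(recovery_statements(ER_HOST_IS_BLOCKED, new_threshold=10000))
```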