When it comes to their cloud services, Google is extremely cautious. Just about every piece of data is backed up to multiple networked machines in some form, and workloads are kept running across thousands of different nodes. This strategy didn't do them much good a while back, when a wayward software bug knocked their entire Compute Engine platform offline for part of a day, but it definitely paid off on August 11. On that day, something went wrong with Google's cloud-based App Engine that set alarm bells ringing and error messages flying at Google, and left users of the platform watching their apps error out or take longer to load than normal. The whole incident lasted about two hours before Google managed to get a handle on things, and in typical Google fashion, they've embraced accountability and clarity by explaining all the gory details of what happened.
The whole ordeal took place between 1:13 PM and 3:00 PM Pacific Time, during a routine migration of apps between data centers to balance out traffic. Usually, the apps cross over in bits and pieces, and once they land on the target servers, their traffic starts getting "rescheduled," or sent over to the new servers automatically. On this occasion, however, the cluster of routers sitting in the middle of the migration chose that exact moment to update its firmware, triggering a mass restart. Normally, the disruption would have lasted only as long as the reboot itself, but the software driving Google's app servers doesn't gracefully handle errors and failures like this. When requests had to be routed around the downed routers and started taking longer, the automated scheduling processes concluded that the target machines were ignoring them and sent duplicate request after duplicate request. That made things incredibly unstable when the routers came back up and found themselves inundated with millions, if not billions, of requests.
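Google's postmortem doesn't give exact figures, but the pile-up described above is easy to model: every scheduler whose request times out fires off a duplicate, none of the earlier attempts are cancelled, and the backlog grows for as long as the routers are down. A minimal sketch of that arithmetic (the function, its parameters, and the numbers below are illustrative, not taken from Google's report):

```python
def queued_requests(schedulers, outage_s, timeout_s):
    """Model a naive retry pile-up during an outage.

    Each scheduler re-sends its request every time its timeout expires,
    and no earlier duplicate is ever cancelled. Returns how many
    requests are waiting when the routers finally come back.
    """
    retries_each = int(outage_s // timeout_s)   # timeouts per scheduler
    return schedulers * (1 + retries_each)      # original + duplicates
```

With, say, 1,000 schedulers, a 10-minute outage, and a 5-second timeout, `queued_requests(1000, 600, 5)` comes to 121,000 simultaneous requests from what began as 1,000 — which is why the routers were swamped the moment they rebooted.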
Normally, this would have resulted in the routers crashing outright and server traffic coming to a grinding halt; in short, total bedlam. Luckily, Google's people were on top of it from the start. To head off that outcome, engineers who were watching it all happen killed as many of the duplicate requests as they could and redirected the originals to servers not behind the rogue routers. The servers they chose, however, were in use for another project, and shoehorning App Engine traffic onto them caused significant slowdown. At 3:00 on the dot, engineers finished a temporary configuration change to those servers, and traffic started flowing back to the original target servers. For all intents and purposes, things were back to normal. To prevent this sort of thing in the future, Google plans to modify the system's retry behavior and add more routers, staggering their update schedules so that an entire cluster can't suddenly go dark like it did in this incident.
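Google hasn't published exactly how the retry behavior will change, but the standard remedy for this kind of retry storm is capped exponential backoff with jitter: each client waits longer after every failed attempt, and a random factor keeps thousands of clients from retrying in lockstep against a recovering router. A minimal sketch of the idea, assuming the common "full jitter" variant (the function name and parameters here are illustrative, not Google's):

```python
import random

def backoff_delays(max_attempts=5, base=0.5, cap=30.0, seed=None):
    """Generate retry delays using capped exponential backoff with
    full jitter: each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so simultaneous clients
    spread their retries out instead of hammering the target at once.
    """
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays
```

The exponential growth throttles how fast any one scheduler can pile up duplicates, while the jitter ensures that even a fleet of identical clients desynchronizes — exactly the failure mode that swamped the rebooting routers.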