Google Explains App Engine Errors from August 11th, 2016

When it comes to their cloud services, Google is extremely cautious. Just about every piece of data is backed up to multiple networked machines in some form, and things are kept running across thousands of different nodes. That strategy didn't do them much good a while back, when a wayward software bug knocked their entire Compute Engine platform offline for part of a day, but it definitely paid off on August 11. On that day, something went wrong with Google's cloud-based App Engine that set alarm bells ringing and error messages flying, and left users of the platform seeing their apps error out or take longer to load than normal. The whole incident lasted about two hours before Google managed to get a handle on things, and in typical Google fashion, they've embraced accountability and clarity by explaining all the gory details of what happened.

The whole ordeal took place between 1:13 PM and 3:00 PM Pacific Time, during a routine migration of apps between data centers to balance out traffic. Usually, the apps cross over in bits and pieces, and once the core apps are on the target servers, their traffic starts getting "rescheduled", or sent over to the new servers automatically. On this occasion, however, the cluster of routers sitting in the middle of the app migration chose right then to update its firmware, which triggered a mass restart. Normally, the disruption would only last until the routers finished rebooting, but the automation driving Google's app servers doesn't handle errors and failures like this gracefully. When requests had to be routed around the downed routers and started taking longer, the automation concluded that the target machines were ignoring it and fired off duplicate request after duplicate request. That made things incredibly unstable when the routers woke back up and found themselves inundated with millions, if not billions, of requests.
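To get a sense of why slow responses can snowball like that, here's a minimal, hypothetical sketch in Python. None of this is Google's actual scheduler code, and send_request is just a stand-in callable; the point is simply that a naive retry loop keeps piling duplicate requests onto a struggling backend, while a capped retry with exponential backoff and jitter gives it room to recover.

```python
import random
import time

# Hypothetical sketch only -- not Google's scheduler. It contrasts a naive
# retry loop (which stacks up duplicates whenever the backend is slow) with
# a bounded, backed-off retry.

def naive_retry(send_request, timeout=0.5):
    """Re-sends immediately whenever a response is slow. Every timeout
    produces another duplicate, so a brief router restart turns into a
    flood of queued-up requests."""
    while True:
        if send_request(timeout=timeout):
            return True
        # No delay and no attempt cap: duplicates pile up for as long as
        # the path stays slow.

def backoff_retry(send_request, timeout=0.5, max_attempts=5, base_delay=0.1):
    """Retries a bounded number of times, doubling the wait each attempt
    and adding jitter so many clients don't all retry at the same instant."""
    for attempt in range(max_attempts):
        if send_request(timeout=timeout):
            return True
        delay = base_delay * (2 ** attempt)
        time.sleep(delay + random.uniform(0, delay))  # jitter spreads retries out
    return False
```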

Normally, this would have resulted in the routers crashing outright and traffic grinding to a halt; in short, total bedlam. Luckily, Google's engineers were on top of it from the start. To keep that from happening, the engineers watching it all unfold killed as many of the duplicate requests as they could and redirected the original ones to servers not linked to the troubled routers. The servers they chose, however, were in use for another project, and shoehorning App Engine traffic onto them resulted in significant slowdown. At 3:00 on the dot, engineers finished a temporary configuration change on those borrowed servers, and traffic started flowing back to the originally intended target servers. For all intents and purposes, things were back to normal. To prevent this sort of thing in the future, Google plans to modify the system's retry behavior and add more routers, putting them on different update schedules so that an entire cluster doesn't suddenly go dark like it did in this incident.
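As for putting router clusters on different schedules, the idea is simply to make sure no two clusters restart for maintenance at the same time, so the rest of the fleet keeps carrying traffic. Here's a rough, hypothetical illustration of that kind of staggering; the cluster names and window spacing are made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical illustration of staggered firmware-update windows. The
# cluster names, start time, and six-hour gap are invented for the example.

ROUTER_CLUSTERS = ["cluster-a", "cluster-b", "cluster-c", "cluster-d"]

def staggered_update_windows(clusters, start, gap_hours=6):
    """Assign each cluster its own maintenance window, spaced gap_hours
    apart, so only one cluster is restarting at any given time."""
    return {
        cluster: start + timedelta(hours=i * gap_hours)
        for i, cluster in enumerate(clusters)
    }

if __name__ == "__main__":
    schedule = staggered_update_windows(ROUTER_CLUSTERS, datetime(2016, 8, 12, 2, 0))
    for cluster, window in schedule.items():
        print(f"{cluster}: firmware update at {window:%Y-%m-%d %H:%M}")
```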
