
Google Does Damage Control; Issues Apology And Explanation For Yesterday’s Service Outage

January 25, 2014 - Written By Orlando Lambert

 

If you’re anything like me, you use a lot of different Google products throughout the day. I, for one, spend most of my day in Gmail, Hangouts and Google Plus. I tend to think of Google as pretty reliable when it comes to keeping the products I use up and running all the time. But sometimes Murphy’s Law strikes even the best of us, and apparently our favorite provider and innovator of Android is not immune. Yesterday, I attempted to access my Gmail and Google Calendar and was met with a notice that the service was unavailable. I wasn’t the only one, as many other Google users reported the outage too. It wasn’t long before Google fixed the issue and got everything running again. But, as anyone with a reputation for reliability knows, you should probably apologize to your customers when something they depend on doesn’t work.

To let users know that Google recognized the problem, Ben Treynor, Google’s VP of Engineering, went to the Official Google Blog to make a formal apology for the outage, saying “we strive to make all of Google’s services available and fast for you, all the time, and we missed the mark today.” Treynor acknowledges in the post that users who tried to access the affected services lost them for 25 to 30 minutes, and he states that some lost them for even longer. So I’m sure some users were, like me, pretty frustrated and wanted an explanation for why our beloved Google products were not working properly or at all. Luckily, Treynor provides a reason for why things went down. The following excerpt states the reason.

 “At 10:55 a.m. PST this morning, an internal system that generates configurations—essentially, information that tells other systems how to behave—encountered a software bug and generated an incorrect configuration. The incorrect configuration was sent to live services over the next 15 minutes, caused users’ requests for their data to be ignored, and those services, in turn, generated errors. Users began seeing these errors on affected services at 11:02 a.m., and at that time our internal monitoring alerted Google’s Site Reliability Team.”

The bottom line is this: there was a bug in the system that generates configurations, the information that tells all the different services we use how to work and operate. As a result, many of Google’s services returned errors to their users. It was an inconvenience for Google users like myself, but Google had the situation fixed by 11:30 a.m. PST. Hopefully, for most of Google’s users and our Android Headlines readers, this outage didn’t last long.

But this event will definitely bring some changes, as Treynor went on to state how his team plans to prevent the issue from happening again. He listed three things he and his team would do: “1. Correcting the bug in the configuration generator to prevent recurrence, and auditing all other critical configuration generation systems to ensure they do not contain a similar bug. 2. Adding additional input validation checks for configurations, so that a bad configuration generated in the future will not result in service disruption. 3. Adding additional targeted monitoring to more quickly detect and diagnose the cause of service failure.” With these changes in place, I hope that Google does, indeed, prevent outages like this one from occurring in the future. I love Google’s products like Gmail, Calendar, Drive, Docs and more, and I know all of our readers do too. Good job to Ben Treynor for giving an explanation and helping Google users understand exactly what happened.
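For readers curious what the second item on that list could look like in practice, here is a rough sketch of validating a generated configuration before it reaches live services. To be clear, this is purely illustrative: the field names (`service`, `backends`, `timeout_ms`) and both functions are my own hypothetical inventions, and Google’s post does not describe how its checks actually work.

```python
# Hypothetical sketch only: these names and rules are assumptions for
# illustration and do not come from Google's post.
REQUIRED_KEYS = {"service", "backends", "timeout_ms"}


def validate_config(config):
    """Return a list of problems found in a generated config (empty if OK)."""
    problems = []
    missing = REQUIRED_KEYS - set(config)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if not config.get("backends"):
        problems.append("no backends listed")
    timeout = config.get("timeout_ms")
    if not isinstance(timeout, int) or timeout <= 0:
        problems.append("timeout_ms must be a positive integer")
    return problems


def push_if_valid(new_config, live_config):
    """Keep the current live config unless the new one passes validation."""
    if validate_config(new_config):
        return live_config  # reject the bad config, keep serving the old one
    return new_config
```

The idea is simply that a bad configuration gets rejected at the door, so the services keep running on the last known good one instead of being told to misbehave.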

Source: Official Google Blog