Post-mortem: December 7th

Hi all,

On December 7th, our database had elevated error rates throughout the peak of the workday. As a result, many users were unable to update their Streak boxes and contacts, and data loaded only sporadically for some users during that period.

We know you rely on Streak to keep your business running and we apologize for the problems that this caused you. I wanted to give a little bit of context on what happened, how we resolved the issue, and what we’re doing to make sure it doesn’t happen again.

For background, Streak has historically been powered by the Google Cloud Datastore database. Cloud Datastore is very reliable and easy to maintain, but it restricts how we can access data. For instance, if we want to get all boxes connected to the contacts in an email thread, we first have to fetch all of the contacts connected to the thread, and then, in a second step, fetch all of the boxes connected to those contacts. This makes the Streak experience slower and limits the amount of context we can give you about who you’re talking to, which means more manual work for you.

To better support these kinds of lookups, we’ve been migrating some data from Cloud Datastore onto a platform based on MySQL, an industry-standard database that provides better support for these context-based questions. Early in the migration, we ran into problems where query performance was limited by the disk performance of Google’s hosted MySQL service. To work around this limitation, we dramatically over-provisioned processing power and memory to make up for the disk performance shortfall.
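As a rough illustration of the difference described above, here is a minimal sketch of the two access patterns. Everything in it is an assumption made for the example: the table names (thread_contacts, box_contacts), the sample rows, and the function names are hypothetical and are not Streak's actual schema or code, and an in-memory SQLite database stands in for MySQL so the snippet runs anywhere.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE thread_contacts (thread_id TEXT, contact_id TEXT);
    CREATE TABLE box_contacts (box_id TEXT, contact_id TEXT);
    INSERT INTO thread_contacts VALUES
        ('t1', 'alice'), ('t1', 'bob'), ('t2', 'carol');
    INSERT INTO box_contacts VALUES
        ('deal-1', 'alice'), ('deal-2', 'bob'),
        ('deal-2', 'alice'), ('deal-3', 'carol');
""")


def boxes_two_step(thread_id):
    """Two-step pattern: fetch the contacts, then look up boxes per contact."""
    # Step 1: fetch every contact connected to the thread.
    contact_ids = [
        row[0]
        for row in conn.execute(
            "SELECT contact_id FROM thread_contacts WHERE thread_id = ?",
            (thread_id,),
        )
    ]
    # Step 2: one extra lookup per contact, merged in application code.
    boxes = set()
    for contact_id in contact_ids:
        rows = conn.execute(
            "SELECT box_id FROM box_contacts WHERE contact_id = ?",
            (contact_id,),
        )
        boxes.update(row[0] for row in rows)
    return sorted(boxes)


def boxes_single_join(thread_id):
    """Relational pattern: the database joins both tables in one query."""
    rows = conn.execute(
        """
        SELECT DISTINCT b.box_id
        FROM thread_contacts AS t
        JOIN box_contacts AS b ON b.contact_id = t.contact_id
        WHERE t.thread_id = ?
        ORDER BY b.box_id
        """,
        (thread_id,),
    )
    return [row[0] for row in rows]
```

Both functions return the same boxes for a given thread. The first needs one round trip for the contacts plus one per contact, while the second asks the database a single question, which is the kind of query a relational database like MySQL is designed to answer.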

We wanted to make sure we had a stable foundation for future work, so on Sunday, we moved to a different set of instances that have much better disk performance. As part of that move, we went back to the instance size we were using before the disk issues. Unfortunately, in the intervening period, we had deployed additional queries that had come to depend on more of the extra processing power than we anticipated, and this didn’t become evident until we hit the workday peak. Since our database was at full capacity, it wasn’t feasible to migrate to a larger instance until the workload lessened as folks signed off for the evening in Europe and North America. We made some gains by optimizing queries, but the error rate and latency remained higher than is acceptable for the remainder of the workday.

On the evening of December 7th, we migrated to instances that have both the better disk performance and the higher processing power and memory. Our metrics are back in their target range, and we’ve added more monitoring in this area.

We’ve also taken process steps to ensure that, where possible, we add capacity before we need it in future migrations. We appreciate your trust in us and apologize again for the issue.

Sincerely,

Fred Wulff
Engineering @ Streak

Post-mortem: December 7th was originally published in Streak on Medium.
