Last October, I published a blog post describing the efforts the Bitbucket Cloud engineering team has committed to in order to achieve world-class reliability. A lot has happened in the past year (understatement of the year)! What the team has accomplished is tremendous, but we've also learned a thing or two about where we can improve further. In this post, I'd like to address our recent reliability issues, the lessons we've learned from them, and provide an update on some of the performance work we've done over the last 12 months.
Living up to one of Atlassian’s values, open company no bullsh*t, we wanted to lift the curtain and provide an overview of the reliability issues we have seen in October. We strive for 99.9% across the board but we have not lived up to this goal consistently. The team and I know all too well that even if our services are available the vast majority of the time, thirty minutes of degraded performance can be incredibly disruptive, especially if it occurs during your team’s core working hours.
Over the past few weeks, we’ve had several incidents that may have impacted your teams. These incidents have highlighted that there is still plenty of room for us to grow. One incident in particular lasted for 11 hours, and I want to share with you a little about what happened.
On the morning of October 6, automated alerts started notifying our engineering teams that something was wrong, including increased memory usage on some hosts, elevated error rates, and extended end-to-end delivery times for outbound webhooks. The incident response team quickly identified that many of our queues responsible for managing background tasks were backing up, with worker processes failing to process tasks quickly enough.
We were able to mitigate customer impact by making configuration changes to ease the pressure on our queuing infrastructure and restarting many of the worker processes that were failing to keep up with load, noting that some had completely run out of available memory. These changes led to some short-term improvement, but before long the same issues resurfaced: hosts were running out of memory, worker processes were dying, and queues were growing. As a result, many of our background processes, such as webhooks and merging pull requests, were failing or timing out.
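Restarting workers that have exhausted their memory is the kind of pressure-relief step that can be automated. As a minimal, purely illustrative sketch (the names, limit, and data are hypothetical, not Bitbucket's actual tooling), a watchdog might flag any worker whose resident memory exceeds a limit for restart:

```python
# Hypothetical watchdog sketch: flag workers for restart when their
# resident memory (in MB) exceeds a limit. All names and thresholds
# here are illustrative assumptions, not real Bitbucket infrastructure.

MEMORY_LIMIT_MB = 512  # illustrative per-worker memory budget


def workers_to_restart(worker_memory: dict[str, float]) -> list[str]:
    """Return the names of workers whose memory exceeds the limit."""
    return [
        name
        for name, mem_mb in sorted(worker_memory.items())
        if mem_mb > MEMORY_LIMIT_MB
    ]


if __name__ == "__main__":
    # A snapshot of current worker memory usage (made-up numbers).
    snapshot = {"worker-1": 130.0, "worker-2": 760.5, "worker-3": 498.2}
    print(workers_to_restart(snapshot))
```

In a real system the snapshot would come from process-level metrics, and the restart itself would go through the orchestrator rather than a hard kill, so in-flight tasks can be drained or requeued.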
Under normal circumstances, the team would have quickly rolled back to the prior release as a precautionary measure, to rule out a code change as the culprit for the incident. We did not pursue that option as quickly as we should have in this case, for a few reasons.
From our logging and metrics, we could see what looked like the beginning signs of this issue dating back almost a full week. It simply hadn't crossed the threshold for alerting our teams until the morning of the incident. Rolling back a single release is one thing, but rolling back an entire week's worth of changes carries very high risk, often more risk than we can tolerate, even in the face of a major incident, since the last thing we want to do is make things worse. The timing of the incident itself also did not clearly line up with the latest code release. While this is no guarantee that the code release wasn't responsible, it is typically an indicator that the release is not the root cause.
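A static alert threshold is exactly the kind of check that can stay quiet while a problem builds up slowly. As a toy illustration (the numbers and function names are made up, not our actual alerting rules), compare a static queue-depth threshold with a simple growth-over-the-window check:

```python
# Illustrative sketch: a static threshold can miss a slow, week-long
# build-up that a trend-based check would catch. Numbers are made up.

ALERT_THRESHOLD = 10_000  # static queue-depth threshold (hypothetical)


def static_alert(depth: int) -> bool:
    """Fire only when the current depth crosses the static threshold."""
    return depth > ALERT_THRESHOLD


def trend_alert(depths: list[int], growth_factor: float = 2.0) -> bool:
    """Fire when queue depth has grown by growth_factor over the window,
    even if the absolute depth is still under the static threshold."""
    if len(depths) < 2 or depths[0] <= 0:
        return False
    return depths[-1] / depths[0] >= growth_factor


# A week of daily queue depths: every reading is under 10,000,
# yet the backlog is clearly compounding.
week = [500, 800, 1300, 2100, 3400, 5500, 8900]
print(static_alert(week[-1]))  # False: never crossed the static threshold
print(trend_alert(week))       # True: roughly 18x growth over the week
```

A trend check like this would have paged someone days earlier in the scenario above, which is one reason alerting on rate of change, not just absolute values, is a common follow-up action after incidents of this shape.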