Should SRE management hold blameless retrospectives for mistakes?

One of the core tenets of Site Reliability Engineering (SRE) is that blameless postmortems / retrospectives should be held for oncall incidents. It's part of the continuous improvement process where we learn from what went wrong and try to create processes to ensure it doesn't happen again. Very explicitly, it is not about blaming anyone for an error -- the worst personal outcome for an engineer should be a decision that perhaps we failed to train them adequately. As a simple example, if a system broke because it ran out of disk, you might determine in the retrospective that it would be a super good idea to have some sort of low disk space alert fire well before the system broke, so that you could intervene. It occurs to me that I've never seen the same process used for SRE management though, and that seems like an obvious gap to me now. Surely the same process of asking what went wrong and working out what mechanisms could be created to ensure that we're at least making new mistakes next time would be a good idea? Yet I've never seen an SRE management team willing to actually hold itself to…


Please don’t

A fresh cup mentions the Ruby on Rails exception notifier plugin. The idea is that every time an exception is raised in your code, you get an email. This is such a horrible idea that I need to take the time to comment. As someone who spends all his time dealing with large deployments of software, I think email is the worst way of reporting errors I can think of. Think about it:

Email is unreliable to deliver. It could get queued on the reporting server, on a mail router on the network, or on your delivery server. Worse than that, it could get marked as spam, or randomly discarded.

Email is expensive. There are two kinds of expense here. First, email needs to be written to disk reliably, which means you sync() when you write the mail to a destination or a queue. For some MTAs, this can mean several sync() calls per email as the mail moves between queues. There can be more than one of these MTAs on the way to the final delivery target as well. Second, storing email at the destination is expensive. Think of the backups, virus scanning, spam scanning, caching on clients and so forth.

Email is…

