Reliability – Made by Mikal

Should SRE management hold blameless retrospectives for mistakes?

Posted on May 4, 2024 by mikal

One of the core tenets of Site Reliability Engineering (SRE) is that blameless postmortems / retrospectives should be held for on-call incidents. It's part of the continuous improvement process, where we learn from what went wrong and try to create processes to ensure it doesn’t happen again. Very explicitly, it is not about blaming anyone…

Please don’t

Posted on January 21, 2007, updated April 9, 2018, by mikal

A fresh cup mentions the Ruby on Rails exception notifier plugin. The idea is that every time an exception is raised in your code, you get an email. This is such a horrible idea that I need to take the time to comment. As someone who spends all his time dealing with large deployments of…