devops – Made by Mikal

Should SRE management hold blameless retrospectives for mistakes?

Post author: mikal
Post published: May 4, 2024
Post category: Reliability

One of the core tenets of Site Reliability Engineering (SRE) is that blameless postmortems / retrospectives should be held for oncall incidents. It's part of the continuous improvement process where we learn from what went wrong and try to create processes to ensure it doesn't happen again. Very explicitly, it is not about blaming anyone for an error -- the worst personal outcome for an engineer should be a decision that perhaps we failed to train them adequately. A simple example: if a system broke because it ran out of disk, you might determine in the retrospective that it would be a super good idea to have some sort of low disk space alert fire well before the system broke, so you could intervene. It occurs to me that I've never seen the same process used for SRE management though, and that seems like an obvious gap to me now. Surely the same process of asking what went wrong, and working out what mechanisms could be created to ensure that we're at least making new mistakes next time, would be a good idea? Yet I've never seen an SRE management team willing to actually hold itself to…

Writing a terraform remote state server

Post author: mikal
Post published: January 14, 2020
Post category: Terraform

Terraform is a useful tool for deploying cloud resources. This post isn't an introduction to terraform, so I'll assume you already know and love it. If you want more, then this getting started guide would be a sensible start. At its most basic level, terraform deploys cloud resources and stores information about those resources in a file on local disk called terraform.tfstate -- it needs that state information so it can make later changes to the deployment, be those modifying resources in use or tearing the whole deployment down. If you had an operations team working on an environment, then you could store the tfstate file in git or a shared filesystem so that the entire team could manage the deployment. However, nothing in that approach stops two members of the team from making overlapping changes. That's where terraform state servers come in. State servers can optionally implement locking, which stops overlapping operations from happening. The protocol that these servers speak isn't well documented (that I could find). I wanted to explore that more, so I wrote a simple terraform HTTP state server in python. To use this state server, configure your terraform file as per demo.tf. The…
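To make the idea of a locking state server a bit more concrete, here is a minimal sketch of one in python. It is not the server from the post (which isn't shown in this excerpt), just an illustration under some assumptions: terraform's http backend fetches state with GET, updates it with POST, purges it with DELETE, and by default uses the custom LOCK and UNLOCK methods for locking, expecting a 423 (or 409) response carrying the holder's lock info when the lock is already taken. The names StateStore, StateHandler and serve, and port 5000, are made up for this example. State lives in memory only, and the sketch doesn't check lock IDs on POST or UNLOCK, which a real server probably should.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit


class StateStore:
    """In-memory state and lock storage, keyed by URL path (hypothetical helper)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._states = {}
        self._locks = {}

    def get(self, key):
        with self._mutex:
            return self._states.get(key)

    def put(self, key, body):
        with self._mutex:
            self._states[key] = body

    def delete(self, key):
        with self._mutex:
            self._states.pop(key, None)

    def lock(self, key, info):
        """Return (True, info) if the lock was taken, else (False, holder's info)."""
        with self._mutex:
            holder = self._locks.get(key)
            if holder is not None:
                return False, holder
            self._locks[key] = info
            return True, info

    def unlock(self, key):
        # Assumption: no lock ID check, anyone can release the lock.
        with self._mutex:
            self._locks.pop(key, None)


STORE = StateStore()


class StateHandler(BaseHTTPRequestHandler):
    def _key(self):
        # Drop the query string so requests with and without one share a key.
        return urlsplit(self.path).path

    def _body(self):
        length = int(self.headers.get("Content-Length", 0))
        return self.rfile.read(length)

    def _reply(self, code, body=b""):
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        state = STORE.get(self._key())
        if state is None:
            self._reply(404)  # no state stored yet
        else:
            self._reply(200, state)

    def do_POST(self):
        STORE.put(self._key(), self._body())
        self._reply(200)

    def do_DELETE(self):
        STORE.delete(self._key())
        self._reply(200)

    # BaseHTTPRequestHandler dispatches on "do_" + method name, so custom
    # HTTP methods like LOCK and UNLOCK only need a matching method here.
    def do_LOCK(self):
        acquired, info = STORE.lock(self._key(), self._body())
        self._reply(200 if acquired else 423, info)

    def do_UNLOCK(self):
        STORE.unlock(self._key())
        self._reply(200)


def serve(port=5000):
    """Run the sketch server until interrupted (port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), StateHandler).serve_forever()
```

To try it, call serve() and point the address, lock_address and unlock_address settings of terraform's http backend at the same URL on that port. Keeping the storage logic in StateStore, separate from the HTTP handler, means the locking behaviour can be exercised without starting a server at all.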
