Should SRE management hold blameless retrospectives for mistakes?

One of the core tenets of Site Reliability Engineering (SRE) is that blameless postmortems / retrospectives should be held for on-call incidents. It's part of the continuous improvement process where we learn from what went wrong and try to create processes to ensure it doesn't happen again. Very explicitly, it is not about blaming anyone for an error -- the worst personal outcome for an engineer should be a decision that perhaps we failed to train them adequately. A simple example: if a system broke because it ran out of disk, you might determine in the retrospective that it would be a super good idea to have an alert for low disk space fire well before the system breaks, so you can intervene. It occurs to me that I've never seen the same process used for SRE management though, and that seems like an obvious gap to me now. Surely the same process of asking what went wrong and working out what mechanisms could be created to ensure that we're at least making new mistakes next time would be a good idea? Yet I've never seen an SRE management team willing to actually hold itself to…

Folsom Dev Summit sessions

  • Post category: OpenStack

I thought I should write up the dev summit sessions I am hosting now that the program is starting to look solid. This is mostly for my own benefit, so I have a solid understanding of where to start these sessions off. Both are short brainstorm sessions, so I am not intending to produce slide decks or anything like that. I just want to make sure there is something to kick discussion off.

Image caching, where to from here (nova hypervisors)

As of Essex, libvirt has an image cache to speed startup of new instances. This cache stores images directly from glance, as well as resized images. There is a periodic task which cleans up images in the cache which are no longer needed. The periodic task can also optionally detect images which have become corrupted on disk. So first off, do we want to implement this for other hypervisors as well? As mentioned in a recent blog post, I'd like to see the image cache manager become common code and have all the hypervisors deal with this in exactly the same manner -- that makes it easier to document, and means that on-call operations people don't need to determine…
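To make the periodic task described above a bit more concrete, here is a minimal sketch of the idea in Python. It is not nova's actual implementation: the function names, the `in_use` set, the `known_checksums` mapping and the use of SHA-1 are all assumptions made for illustration.

```python
import hashlib
import os
import time


def sha1_of(path):
    """Return the hex SHA-1 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def clean_image_cache(cache_dir, in_use, known_checksums=None,
                      min_age_seconds=0, now=None):
    """One pass of a hypothetical cache cleanup task.

    cache_dir        directory holding cached images
    in_use           set of file names still needed by some instance
    known_checksums  optional {file name: expected SHA-1} for corruption checks
    min_age_seconds  unused images younger than this are left alone

    Returns (removed, corrupted) as two lists of file names.
    """
    now = time.time() if now is None else now
    known_checksums = known_checksums or {}
    removed, corrupted = [], []
    for name in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, name)
        if not os.path.isfile(path):
            continue
        if name in in_use:
            # Still needed: optionally verify it has not been corrupted.
            expected = known_checksums.get(name)
            if expected is not None and sha1_of(path) != expected:
                corrupted.append(name)
            continue
        # No longer needed: remove it once it is old enough.
        if now - os.path.getmtime(path) >= min_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed, corrupted
```

The sketch only reports corrupted images rather than deleting them, on the assumption that an operator would want to decide what happens to an image that running instances still depend on.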

Reflecting on Essex

  • Post category: OpenStack

This post is kind of long, and a little self-indulgent. However, I really wanted to spend some time thinking about what I did for the Essex release cycle, and what I want to do for the Folsom release. I spent Essex mostly hacking on things in isolation, except for when Padraig Brady and I were hacking in a similar space. I'd like to collaborate more for Folsom, and I'm hoping talking in public about what I'm interested in doing might help with that. I came relatively late to the Essex development cycle, having never even heard of OpenStack before joining Canonical. We can talk some other time about how I'd worked in the cloud space for six years and yet wasn't aware of the open source implementations. My initial introduction to OpenStack was being paged for compute nodes which were continually running out of disk. I googled around a bit and discovered that cached images for instances were never cleaned up (to start an instance, an image is fetched from glance, possibly has its format converted, is resized, and then an instance is started with the resulting image; none of those images were ever cleaned up). I filed bug…

Further adventures with base images in OpenStack

  • Post category: OpenStack

I was bored over the New Year's weekend, so I figured I'd have a go at implementing image cache management as discussed previously. I actually have an implementation of about 75% of that blueprint now, but it's not ready for prime time yet. The point of this post is more to document some stuff I learnt about VM startup along the way so I don't forget it later. So, you want to start a VM on a compute node. Once the scheduler has selected a node to run the VM on, the next step is the compute instance on that machine starting the VM up. First the specified disk image is fetched from your image service (in my case glance), and placed in a temporary location on disk. If the image is already a raw image, it is then renamed to the correct name in the instances/_base directory. If it isn't a raw image, then it is converted to raw format, and that converted file is put in the right place. Optionally, the image can be extended to a specified size as part of this process. Then, depending on whether you have copy on write (COW) images turned on or…
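The fetch, rename-or-convert and optional extend steps described above can be sketched roughly as follows. This is an illustration only, not the nova code: `prepare_base_image` and its parameters are hypothetical names, the `convert_to_raw` callable stands in for whatever conversion tool is actually used, and extending the file with `os.truncate` is a simplification that only makes sense for raw images.

```python
import os
import shutil


def prepare_base_image(fetched_path, image_format, base_dir, image_name,
                       convert_to_raw, extend_to_bytes=None):
    """Place an already-fetched image into base_dir as a raw file.

    fetched_path     temporary file the image service download was written to
    image_format     format reported for the image, e.g. "raw" or "qcow2"
    base_dir         the _base directory for this compute node
    image_name       the name the image should have inside base_dir
    convert_to_raw   hypothetical callable(src, dst) that writes dst as raw
    extend_to_bytes  optional size to grow the raw image to
    """
    os.makedirs(base_dir, exist_ok=True)
    target = os.path.join(base_dir, image_name)

    if image_format == "raw":
        # Already raw: just move it into place under the right name.
        shutil.move(fetched_path, target)
    else:
        # Not raw: convert, put the result in place, drop the temporary file.
        convert_to_raw(fetched_path, target)
        os.remove(fetched_path)

    # Optionally extend the image, never shrinking it.
    if extend_to_bytes is not None and extend_to_bytes > os.path.getsize(target):
        os.truncate(target, extend_to_bytes)

    return target
```

Passing the converter in as a callable keeps the sketch independent of any particular tool, and makes the branch between the raw and non-raw cases easy to exercise in isolation.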

OpenStack compute node cleanup

  • Post category: OpenStack

I've never used OpenStack before, which I imagine is similar to many other people out there. It's actually pretty cool, although I encountered a problem the other day that I think is worthy of some more documentation. OpenStack runs virtual machines for users, in much the same manner as Amazon's EC2 system. These instances are started with a base image, and then copy on write is used to write differences for the instance as it changes stuff. This makes sense in a world where a given machine might be running more than one copy of the instance. However, I encountered a compute node which was running low on disk. This is because there is currently nothing which cleans up these base images, so even if none of the instances on a machine require that image, and even if the machine is experiencing disk stress, the images still hang around. There are a few blog posts out there about this, but nothing really definitive that I could find. I've filed a bug asking for the Ubuntu package to include some sort of cleanup script, and interestingly that led me to learn that there are plans for a pretty comprehensive image management…

MySQL Users Conference

  • Post category: Mysql

Well, they're definitely thinking about getting started. Like last year I caught the VTA down -- it's hard to beat a $1.75 trip without having to worry about traffic. Registration wasn't as smooth this year as last; for example, I didn't get my free book (there didn't seem to be any attempt to hand those out to speakers). Whatever. I'm now waiting for the replication talk to start.

Managing MySQL the Slack Way: How Google Deploys New MySQL Servers

  • Post category: Mysql

I'll be presenting about Slack (the open sourced toolkit we use for deploying software configuration) at the MySQL user's conference in Santa Clara in late April. The talk will focus on the interesting aspects of Slack as it relates to MySQL and should be fun. A DBA mate of mine is gonna present with me, so it should be a barrel of laughs.

Thoughts on the first day of the MySQL user’s conference

  • Post category: Mysql

So, I attended the first day of the MySQL user's conference yesterday, which was the tutorial day. Overall I was fairly impressed. Registration was easy, the rooms the presentations are given in are comfortable, and the PA system seemed to work after some initial problems in the morning tutorial I attended. The conference center seems to be big on retirees hanging around, which I thought was weird. Each room comes with a little old lady, whose job appears to be to read a novel at the door. I really have no idea what else they were achieving. They seemed to be having fun though. I did find it a bit odd that the only drinks provided by the catering staff during the day were acidic, and most of them caffeinated. For example, we had choices between coffee, tea, soda water, coke, diet coke, pepsi and diet pepsi. Some fruit juice or even plain water would have been a nice change by the end of the day. The food was good, unless you're a vegan like Stewart, in which case the catering staff looked confused and had to go off and get him something special (which didn't look all that special…
