A reading group of managers at work has been reading this book, except for the last chapter, which we were left to read by ourselves. Overall, the book is interesting and very readable. It's a little dated (it gets excited about the invention of email, and uses some unfortunate gender pronouns), but if you can get past those minor things there is a lot of wise advice here. I'm not sure I agree with 100% of it, but I do think the vast majority is of interest. A well-written book that I'd recommend to new managers.
If I had to guess what would be a controversial topic from the mid-cycle meetup, it would have to be this slots proposal. I was actually in a Technical Committee meeting when this proposal was first made, but I’m told there were plenty of people in the room keen to give this idea a try. Since the mid-cycle Joe Gordon has written up a more formal proposal, which can be found at https://review.openstack.org/#/c/112733.
If you look at the last few Nova releases, core reviewers have been drowning under code reviews, so we need to control the review workload. What is currently happening is that everyone throws up their thing into Gerrit, and then each core tries to identify the important things and review them. There is a list of prioritized blueprints in Launchpad, but it is not used much as a way of determining what to review. The result of this is that there are hundreds of reviews outstanding for Nova (500 when I wrote this post). Many of these will get a review, but it is hard for authors to get two cores to pay attention to a review long enough for it to be approved and merged.
If we could rate limit the number of proposed reviews in Gerrit, then cores would be able to focus their attention on the smaller number of outstanding reviews, and land more code. Because each review would merge faster, we believe this rate limiting would help us land more code rather than less, as our workload would be better managed. You could argue that this will mean we just say ‘no’ more often, but that’s not the intent, it’s more about bringing focus to what we’re reviewing, so that we can get patches through the process completely. There’s nothing more frustrating to a code author than having one +2 on their code and then hitting some merge freeze deadline.
The proposal is therefore to designate a number of blueprints that can be under review at any one time. The initial proposal was for ten, and the term ‘slot’ was coined to describe the available review capacity. If your blueprint was not allocated a slot, then it would either not be proposed in Gerrit yet, or if it was it would have a procedural -2 on it (much like code reviews associated with unapproved specifications do now).
The number of slots is arbitrary at this point. Ten is our best guess of how much we can dilute cores' focus without losing efficiency. We would tweak the number as we gained experience if we went ahead with this proposal. Remember, too, that a slot isn't always a single code review. If the VMware refactor was in a slot, for example, we might find that there were ten code reviews associated with that single slot.
How do you determine what occupies a review slot? The proposal is to groom the list of approved specifications more carefully. We would collaboratively produce a ranked list of blueprints in the order of their importance to Nova and OpenStack overall. As slots become available, the next highest ranked blueprint with code ready for review would be moved into one of the review slots. A blueprint would be considered ‘ready for review’ once the specification is merged, and the code is complete and ready for intensive code review.
What happens if code is in a slot and something goes wrong? Imagine if a proposer goes on vacation and stops responding to review comments. If that happened we would bump the code out of the slot, but would put it back on the backlog in the location dictated by its priority. In other words there is no penalty for being bumped, you just need to wait for a slot to reappear when you’re available again.
We also talked about whether we were requiring specifications for changes which are too simple. If something is relatively uncontroversial and simple (a better tag for internationalization for example), but not a bug, it falls through the cracks of our process at the moment and ends up needing to have a specification written. There was talk of finding another way to track this work. I’m not sure I agree with this part, because a trivial specification is a relatively cheap thing to do. However, it’s something I’m happy to talk about.
We also know that Nova needs to spend more time paying down its accrued technical debt, which you can see in the huge number of bugs we have outstanding at the moment. There is no shortage of people willing to write code for Nova, but there is a shortage of people fixing bugs and working on strategic things instead of new features. If we could reserve slots for technical debt, it would help us get people working on those aspects, because they wouldn't spend time on a less glamorous problem only to discover they can't even get their code reviewed. We even talked about having an alternating focus for Nova releases: one release focused on paying down technical debt and stability, and the next focused on new features. The Linux kernel does something quite similar to this and it seems to work well for them.
Using slots would allow us to land more valuable code faster. Of course, it also means that some patches will get dropped on the floor, but if the system is working properly, those features will be ones that aren’t important to OpenStack. Considering that right now we’re not landing many features at all, this would be an improvement.
This proposal is obviously complicated, and everyone will have an opinion. We haven't fully thought through all the mechanics yet, and it's certainly not a done deal at this point. The ranking process seems to be the most contentious point. We could encourage the community to help us rank things by priority, but it's not clear how that process would work. Regardless, I feel we need to be more systematic about what code we're trying to land. It's embarrassing how little has landed in Juno for Nova, and we need to work on that. I would like to continue discussing this as a community, to make sure that we end up with something that works well and that everyone is happy with.
This series is nearly done, but in the next post I’ll cover the current status of the nova-network to neutron upgrade path.
I was bored over the New Years weekend, so I figured I'd have a go at implementing image cache management as discussed previously. I actually have an implementation of about 75% of that blueprint now, but it's not ready for prime time yet. The point of this post is more to document some stuff I learnt about VM startup along the way, so I don't forget it later.
So, you want to start a VM on a compute node. Once the scheduler has selected a node to run the VM on, the next step is for the compute service on that machine to start it up. First the specified disk image is fetched from your image service (in my case glance) and placed in a temporary location on disk. If the image is already a raw image, it is then renamed to the correct name in the instances/_base directory. If it isn't a raw image, it is converted to raw format, and that converted file is put in the right place. Optionally, the image can be extended to a specified size as part of this process.
Then, depending on if you have copy on write (COW) images turned on or not, either a COW version of the file is created inside the instances/$instance/ directory, or the file from _base is copied to instances/$instance.
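You can reproduce those convert, extend, and COW steps by hand with qemu-img to see what ends up on disk. The file names and sizes here are entirely illustrative, not nova's actual values:

```shell
# stand-in for the image fetched from glance (a sparse 8MB file)
truncate -s 8M fetched.img

# convert to raw, as happens for non-raw images destined for _base
qemu-img convert -O raw fetched.img base.img

# optionally extend the base image to the requested disk size
qemu-img resize -f raw base.img 16M

# create the per-instance COW overlay backed by the base image
qemu-img create -f qcow2 \
    -o backing_file="$PWD/base.img",backing_fmt=raw disk.qcow2
```

Note that after the resize step, base.img no longer matches the bytes glance served, which is relevant to the checksum problem below.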
This has a side effect that had me confused for quite a while yesterday: the checksums, and even the file sizes, stored in glance are not reliable indicators of base image corruption. Most of my confusion was because image files in glance are immutable, so how could they differ from what's on disk? The other problem was that the images I was using on my development machine were raw images, so the checksums did work. It was only when I moved to a slightly more complicated environment that I had enough data to work out what was happening.
We therefore have a problem for that blueprint. We can't use the checksums from glance as a reliable indicator of whether something has gone wrong with the base image. I need to come up with something nicer. What this probably means for the first cut of the code is that checksums will only be verified for raw images which weren't extended, but I haven't written that code yet.
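That "first cut" check might look something like the sketch below. The helper names and flags here are my invention, not the blueprint's actual code; the point is just that the glance checksum is only meaningful when the on-disk base image is byte-identical to what glance served, i.e. raw and never extended:

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Checksum a file incrementally (glance stores MD5 checksums)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def base_image_ok(path, glance_checksum, was_converted, was_extended):
    """Return True/False if the checksum is trustworthy, None if it isn't.

    Converted or extended images no longer match the bytes glance served,
    so their checksums tell us nothing about corruption.
    """
    if was_converted or was_extended:
        return None  # can't tell; the glance checksum no longer applies
    return file_md5(path) == glance_checksum
```

The three-valued return is deliberate: "we don't know" is a different answer from "the image is corrupt", and conflating them would cause false alarms for every non-raw image.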
So, there we go.
I've never used OpenStack before, which I imagine is true of many other people out there. It's actually pretty cool, although I encountered a problem the other day that I think is worthy of some more documentation. OpenStack runs virtual machines for users, in much the same manner as Amazon's EC2 system. These instances are started from a base image, and then copy on write is used to record the differences as the instance changes things. This makes sense in a world where a given machine might be running more than one instance from the same base image.
However, I encountered a compute node which was running low on disk. This is because there is currently nothing which cleans up these base images, so even if none of the instances on a machine require a given image, and even if the machine is experiencing disk stress, the images still hang around. There are a few blog posts out there about this, but nothing really definitive that I could find. I've filed a bug asking for the Ubuntu package to include some sort of cleanup script, and interestingly that led me to learn that there are plans for a pretty comprehensive image management system. Unfortunately, it doesn't seem that anyone is working on this at the moment. I would offer to lend a hand, but it's not clear to me as an OpenStack n00b where I should start. If you read this and have some pointers, feel free to contact me.
Anyways, we still need to clean up that node experiencing disk stress. It turns out that nova uses qemu for its copy on write disk images, so we can ask qemu-img which base images are in use. It goes something like this:
$ cd /var/lib/nova/instances
$ find -name "disk*" | xargs -n1 qemu-img info | grep backing | \
    sed -e 's/.*file: //' -e 's/ .*//' | sort | uniq > /tmp/inuse
/tmp/inuse will now contain a list of the images in _base that are in use at the moment. Now you can change to the base directory, which defaults to /var/lib/nova/instances/_base and do some cleanup. What I do is I look for large image files which are several days old. I then check if they appear in that temporary file I created, and if they don’t I delete them.
I’m sure that this could be better automated by a simple python script, but I haven’t gotten around to it yet. If I do, I will be sure to mention it here.
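As a starting point, the decision logic for such a script could be as simple as the sketch below. The stale_base_images helper and its three-day threshold are my invention; the in_use set is assumed to come from the /tmp/inuse list produced by the qemu-img pipeline above:

```python
def stale_base_images(base_files, in_use, now, min_age_secs=3 * 86400):
    """Pick the _base images that look safe to remove.

    base_files: dict mapping image path -> mtime (seconds since the epoch)
    in_use:     set of backing file paths still referenced by instance
                disks (e.g. the output of the qemu-img pipeline)
    now:        current time, passed in so the logic is testable

    Returns a sorted list of deletion candidates: files that no instance
    references and that are older than min_age_secs.
    """
    return sorted(
        path
        for path, mtime in base_files.items()
        if path not in in_use and now - mtime > min_age_secs
    )
```

Keeping the logic pure like this (paths and mtimes in, candidates out) means the risky parts, walking _base and actually calling os.unlink, stay separate and the "should we delete this?" decision can be tested without touching a real compute node.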
This book by Michael Lopp, a long-time Apple engineering manager and startup employee, is a guide to how to manage geeks. That wasn't really what I was expecting. I was hoping for the inverse: a book about how to be a geek who has to deal with management. This book helps with that by offering the opposite perspective, but I'd still like to see a book from my direction.
The book is well written, in a conversational and sometimes profane manner (a comment I see others make about his other book "Managing Humans"). I think that's ok in this context, where it feels as if Michael is having a personal conversation with you, the reader. An overly formal tone would make the content much more boring, and it's already dry enough.
I’m not sure I agree with everything said in the book, but the first half resonated especially strongly with me.
The Boner Theory of Economics also predicts that in the long run — perhaps in a few hundred years — the military will be 100% gay men. This is the best case scenario for taxpayers because it will keep down costs, and recruiting will be easy.
Recruiter: “We can’t afford to give you body armor, but you’ll be surrounded by young, vital men who are a long way from home. Would you like a tour of the showers?”
Recruit: “Yes, but I can’t stand up right away.”
An Alfresco employee (Alfrescoer?) posts about some of the interesting things they've learnt about being an open source company along the way. The comment about PR being more effective than cold sales calls is especially interesting. I argued for years at TOWER that we should be paying more attention to people searching for our product, instead of paying pretty boys to drive sports cars to sales presentations that everyone secretly hates. If your product has a good reputation and people can find it online, surely the customers will come to you?
I’m very proud of myself at the moment… For the first time in a very long time my TODO list is short enough to fit vertically on a 17 inch LCD monitor. This might not sound like a big deal to other people, but it is to me.
Normally what I do is I have all of the things I need to do in tasks on my iPaq, and then have a classification called “today” which is all the stuff I should try to work on today. I do this because otherwise it is too easy to become daunted by the seemingly endless list of things to get done.
Things have two ways of getting onto the today list… If I am in interrupt mode, as I have been recently, then it is a list of things which really, truly need to be done today. If I am in polled mode, then it is the four or five things I can reasonably expect to get done today.
I often don’t manage to clear the list out daily, but it gives me a more manageable feel for what needs doing.