Folsom Dev Summit sessions

I thought I should write up the dev summit sessions I am hosting now that the program is starting to look solid. This is mostly for my own benefit, so I have a solid understanding of where to start these sessions off. Both are short brainstorm sessions, so I am not intending to produce slide decks or anything like that. I just want to make sure there is something to kick discussion off.

Image caching, where to from here (nova hypervisors)

As of Essex, the libvirt driver has an image cache to speed up the startup of new instances. This cache stores images fetched directly from glance, as well as resized versions of them. A periodic task cleans up images in the cache which are no longer needed, and can optionally detect images which have become corrupted on disk.

So first off, do we want to implement this for other hypervisors as well? As mentioned in a recent blog post, I’d like to see the image cache manager become common code and have all the hypervisors deal with this in exactly the same manner. That makes it easier to document, and means that on-call operations people don’t need to determine what hypervisor a compute node is running before starting to debug. However, that requires the other hypervisor implementations to change how they stage images for instance startup, and I think it bears further discussion.

Additionally, the blueprint (https://blueprints.launchpad.net/nova/+spec/nova-image-cache-management) proposed that popular / strategic images could be pre-cached on compute nodes. Is this something we still want to do? What factors do we want to use for the reference implementation? I have a few ideas here that are listed in the blueprint, but most of them require talking to glance to implement. There is some hesitance about adding glance calls to a periodic task, because in a deployment using keystone that would require an admin token in the nova configuration file. Is there a better way to do this, or is it ok to rely on glance in a periodic task?

Ops pain points (nova other)

Apart from my own ideas (better instance logging for example), I’m very interested in hearing from other people about what we can do to make nova easier for ops people to run. This is especially true for things that are relatively easy to implement and that we can get done in Folsom. The blueprint for deployer friendly configuration files is a good example of a change which doesn’t look too hard to implement, but that would make the world a better place for opsen. There are many other blueprints in this space as well.

What else can we be doing to make life better for opsen? I’m especially interested in getting people who actually run OpenStack in the wild into the room to tell us what is painful for them at the moment.

Reflecting on Essex

This post is kind of long, and a little self indulgent. However, I really wanted to spend some time thinking about what I did for the Essex release cycle, and what I want to do for the Folsom release. I spent Essex mostly hacking on things in isolation, except for when Padraig Brady and I were hacking in a similar space. I’d like to collaborate more for Folsom, and I’m hoping talking about what I’m interested in doing in public might help with that.

I came relatively late to the Essex development cycle, having never even heard of OpenStack before joining Canonical. We can talk about how I’d worked in the cloud space for six years and yet wasn’t aware of the open source implementations at some other time.

My initial introduction to OpenStack was being paged for compute nodes which were continually running out of disk. I googled around a bit and discovered that cached images for instances were never cleaned up. To start an instance, an image is fetched from glance, possibly has its format converted, is resized, and then an instance is started with the resulting image. None of those images were ever being cleaned up. I filed bug 904532 as my absolute first interaction with the OpenStack community. Scott Moser kindly pointed me at the blueprint for how to actually fix the problem.

(Remind me if Phil Day comes to the OpenStack developer summit that I should sit down with him at some point and see how close what was actually implemented got to what he wrote in that blueprint. I suspect we’ve still got a fair way to go, but I’ll talk more about that later in this post).

This was a pivotal moment. I’d just spent the last six years writing python code to manage largish cloud clusters, and here was a bug which was hurting me in a python package intended to manage clusters very similar to those I had been running. I should just fix the bug, right?

It turns out that the OpenStack core developers are super easy to work with. I’d say that the code review process certainly feels like it was modelled on Google’s, but in general the code reviewers are nicer with their comments than what I’m used to. This makes it much easier to motivate yourself to go and spend some more time hacking than a deeply negative review would. I think Vish is especially worthy of a shout out as being an amazing person to work with. He’s helpful, patient, and very smart.

In the end I wrote the image cache manager which ships in Essex. It’s not perfect, but it’s a lot better than what came before, and it’s a good basis to build on. There is some remaining tech debt for image cache management which I intend to work on for Folsom. First off, the image cache only works for libvirt instances at the moment. I’d like to pull all the other hypervisors into line as best as possible. There are hooks in the virtualization driver for this, but no one has started this work as far as I am aware. To be completely honest, I’d like to see the image cache manager become common code and have all the hypervisors deal with this in exactly the same manner. That makes it easier to document, and means that on-call operations people don’t need to determine what hypervisor a compute node is running before starting to debug. This is something I very much want to sit down with other nova developers and talk about at the summit.

The next step for image cache management is tracked in a very bare bones blueprint. The original blueprint envisaged that it would be desirable to pre-cache some images on all nodes. For example, a cloud host might want to offer slightly faster startup times for some images by ensuring they are pre-cached. I’ve been thinking about this a lot, and I can see other use cases here as well. For example, if you have mission critical instances and you wanted to tolerate a glance failure, then perhaps you want to pre-cache a class of images that serve those mission critical instances. The intention is to provide an interface and default implementation for the pre-caching logic, and then let users go wild working out their own requirements.
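To give a feel for what "an interface and default implementation" could mean here, this is a minimal sketch. All of the names in it (`PrecachePolicy`, `NoPrecache`, `PropertyPrecache`, and the `precache` image property) are hypothetical and exist only for illustration; nothing like this is in nova today, and the image dicts are a simplified stand-in for whatever glance returns.

```python
from abc import ABC, abstractmethod


class PrecachePolicy(ABC):
    """Hypothetical interface: decides which images a node should hold."""

    @abstractmethod
    def images_to_precache(self, images):
        """Return the ids of the images that should be pre-cached."""


class NoPrecache(PrecachePolicy):
    """A possible default: never pre-cache anything."""

    def images_to_precache(self, images):
        return []


class PropertyPrecache(PrecachePolicy):
    """Pre-cache images carrying a given property (assumed: precache=true)."""

    def __init__(self, key="precache", value="true"):
        self.key = key
        self.value = value

    def images_to_precache(self, images):
        # Each image is assumed to be a dict with 'id' and 'properties' keys.
        return [image["id"] for image in images
                if image.get("properties", {}).get(self.key) == self.value]
```

A deployer with different requirements (mission critical images, most popular images, and so on) would supply their own subclass instead of changing the cache manager itself.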

I suspect the hardest bit of the pre-caching will be reducing the interactions with glance. The current feeling is that calling glance from a periodic task is a bit scary, and has been actively avoided for Essex. This is especially true if Keystone is enabled, as the periodic task won’t have an admin context unless we pull that from the config file. However, if you’re trying to determine what images are mission critical, then you really need to talk to glance. I guess another option would be to have a table of such things in nova’s database, but that feels wrong to me. We’re going to have to talk about this bit more.

(It would also be interesting to talk about the relative priority of instances. If a cluster is experiencing outages, then perhaps some customers would pay more to have their instances be the last killed off or something. Or perhaps I have instances which are less critical than others, so I want the cluster to degrade in an understood manner.)

That leads logically onto a scheduler change I would like to see. If I have a set of compute nodes I know already have the image for a given instance, shouldn’t I prefer to start instances on those nodes instead of fetching the image to yet more compute nodes? In fact, if I already have a correctly resized COW base image for an instance on a given node, then it would make sense to run a new instance on that node as well. We need to be careful here, because you wouldn’t want to run all of a given class of instance on a small set of compute nodes, but if the image was something like a default Ubuntu image, then it would make sense. I’d be interested in hearing what other people think of doing something like this.
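Here is a minimal sketch of what such a preference might look like, including the "don't pile everything onto a few nodes" caveat. It is not nova scheduler code: the `HostState` record, its `cached_images` and `instances_by_image` fields, and the `max_share` cap are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class HostState:
    """Hypothetical view of a compute node as seen by a scheduler."""
    name: str
    cached_images: set = field(default_factory=set)          # image ids in _base
    instances_by_image: dict = field(default_factory=dict)   # image id -> count


def pick_host(hosts, image_id, total_for_image, max_share=0.5):
    """Prefer a host that already caches image_id, unless placing the new
    instance there would give it more than max_share of that image's
    instances across the cluster."""
    def score(host):
        cached = image_id in host.cached_images
        running = host.instances_by_image.get(image_id, 0)
        # Share of this image's instances the host would hold after placement.
        share = (running + 1) / (total_for_image + 1)
        crowded = total_for_image > 0 and share > max_share
        # Tuples compare element by element: a cached, uncrowded host wins,
        # then the host running the fewest copies of the image.
        return (cached and not crowded, -running)

    return max(hosts, key=score)
```

For something like a default Ubuntu image the cap could be set to 1.0, which turns the function into a plain "go where the image already is" preference.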

Another thing I’ve tried to focus on for Essex is making OpenStack easier for operators to run. That started off relatively simply, by adding an option for log messages to specify what instance a message relates to. This means that when a user queries the state of their instance, the admin can now just grep for the instance UUID, and run from there. It’s not perfect yet, in that not all messages use this functionality, but that’s some tech debt that I will take on in Folsom. If you’re a nova developer, then please pass instance= in your log messages where relevant!

This logging functionality isn’t perfect, because if you only have the instance UUID in the method you’re writing, it won’t work. It expects full instance dicts because of the way the formatting code works. This is kind of ironic in that the default logging format only includes the UUID. In Folsom I’ll also extend this code so that the right thing happens with UUIDs as well.
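The idea of accepting either a full instance dict or a bare UUID can be sketched with the standard library's `logging.LoggerAdapter`. This is not nova's logging code: the `InstanceAdapter` class and the `[instance: ...]` prefix are assumptions chosen for illustration.

```python
import logging


class InstanceAdapter(logging.LoggerAdapter):
    """Hypothetical adapter: prefixes messages with the instance UUID."""

    def process(self, msg, kwargs):
        # Remove our custom keyword so the underlying logger never sees it.
        instance = kwargs.pop("instance", None)
        if isinstance(instance, dict):
            uuid = instance.get("uuid")
        else:
            uuid = instance  # assumed to be a bare UUID string, or None
        if uuid:
            msg = "[instance: %s] %s" % (uuid, msg)
        return msg, kwargs


LOG = InstanceAdapter(logging.getLogger("demo"), {})
LOG.warning("Image fetch failed", instance={"uuid": "1234-abcd"})
LOG.warning("Image fetch failed", instance="1234-abcd")
```

Both calls produce the same prefixed message, so an admin grepping for the UUID finds the line regardless of which form the developer had to hand.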

Another simple logging tweak I wrote is that tracebacks now have the time and instance included in them. This makes it much easier for admins to determine the context of a traceback in their logs. It should be noted that both of these changes were relatively trivial, but trivial things can often make life much easier for others.

There are two sessions at the Folsom dev summit talking about how to make OpenStack easier for operators to run. One was from me, and the other is from Duncan McGreggor. Neither has been accepted yet, but if I notice that Duncan’s was accepted I’ll drop mine. I’m very very interested in what operations staff feel is currently painful, because having something which is easy to scale and manage is vital to adoption. This is also the core of what I did at Google, and I feel I can make a real contribution here.

I know I’ve come relatively late to the OpenStack party, but there’s heaps more to do here and I’m super enthused to be working on code that I can finally show people again.

Further adventures with base images in OpenStack

I was bored over the New Year’s weekend, so I figured I’d have a go at implementing image cache management as discussed previously. I actually have an implementation of about 75% of that blueprint now, but it’s not ready for prime time yet. The point of this post is more to document some stuff I learnt about VM startup along the way so I don’t forget it later.

So, you want to start a VM on a compute node. Once the scheduler has selected a node to run the VM on, the next step is for the compute service on that machine to start the VM up. First the specified disk image is fetched from your image service (in my case glance), and placed in a temporary location on disk. If the image is already a raw image, it is then renamed to the correct name in the instances/_base directory. If it isn’t a raw image then it is converted to raw format, and that converted file is put in the right place. Optionally, the image can be extended to a specified size as part of this process.

Then, depending on whether you have copy on write (COW) images turned on or not, either a COW version of the file is created inside the instances/$instance/ directory, or the file from _base is copied to instances/$instance.
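The steps above can be sketched as a function that only builds the commands involved, without running anything. This is a simplification under stated assumptions rather than what nova actually executes: the `staging_plan` function, its parameters, and the `disk` file name are hypothetical, while the `qemu-img convert`, `resize` and `create` subcommands are real.

```python
import os


def staging_plan(fetched, image_format, base, instance_dir,
                 size=None, use_cow=True):
    """Return the commands (as argv lists) that would stage one instance disk.

    fetched:      temporary file downloaded from the image service
    base:         target path inside instances/_base
    instance_dir: the instances/$instance directory
    """
    plan = []
    if image_format == "raw":
        # Already raw: just move it into place.
        plan.append(["mv", fetched, base])
    else:
        # Anything else is converted to raw on the way into _base.
        plan.append(["qemu-img", "convert", "-O", "raw", fetched, base])
    if size is not None:
        # Optionally extend the base image to the requested size.
        plan.append(["qemu-img", "resize", "-f", "raw", base, str(size)])
    disk = os.path.join(instance_dir, "disk")
    if use_cow:
        # COW: a qcow2 overlay whose backing file is the base image.
        plan.append(["qemu-img", "create", "-f", "qcow2",
                     "-o", "backing_file=%s,backing_fmt=raw" % base, disk])
    else:
        # No COW: the instance gets its own full copy.
        plan.append(["cp", base, disk])
    return plan
```

Note that both the convert and the resize steps change the bytes in _base relative to what the image service holds.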

This has a side effect that had me confused for a good while yesterday: the checksums, and even file sizes, stored in glance are not reliable indicators of base image corruption. Because the file in _base may have been converted to raw format or extended, it can legitimately differ from the image glance holds. Most of my confusion was because image files in glance are immutable, so how come they differed from what’s on disk? The other problem was that the images I was using on my development machine were raw images, and checksums did work. It was only when I moved to a slightly more complicated environment that I had enough data to work out what was happening.

We therefore have a problem for that blueprint. We can’t use the checksums from glance as a reliable indicator of whether something has gone wrong with the base image. I need to come up with something nicer. What this probably means for the first cut of the code is that checksums will only be verified for raw images which weren’t extended, but I haven’t written that code yet.
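A minimal sketch of that first cut might look like the following. It is not the code that ended up in nova; the function names and the `was_raw` / `was_extended` flags are hypothetical, and it assumes the checksum recorded by glance is an MD5 hex digest of the image file.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """MD5 hex digest of a file, read in chunks so large images fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_base_image(path, glance_checksum, was_raw, was_extended):
    """Return True or False when the image can be verified, and None when the
    on-disk file is not expected to match glance (converted or extended)."""
    if not was_raw or was_extended:
        return None
    return file_md5(path) == glance_checksum
```

Returning None rather than False for the unverifiable cases keeps "we can't tell" separate from "this image is corrupt", which matters if a periodic task is going to act on the result.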

So, there we go.

Openstack compute node cleanup

I’ve never used OpenStack before, which I imagine is similar to many other people out there. It’s actually pretty cool, although I encountered a problem the other day that I think is worthy of some more documentation. OpenStack runs virtual machines for users, in much the same manner as Amazon’s EC2 system. These instances are started with a base image, and then copy on write is used to write differences for the instance as it changes stuff. This makes sense in a world where a given machine might be running more than one instance of the same image.

However, I encountered a compute node which was running low on disk. This is because there is currently nothing which cleans up these base images, so even if none of the instances on a machine require that image, and even if the machine is experiencing disk stress, the images still hang around. There are a few blog posts out there about this, but nothing really definitive that I could find. I’ve filed a bug asking for the Ubuntu package to include some sort of cleanup script, and interestingly that led me to learn that there are plans for a pretty comprehensive image management system. Unfortunately, it doesn’t seem that anyone is working on this at the moment. I would offer to lend a hand, but it’s not clear to me as an OpenStack n00b where I should start. If you read this and have some pointers, feel free to contact me.

Anyways, we still need to clean up that node experiencing disk stress. It turns out that nova uses qemu for its copy on write disk images. We can therefore ask qemu-img which base images are still in use as backing files. It goes something like this:

    $ cd /var/lib/nova/instances
    # Record the backing file of every instance disk, one unique path per line.
    $ find . -name "disk*" | xargs -n1 qemu-img info | grep "^backing file:" | \
      sed -e 's/.*file: //' -e 's/ .*//' | sort -u > /tmp/inuse

/tmp/inuse will now contain a list of the images in _base that are in use at the moment. Now you can change to the base directory, which defaults to /var/lib/nova/instances/_base, and do some cleanup. I look for large image files which are several days old, check whether they appear in that temporary file I created, and delete them if they don’t.

I’m sure that this could be better automated by a simple Python script, but I haven’t gotten around to it yet. If I do, I will be sure to mention it here.
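As a starting point for such a script, here is a minimal sketch of the selection step only. It assumes the in-use list has the one-path-per-line format written to /tmp/inuse above; the `stale_base_images` function and its `min_age_days` and `min_size` thresholds are hypothetical, and it deliberately only reports candidates rather than deleting anything.

```python
import os
import time


def stale_base_images(base_dir, in_use, min_age_days=3, min_size=0, now=None):
    """Return paths in base_dir that are not in use, are at least
    min_age_days old (by mtime) and are at least min_size bytes."""
    now = time.time() if now is None else now
    # Compare by file name so relative and absolute paths both match.
    in_use_names = {os.path.basename(p.strip()) for p in in_use if p.strip()}
    cutoff = now - min_age_days * 86400
    stale = []
    for name in sorted(os.listdir(base_dir)):
        path = os.path.join(base_dir, name)
        if not os.path.isfile(path) or name in in_use_names:
            continue
        st = os.stat(path)
        if st.st_mtime <= cutoff and st.st_size >= min_size:
            stale.append(path)
    return stale


if __name__ == "__main__":
    with open("/tmp/inuse") as f:
        for path in stale_base_images("/var/lib/nova/instances/_base", f):
            print(path)
```

Printing the candidates and reviewing them by hand before removing anything seems the safer default on a node that is already under stress.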
