A first pass at glance replication

A few weeks back I was tasked with turning up a new OpenStack region. This region couldn’t share anything with existing regions because the plan was to test pre-release versions of OpenStack there, and if we shared something like glance then we would either have to endanger glance for all regions during testing, or not test glance. However, our users already have a favorite set of images uploaded to glance, and I really wanted to make it as easy as possible for them to use the new region — I wanted all of their images to magically just appear there. What I needed was some form of glance replication.

I’d sat in on the glance replication session at the Folsom OpenStack Design Summit. The NeCTAR use case at the bottom is exactly what I wanted, so it’s reassuring that other people wanted something like that too. However, no one was working on this feature. So I wrote it. In fact, because of the code review process I wrote it twice, but let’s not dwell on that too much.

So, as of change id I7dabbd6671ec75a0052db58312054f611707bdcf there is a very simple replicator script in glance/bin. It’s not perfect, and I expect it will need to be extended a bunch, but it’s a start at least, and I’m using it in production now so I am relatively confident it’s not totally wrong.


The replicator supports the following commands at the moment:

livecopy

glance-replicator livecopy fromserver:port toserver:port

    Load the contents of one glance instance into another.

    fromserver:port: the location of the master glance instance.
    toserver:port:   the location of the slave glance instance.

This is the main meat of the replicator: take a copy of the fromserver, and dump it onto the toserver. If you’re using Keystone, only images visible to the user running the replicator will be copied, and only images which are active on fromserver are copied across. The copy is done “on-the-wire”, so there are no large temporary files on the machine running the replicator to clean up.
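
To make that concrete, here’s a rough python sketch of the on-the-wire idea, using the requests library against the v1 images API. This is illustrative only: the real replicator handles authentication, pagination and error cases properly, and the token handling here is hand waved.

    import requests

    CHUNK = 64 * 1024  # stream in small chunks; nothing lands on local disk

    def livecopy(fromserver, toserver, token):
        """Stream every active image from fromserver to toserver."""
        headers = {'x-auth-token': token}

        def image_list(server):
            url = 'http://%s/v1/images/detail' % server
            return requests.get(url, headers=headers).json()['images']

        have = set(img['id'] for img in image_list(toserver))

        for img in image_list(fromserver):
            if img['status'] != 'active' or img['id'] in have:
                continue

            # Stream the image data from the master...
            data = requests.get(
                'http://%s/v1/images/%s' % (fromserver, img['id']),
                headers=headers, stream=True)

            # ... and POST it straight to the slave, with the metadata
            # carried in x-image-meta-* headers.
            meta = dict(headers)
            meta.update({
                'x-image-meta-id': img['id'],
                'x-image-meta-name': img['name'],
                'x-image-meta-disk_format': img['disk_format'],
                'x-image-meta-container_format': img['container_format'],
                'content-type': 'application/octet-stream'})
            requests.post('http://%s/v1/images' % toserver,
                          data=data.iter_content(CHUNK), headers=meta)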

dump

glance-replicator dump server:port path

    Dump the contents of a glance instance to local disk.

    server:port: the location of the glance instance.
    path:        a directory on disk to contain the data.

This does the same thing as livecopy, but dumps the contents of the glance server to a directory on disk instead. The dump includes both metadata and image data, so the directory is probably going to be quite large; be prepared.
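
If you’re curious what that looks like, a minimal sketch is below. I’m making up the .json / .img naming here; the real replicator’s on-disk layout may differ.

    import json
    import os
    import requests

    def dump(server, path, token):
        """Write each active image's metadata and data under path."""
        headers = {'x-auth-token': token}
        url = 'http://%s/v1/images/detail' % server
        for img in requests.get(url, headers=headers).json()['images']:
            if img['status'] != 'active':
                continue

            # The metadata lands in a small JSON file beside the image data.
            with open(os.path.join(path, '%s.json' % img['id']), 'w') as f:
                json.dump(img, f)

            # The image data itself is streamed to disk, and may be huge.
            data = requests.get(
                'http://%s/v1/images/%s' % (server, img['id']),
                headers=headers, stream=True)
            with open(os.path.join(path, '%s.img' % img['id']), 'wb') as f:
                for chunk in data.iter_content(64 * 1024):
                    f.write(chunk)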

load

glance-replicator load server:port path

    Load the contents of a local directory into glance.

    server:port: the location of the glance instance.
    path:        a directory on disk containing the data.

Load a directory created by the dump command into a glance server. dump / load was originally written because I had two glance servers that couldn’t talk to each other over the network for policy reasons. However, I could dump the data and move it to the destination network out of band. If you had a very large glance installation and were bringing up a new region at the end of a slow link, then this might be something you’d be interested in.
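
Load is just the reverse of the dump sketch above, again assuming my made up .json / .img layout:

    import glob
    import json
    import requests

    def load(server, path, token):
        """Upload every dumped image in path to a glance server."""
        for meta_file in glob.glob('%s/*.json' % path):
            with open(meta_file) as f:
                img = json.load(f)

            headers = {'x-auth-token': token,
                       'x-image-meta-id': img['id'],
                       'x-image-meta-name': img['name'],
                       'x-image-meta-disk_format': img['disk_format'],
                       'x-image-meta-container_format': img['container_format'],
                       'content-type': 'application/octet-stream'}

            # Stream the dumped image data back up to the server.
            with open(meta_file.replace('.json', '.img'), 'rb') as f:
                requests.post('http://%s/v1/images' % server,
                              data=f, headers=headers)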

compare

glance-replicator compare fromserver:port toserver:port

    Compare the contents of fromserver with those of toserver.

    fromserver:port: the location of the master glance instance.
    toserver:port:   the location of the slave glance instance.

What would a livecopy do? The compare command will show you the differences between the two servers, so it’s a bit like a dry run of the replication.
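
In other words, something morally equivalent to this sketch, which just diffs the sets of active image ids:

    import requests

    def compare(fromserver, toserver, token):
        """Print the images a livecopy would need to copy across."""
        headers = {'x-auth-token': token}

        def active(server):
            url = 'http://%s/v1/images/detail' % server
            images = requests.get(url, headers=headers).json()['images']
            return dict((img['id'], img['name']) for img in images
                        if img['status'] == 'active')

        src = active(fromserver)
        dst = active(toserver)
        for image_id in sorted(set(src) - set(dst)):
            print('missing from %s: %s (%s)'
                  % (toserver, image_id, src[image_id]))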

size

glance-replicator size server:port

    Determine the size of a glance instance if dumped to disk.

    server:port: the location of the glance instance.

The size command will tell you how much disk is going to be used by image data in either a dump or a livecopy. It doesn’t, however, know about redundancy costs in backends like swift, so it just gives you the raw number of bytes that would be written to the destination.
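
Conceptually it’s no more than this:

    import requests

    def size(server, token):
        """Total bytes of image data a dump or livecopy would move."""
        headers = {'x-auth-token': token}
        url = 'http://%s/v1/images/detail' % server
        images = requests.get(url, headers=headers).json()['images']
        # Raw byte count only: swift replica counts and filesystem
        # overheads aren't visible from here.
        return sum(img.get('size') or 0
                   for img in images if img['status'] == 'active')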


The glance replicator is very new code, so I wouldn’t be too surprised if there are bugs out there or obvious features that are lacking. For example, there is no support for SSL at the moment. Let me know if you have any comments or encounter problems using the replicator.

Further adventures with base images in OpenStack

I was bored over the New Years weekend, so I figured I’d have a go at implementing image cache management as discussed previously. I actually have an implementation of about 75% of that blueprint now, but it’s not ready for prime time yet. The point of this post is more to document some stuff I learnt about VM startup along the way so I don’t forget it later.

So, you want to start a VM on a compute node. Once the scheduler has selected a node to run the VM on, the next step is for the compute service on that machine to start it up. First, the specified disk image is fetched from your image service (in my case glance) and placed in a temporary location on disk. If the image is already a raw image, it is then renamed to the correct name in the instances/_base directory. If it isn’t a raw image, it is converted to raw format and that converted file is put in the right place. Optionally, the image can also be extended to a specified size as part of this process.
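
The convert and extend steps boil down to qemu-img invocations. This sketch is a simplification of what nova actually does (the real code paths have locking, caching and error handling), but the commands are the important bit:

    import os
    import subprocess

    def prepare_base_image(fetched, base, resize_gb=None):
        """Place a fetched image into _base as raw, optionally extending it."""
        info = subprocess.check_output(['qemu-img', 'info', fetched]).decode()
        if 'file format: raw' in info:
            # Raw images are simply renamed into place.
            os.rename(fetched, base)
        else:
            # Anything else is converted to raw first.
            subprocess.check_call(['qemu-img', 'convert', '-O', 'raw',
                                   fetched, base])
            os.unlink(fetched)

        # Optionally grow the image to the flavour's disk size.
        if resize_gb:
            subprocess.check_call(['qemu-img', 'resize', base,
                                   '%dG' % resize_gb])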

Then, depending on whether you have copy on write (COW) images turned on, either a COW version of the file is created inside the instances/$instance/ directory, or the file from _base is copied to instances/$instance.
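
The COW case is a qcow2 overlay whose backing file is the image in _base, something like this (note that recent qemu-img versions also want an explicit -F argument naming the backing file’s format):

    import shutil
    import subprocess

    def create_instance_disk(base, instance_disk, use_cow=True):
        """Create the per-instance disk from a _base image."""
        if use_cow:
            # Only the instance's changes are stored in the overlay; reads
            # of unchanged blocks fall through to the file in _base.
            subprocess.check_call(['qemu-img', 'create', '-f', 'qcow2',
                                   '-b', base, instance_disk])
        else:
            # No COW: the instance gets a full private copy of the base image.
            shutil.copyfile(base, instance_disk)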

This has a side effect that had me confused for quite a while yesterday: the checksums, and even file sizes, stored in glance are not reliable indicators of base image corruption, because those convert and extend steps mean the file in _base can legitimately differ from the image glance handed over. Most of my confusion was because image files in glance are immutable, so how come they differed from what’s on disk? It also didn’t help that the images I was using on my development machine were raw images, for which checksums did work. It was only when I moved to a slightly more complicated environment that I had enough data to work out what was happening.

We therefore have a problem for that blueprint. We can’t use the checksums from glance as a reliable indicator of whether something has gone wrong with the base image, so I need to come up with something nicer. What this probably means for the first cut of the code is that checksums will only be verified for raw images which weren’t extended, but I haven’t written that code yet.
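
For what it’s worth, the check itself is trivial once you know when it’s valid to run; glance’s checksum field is an md5 of the image data:

    import hashlib

    def base_checksum_ok(base_path, glance_checksum):
        """Check a _base file against glance's md5 checksum.

        Only meaningful for raw images which weren't extended; anything
        converted or resized legitimately differs from what glance stores.
        """
        md5 = hashlib.md5()
        with open(base_path, 'rb') as f:
            for chunk in iter(lambda: f.read(64 * 1024), b''):
                md5.update(chunk)
        return md5.hexdigest() == glance_checksum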

So, there we go.

OpenStack compute node cleanup

I’ve never used OpenStack before, which I imagine is similar to many other people out there. It’s actually pretty cool, although I encountered a problem the other day that I think is worthy of some more documentation. OpenStack runs virtual machines for users, in much the same manner as Amazon’s EC2 system. These instances are started from a base image, and then copy on write is used to record the differences for each instance as it changes things. This makes sense in a world where a given machine might be running more than one instance from the same base image.

However, I encountered a compute node which was running low on disk. This is because there is currently nothing which cleans up these base images, so even if none of the instances on a machine require a given image, and even if the machine is experiencing disk stress, the images still hang around. There are a few blog posts out there about this, but nothing really definitive that I could find. I’ve filed a bug asking for the Ubuntu package to include some sort of cleanup script, and interestingly that led me to learn that there are plans for a pretty comprehensive image management system. Unfortunately, it doesn’t seem that anyone is working on this at the moment. I would offer to lend a hand, but it’s not clear to me as an OpenStack n00b where I should start. If you read this and have some pointers, feel free to contact me.

Anyways, we still need to clean up that node experiencing disk stress. It turns out that nova uses qemu for its copy on write disk images, so we can ask qemu-img which base images are still in use. It goes something like this:

    $ cd /var/lib/nova/instances
    $ find . -name "disk*" | xargs -n1 qemu-img info | grep backing | \
      sed -e 's/.*file: //' -e 's/ .*//' | sort | uniq > /tmp/inuse

/tmp/inuse will now contain a list of the images in _base that are in use at the moment. Now you can change to the base directory, which defaults to /var/lib/nova/instances/_base, and do some cleanup. What I do is look for large image files which are several days old, check whether they appear in that temporary file I created, and delete them if they don’t.

I’m sure that this could be better automated by a simple python script, but I haven’t gotten around to it yet. If I do, I will be sure to mention it here.
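
In the meantime, here’s an untested sketch of roughly what I have in mind. The three day age threshold is arbitrary, and it only prints what it would remove until you trust it enough to swap in os.remove():

    import os
    import subprocess
    import time

    INSTANCES = '/var/lib/nova/instances'
    BASE = os.path.join(INSTANCES, '_base')
    MAX_AGE = 3 * 24 * 3600  # leave anything younger than three days alone

    def backing_files():
        """Collect the _base files the instance disks still reference."""
        inuse = set()
        for root, _, files in os.walk(INSTANCES):
            if root.startswith(BASE):
                continue
            for name in files:
                if not name.startswith('disk'):
                    continue
                info = subprocess.check_output(
                    ['qemu-img', 'info', os.path.join(root, name)]).decode()
                for line in info.splitlines():
                    if line.startswith('backing file:'):
                        inuse.add(line.split()[2])
        return inuse

    if __name__ == '__main__':
        inuse = backing_files()
        for name in os.listdir(BASE):
            path = os.path.join(BASE, name)
            if path in inuse:
                continue
            if time.time() - os.path.getmtime(path) < MAX_AGE:
                continue
            # Print first; swap in os.remove(path) once you trust it.
            print('would remove %s' % path)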