Exporting volumes from Cinder and re-creating COW layers

Today I wandered down a bit of a rat hole working out how to export data from OpenStack Cinder volumes when you don’t have admin permissions, and I thought it was worth documenting here so I remember it next time.

Let’s assume that you have a Cinder volume named “child1”, which is a 64GB volume originally cloned from “parent1”. parent1 is a 7.9GB VMDK, but the only way I can find to extract child1 is to convert it to a Glance image and then download the entire volume as a raw image. Something like this:

$ cinder upload-to-image $child1 "extract:$child1"

Where $child1 is the UUID of the Cinder volume. You then need to find the UUID of the image in Glance. The cinder upload-to-image command will have told you this, but you can also find it by searching Glance for your image named “extract:$child1”:

$ glance image-list | grep "extract:$child1"

You now need to watch that Glance image until its status is “active”. It will pass through states like “queued” and “uploading” first.
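
If you’re scripting this, a minimal polling loop looks something like the following, assuming the usual tabular output from the glance CLI and that $glance_uuid holds the image UUID:

$ while [ "$(glance image-show $glance_uuid | awk '$2 == "status" {print $4}')" != "active" ]; do
      sleep 30
  done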

Now you can download the image from Glance:

$ glance image-download --file images/$child1.raw --progress $glance_uuid

And then delete the intermediate glance image:

$ glance image-delete $glance_uuid

I have a bad sample script which does this in my junk code repository if that is helpful.

What you have at the end of this, in my example, is a 64GB raw disk file. You can convert that file to qcow2 like this:

$ qemu-img convert -O qcow2 $child1.raw $child1.qcow2

But you’re left with a 64GB qcow2 file for your troubles. I experimented with virt-sparsify to reduce the size of this image, but it doesn’t help in my case (no space is saved). I suspect that’s because the disk image has multiple partitions, since it originally came from a VMware environment.
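
For reference, the basic virt-sparsify invocation is along these lines (the output file name here is just illustrative):

$ virt-sparsify $child1.qcow2 $child1.sparse.qcow2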

Luckily qemu-img can also re-create the COW layer that existed on the admin-only side of the public cloud barrier. You do this by creating an empty qcow2 layer on top of the converted image, and then rebasing that new layer onto the original VMDK file like this:

$ qemu-img create -f qcow2 -b $child1.qcow2 $child1.delta.qcow2
$ qemu-img rebase -b $parent1.vmdk $child1.delta.qcow2

The rebase, in its default safe mode, copies into the delta any blocks which differ between the old backing file and the new one, so the delta ends up holding just the changes made on top of parent1. In my case I ended up with a 289MB $child1.delta.qcow2 file, which isn’t too shabby. It took about five minutes to produce that delta on my Google Cloud instance from a 7.9GB backing file and a 64GB upper layer.
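
As a sanity check, qemu-img can verify that the delta stacked on the VMDK still presents the same data as the flat image (compare follows backing files, and should report the two as identical):

$ qemu-img compare $child1.qcow2 $child1.delta.qcow2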

Wow, qemu-img is fast

I wanted to determine if it’s worth putting ephemeral images into the libvirt cache at all. How expensive are these images to create? They don’t need to come from the image service, so it can’t be too bad, right? It turns out that qemu-img is very, very fast at creating these images, based on the very small data set of my laptop with an ext4 file system…

    mikal@x220:/data/temp$ time qemu-img create -f raw disk 10g
    Formatting 'disk', fmt=raw size=10737418240
    
    real	0m0.315s
    user	0m0.000s
    sys	0m0.004s
    
    mikal@x220:/data/temp$ time qemu-img create -f raw disk 100g
    Formatting 'disk', fmt=raw size=107374182400
    
    real	0m0.004s
    user	0m0.000s
    sys	0m0.000s
    

Perhaps this is because I am using ext4, which does funky extent things when allocating blocks. However, the only ext3 file system I could find at my place is on my off-site backup disks, which are USB3 attached instead of the SATA2 that my laptop uses. Here are the numbers from there:

    $ time qemu-img create -f raw disk 100g
    Formatting 'disk', fmt=raw size=107374182400
    
    real	0m0.055s
    user	0m0.000s
    sys	0m0.004s
    

So, still very fast. Perhaps it’s the mkfs that’s slow? Here’s a run of creating an ext4 file system inside that 100GB file I just made on my laptop:

    $ time mkfs.ext4 disk
    mke2fs 1.41.14 (22-Dec-2010)
    disk is not a block special device.
    Proceed anyway? (y,n) y
    warning: Unable to get device geometry for disk
    Filesystem label=
    OS type: Linux
    Block size=4096 (log=2)
    Fragment size=4096 (log=2)
    Stride=0 blocks, Stripe width=0 blocks
    6553600 inodes, 26214400 blocks
    1310720 blocks (5.00%) reserved for the super user
    First data block=0
    Maximum filesystem blocks=0
    800 block groups
    32768 blocks per group, 32768 fragments per group
    8192 inodes per group
    Superblock backups stored on blocks:
    	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
    	4096000, 7962624, 11239424, 20480000, 23887872
    
    Writing inode tables: done
    Creating journal (32768 blocks): done
    Writing superblocks and filesystem accounting information: done
    
    This filesystem will be automatically checked every 36 mounts or
    180 days, whichever comes first.  Use tune2fs -c or -i to override.
    
    real	0m4.083s
    user	0m0.096s
    sys	0m0.136s
    

That time includes the time it took me to hit the ‘y’ key, as I couldn’t immediately find a flag to stop prompting.
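
For what it’s worth, mkfs.ext4’s -F flag forces mke2fs to proceed even when the target isn’t a block special device, so I believe a run like this would skip the prompt:

    $ time mkfs.ext4 -F disk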

In conclusion, there is nothing slow here. I don’t see why we’d want to cache ephemeral disks and use copy on write for them at all. It’s very cheap to just create a new one each time, and it makes the code much simpler.

Further adventures with base images in OpenStack

I was bored over the New Year’s weekend, so I figured I’d have a go at implementing image cache management as discussed previously. I actually have an implementation of about 75% of that blueprint now, but it’s not ready for prime time yet. The point of this post is more to document some stuff I learnt about VM startup along the way so I don’t forget it later.

So, you want to start a VM on a compute node. Once the scheduler has selected a node to run the VM on, the next step is for the compute service on that machine to start the VM up. First the specified disk image is fetched from your image service (in my case glance) and placed in a temporary location on disk. If the image is already raw, it is then renamed to the correct name in the instances/_base directory. If it isn’t raw, it is converted to raw format, and that converted file is put in the right place. Optionally, the image can also be extended to a specified size as part of this process.
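
As a rough sketch, the equivalent manual steps would look something like this (the paths and file names here are simplified and somewhat hypothetical):

    $ # Fetch the image to a temporary location
    $ glance image-download --file /tmp/$image_uuid.part $image_uuid
    $ # Convert to raw (or just rename, if it is already raw) into _base
    $ qemu-img convert -O raw /tmp/$image_uuid.part instances/_base/$base_name
    $ # Optionally extend the image to the requested size
    $ qemu-img resize instances/_base/$base_name 20G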

Then, depending on whether you have copy on write (COW) images turned on, either a COW version of the file is created inside the instances/$instance/ directory, or the file from _base is copied to instances/$instance.
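
In other words, roughly one of these two things happens (again with simplified paths):

    $ # COW enabled: a thin overlay whose backing file is the cached base image
    $ qemu-img create -f qcow2 -b instances/_base/$base_name instances/$instance/disk
    $ # COW disabled: a full private copy of the base image
    $ cp instances/_base/$base_name instances/$instance/disk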

This has a side effect that had me confused for quite a while yesterday: the checksums, and even file sizes, stored in glance are not reliable indicators of base image corruption. Most of my confusion was because image files in glance are immutable, so how come they differed from what’s on disk? The other problem was that the images I was using on my development machine were raw images, and for those the checksums did work. It was only when I moved to a slightly more complicated environment that I had enough data to work out what was happening.
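
The reason is simple once you see it: converting (or extending) an image changes its bytes without changing its logical content, so the checksum glance stores no longer matches the file in _base. A hypothetical demonstration:

    $ qemu-img convert -O raw fetched.qcow2 base.raw
    $ md5sum fetched.qcow2 base.raw    # two different checksums for the "same" image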

We therefore have a problem for that blueprint. We can’t use the checksums from glance as a reliable indicator of whether something has gone wrong with the base image, so I need to come up with something nicer. What this probably means for the first cut of the code is that checksums will only be verified for raw images which weren’t extended, but I haven’t written that code yet.

So, there we go.

OpenStack compute node cleanup

I’ve never used OpenStack before, and I imagine that’s true of many other people out there. It’s actually pretty cool, although I encountered a problem the other day that I think is worthy of some more documentation. OpenStack runs virtual machines for users, in much the same manner as Amazon’s EC2 system. These instances are started from a base image, and copy on write is then used to store the differences as the instance changes things. This makes sense in a world where a given machine might be running more than one instance from the same image.

However, I encountered a compute node which was running low on disk. This is because there is currently nothing which cleans up these base images, so even if none of the instances on a machine require an image, and even if the machine is experiencing disk stress, the images still hang around. There are a few blog posts out there about this, but nothing really definitive that I could find. I’ve filed a bug asking for the Ubuntu package to include some sort of cleanup script, and interestingly that led me to learn that there are plans for a pretty comprehensive image management system. Unfortunately, it doesn’t seem that anyone is working on this at the moment. I would offer to lend a hand, but it’s not clear to me as an OpenStack n00b where I should start. If you read this and have some pointers, feel free to contact me.

Anyway, we still need to clean up that node experiencing disk stress. It turns out that nova uses qemu for its copy on write disk images, so we can ask qemu-img which base images are in use. It goes something like this:

    $ cd /var/lib/nova/instances
    $ find . -name "disk*" | xargs -n1 qemu-img info | grep backing | \
      sed -e 's/.*file: //' -e 's/ .*//' | sort | uniq > /tmp/inuse
    
    

/tmp/inuse will now contain a list of the images in _base that are currently in use. Now you can change to the base directory, which defaults to /var/lib/nova/instances/_base, and do some cleanup. What I do is look for large image files which are several days old, check whether they appear in that temporary file I created, and delete them if they don’t.
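
Something like this works as a first pass, assuming you’re happy with age and size thresholds along these lines (it only prints candidates, so you can review the list before deleting anything):

    $ cd /var/lib/nova/instances/_base
    $ find . -maxdepth 1 -type f -mtime +3 -size +1G | while read img; do
          grep -q "$(basename $img)" /tmp/inuse || echo "unused: $img"
      done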

I’m sure that this could be better automated by a simple python script, but I haven’t gotten around to it yet. If I do, I will be sure to mention it here.
