Moving on

Thursday this week is my last day at Canonical. After a little over a year there, I’m moving on to the private cloud team at Rackspace — my first day with Rackspace will be the 17th of December. I’m very excited to be joining Rackspace — I’m excited by the project, the team, and the opportunity to make OpenStack even better. We’ve also talked about some interesting stuff we’d like to do in the Australian OpenStack community, but I’m going to hold off on talking about that until I’ve had a chance to settle in.

I am appreciative of my time at Canonical — when I joined I was unaware of the existence of OpenStack, and without Canonical I might never have found this awesome project that I love. I also had the chance to work with some really smart people who taught me a lot. This move is about spending more time on OpenStack than Canonical was able to allow.

On conference t-shirts

Conference t-shirts can’t be that hard, right? I certainly don’t remember them being difficult when Canberra last hosted linux.conf.au in 2005. I was the person who arranged all the swag for that conference, so I should remember. Yet here I am, having spent hours on the phone with vendors, surrounded by discarded sample t-shirts, size charts and colour swatches. What changed?

The difference between now and then is that in the intervening seven years the Australian Linux community has started to make a real effort to be more inclusive. We have anti-harassment policies, we encourage new speakers, and we’re making real efforts to encourage more women into the community.

linux.conf.au 2013 is making real efforts to be as inclusive as possible — one of the first roles we allocated was a diversity officer, who is someone active in the geek feminism community. We’ve had serious discussions about how we can make our event as friendly to all groups as possible, and have some interesting things along those lines to announce soon. We’re working hard to make the conference a safe environment for everyone, and will have independent delegate advocates available at all social events, as well as during the conference.

What I specifically want to talk about here, though, is the conference t-shirts. We started out with the following criteria — we wanted to provide a men’s cut and a separate women’s cut, because we recognize that unisex t-shirts are not a good solution for most women. We also needed a wider than usual size range in those shirts, because we have a diverse set of delegates attending our event. Finally, we didn’t really want to do black, dark blue, or white shirts — mostly because those colours are overdone, but also because the conference is in January, when the mean temperature is around 30 degrees Celsius.

Surprisingly, those criteria eliminate the two largest vendors of t-shirts in Australia. Neither Hanes nor Gildan make any t-shirt that has both men’s and women’s cuts, in interesting colours and with a large size variety. So we went on the hunt for other manufacturers. However, I’m jumping a little ahead of myself here, so bear with me.

First off we picked a Hanes shirt because we liked the look of it. We were comfortable with that choice for quite a while before we discovered that the range of colours available in both the men’s and women’s cut was quite small. Sure, there are heaps of colours in each cut, but the overlapping set of colours is much smaller than it first appears. At this point we knew we needed to find a new vendor.

The next most obvious choice is Gildan. Gildan does some really nice shirts, and I immediately fell in love with a colour called “charcoal”. However, once bitten twice shy, so we ordered some sample t-shirts for my wife and me to try out. I’m glad we did this, because the women’s cut was a disaster. First off it didn’t fit my wife very well in the size she normally wears, which it turns out is because the lighter cotton style of t-shirt is 10 centimeters smaller horizontally than the thicker cotton version! It got even worse when we washed the shirts and tried them again — the shirt shrank significantly on the first wash. We also noticed something else which had escaped our attention — the absolute largest size that Gildan did in our chosen style for women was an XXL. Given the sizing ran small, that probably made the largest actual size we could provide a mere XL. That’s not good enough.

Gildan was clearly not going to work for us. I got back on the phone with the supplier who was helping us out and we spent about an hour talking over our requirements and the problems we were seeing with the samples. We even discussed getting a run of custom shirts made overseas and shipped in, but the timing wouldn’t work out. They promised to go away and see what other vendors they could find in this space. Luckily for us they came back with a vendor called BizCollection, who do soft cotton shirts in the charcoal colour I like.

So next we ordered samples of this shirt. It looked good initially — my shirt fit well, as did my wife’s. However, we’d now learnt that testing the shirts through a few wash cycles was useful. So then my wife and I wore the shirts as much as we could for a week, washing them each evening and abusing them in all the ways we could think of — using the dryer, hanging them outside in the sun, pretty much everything apart from jumping up and down on them. I have to say these shirts have held up well, and we’re very happy with them.

The next step is to go back and order a bunch more sample shirts and make my team wear them. The goal here is to validate the size charts that the vendor provides, and to make sure that we can give delegates as much advice about fit as possible. Also, I love a free t-shirt.

After all this we still recognize that some people will never be happy with the conference’s t-shirt. Perhaps they hate the colour or the design, or perhaps they’re very tall and every t-shirt is too short for them. So the final thing we’re doing is we’re giving delegates a choice — they can select between a t-shirt, a branded cap, or a reusable coffee cup. In this way we don’t force delegates to receive something they don’t really want and are unlikely to use.

When you register for the conference, please try to remember that we’ve put a lot of effort as an organizing team into being as detail oriented as possible with all the little things we think delegates care about. I’m sure we’ve made some mistakes, but we are, after all, volunteers doing our best. If you do see something you think can be improved, I’d ask that you come and speak to us privately first and give us a chance to make it right before you complain in public.

Thanks for reading my rant about conference t-shirts.

A first pass at glance replication

A few weeks back I was tasked with turning up a new OpenStack region. This region couldn’t share anything with existing regions because the plan was to test pre-release versions of OpenStack there, and if we shared something like glance then we would either have to endanger glance for all regions during testing, or not test glance. However, our users already have a favorite set of images uploaded to glance, and I really wanted to make it as easy as possible for them to use the new region — I wanted all of their images to magically just appear there. What I needed was some form of glance replication.

I’d sat in on the glance replication session at the Folsom OpenStack Design Summit. The NeCTAR use case at the bottom is exactly what I wanted, so it’s reassuring that other people wanted something like that too. However, no one was working on this feature. So I wrote it. In fact, because of the code review process I wrote it twice, but let’s not dwell on that too much.

So, as of change id I7dabbd6671ec75a0052db58312054f611707bdcf there is a very simple replicator script in glance/bin. It’s not perfect, and I expect it will need to be extended a bunch, but it’s a start at least. I’m using it in production now, so I’m relatively confident it’s not totally wrong.


The replicator supports the following commands at the moment:

livecopy

glance-replicator livecopy fromserver:port toserver:port

    Load the contents of one glance instance into another.

    fromserver:port: the location of the master glance instance.
    toserver:port:   the location of the slave glance instance.

This is the main meat of the replicator. Take a copy of the fromserver, and dump it onto the toserver. Only images visible to the user running the replicator will be copied if you’re using Keystone. Only images active on fromserver are copied across. The copy is done “on-the-wire”, so there are no large temporary files on the machine running the replicator to clean up.

dump

glance-replicator dump server:port path

    Dump the contents of a glance instance to local disk.

    server:port: the location of the glance instance.
    path:        a directory on disk to contain the data.

Do the same thing as livecopy, but dump the contents of the glance server to a directory on disk. This includes metadata and image data, and the directory is probably going to be quite large, so be prepared.

load

glance-replicator load server:port path

    Load the contents of a local directory into glance.

    server:port: the location of the glance instance.
    path:        a directory on disk containing the data.

Load a directory created by the dump command into a glance server. dump / load was originally written because I had two glance servers that couldn’t talk to each other over the network for policy reasons. However, I could dump the data and move it to the destination network out of band. If you had a very large glance installation and were bringing up a new region at the end of a slow link, then this might be something you’d be interested in.
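
As a rough example of that workflow (the host names, the port and the path here are all made up; 9292 is just the usual glance API port):

    glance-replicator dump glance.oldregion.example.com:9292 /mnt/usb/glance-dump
    # ...move the dump directory to the destination network out of band...
    glance-replicator load glance.newregion.example.com:9292 /mnt/usb/glance-dump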

compare

glance-replicator compare fromserver:port toserver:port

    Compare the contents of fromserver with those of toserver.

    fromserver:port: the location of the master glance instance.
    toserver:port:   the location of the slave glance instance.

What would a livecopy do? The compare command will show you the differences between the two servers, so it’s a bit like a dry run of the replication.
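
So a cautious replication might look something like this, with the host names and port being illustrative only:

    glance-replicator compare glance.master.example.com:9292 glance.slave.example.com:9292
    glance-replicator livecopy glance.master.example.com:9292 glance.slave.example.com:9292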

size

glance-replicator size server:port

    Determine the size of a glance instance if dumped to disk.

    server:port: the location of the glance instance.

The size command will tell you how much disk space is going to be used by image data in either a dump or a livecopy. It doesn’t, however, know about redundancy costs with things like swift, so it just gives you the raw number of bytes that would be written to the destination.


The glance replicator is very new code, so I wouldn’t be too surprised if there are bugs out there or obvious features that are lacking. For example, there is no support for SSL at the moment. Let me know if you have any comments or encounter problems using the replicator.

Got Something to Say? The LCA 2013 CFP Opens Soon!

The call for presentations opens on 1 June, which is only 11 days away! So if you’re thinking of speaking at the conference (a presentation, tutorial, or miniconference), now would be a good time to start thinking about what you’re going to say. While you’re thinking, please spare a thought for our web team, who are bringing up the entire zookeepr instance so that the CFP will work properly.

We’ve been getting heaps of stuff done over the past few months. We’ve had a “ghosts” meeting (a meeting with former LCA directors), found conference and social venues, and are gearing up for the Call For Presentations.

We’ve signed a contract for the keynote venue, which I think you will all really enjoy. We have also locked in our booking for the lecture theatres, which is now working its way through the ANU process. For social events, we’ve got a great venue for the penguin dinner, and have shortlisted venues for the speakers’ dinner and the professional delegates’ networking session. We’re taking a bit of extra time here because we want venues that are special, and not just the ones which first came to mind.

The ghosts meeting went really well and I think we learnt some important things. The LCA 2013 team is a bit unusual, because so many of us have been on an LCA core team before, but that gave us a chance to dig into things which deserved more attention and skip over the things which are self-evident. We want to take the opportunity in 2013 to have the most accessible, diverse and technically deep conference that we possibly can, and there was a lot of discussion around those issues. We’ve also had it drummed into us that communication with delegates is vitally important, and you should expect our attempts to communicate to ramp up as the conference approaches.

I’m really excited about the progress we’ve made so far, and I feel like we’re in a really good state right now. As always, please feel free to contact the LCA2013 team at contact@lca2013.linux.org.au if you have any questions.

Folsom Dev Summit sessions

I thought I should write up the dev summit sessions I am hosting now that the program is starting to look solid. This is mostly for my own benefit, so I have a solid understanding of where to start these sessions off. Both are short brainstorm sessions, so I am not intending to produce slide decks or anything like that. I just want to make sure there is something to kick discussion off.

Image caching, where to from here (nova hypervisors)

As of Essex, libvirt has an image cache to speed up the startup of new instances. This cache stores images directly from glance, as well as resized images. There is a periodic task which cleans up images in the cache which are no longer needed. The periodic task can also optionally detect images which have become corrupted on disk.
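
For anyone who hasn’t looked at the libvirt implementation, the cleanup logic boils down to something like the sketch below. This isn’t nova’s actual code: the cache path and retention period are assumptions, and nova works out the in-use set from the hypervisor rather than taking it as a parameter. It does show the general shape of the periodic task though.

    # A simplified sketch of the cache cleanup idea, not nova's real code.
    import os
    import time

    CACHE_DIR = '/var/lib/nova/instances/_base'  # assumed cache location
    MAX_UNUSED_AGE = 24 * 60 * 60                # only remove files unused for a day

    def clean_image_cache(in_use_files):
        now = time.time()
        for name in os.listdir(CACHE_DIR):
            path = os.path.join(CACHE_DIR, name)
            if name in in_use_files:
                continue  # still backing a running instance
            if now - os.path.getmtime(path) < MAX_UNUSED_AGE:
                continue  # touched recently, keep it for now
            os.remove(path)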

So first off, do we want to implement this for other hypervisors as well? As mentioned in a recent blog post I’d like to see the image cache manager become common code and have all the hypervisors deal with this in exactly the same manner — that makes it easier to document, and means that on-call operations people don’t need to determine what hypervisor a compute node is running before starting to debug. However, that requires the other hypervisor implementations to change how they stage images for instance startup, and I think it bears further discussion.

Additionally, the blueprint (https://blueprints.launchpad.net/nova/+spec/nova-image-cache-management) proposed that popular / strategic images could be pre-cached on compute nodes. Is this something we still want to do? What factors do we want to use for the reference implementation? I have a few ideas here that are listed in the blueprint, but most of them require talking to glance to implement. There is some hesitance in adding glance calls to a periodic task, because in a keystone’d implementation that would require an admin token in the nova configuration file. Is there a better way to do this, or is it ok to rely on glance in a periodic task?

Ops pain points (nova other)

Apart from my own ideas (better instance logging, for example), I’m very interested in hearing from other people about what we can do to make nova easier for ops people to run. This is especially true for relatively easy-to-implement things we can get done in Folsom. This blueprint for deployer-friendly configuration files is a good example of changes which don’t look too hard to implement, but that would make the world a better place for opsen. There are many other examples of blueprints in this space, including:

What else can we be doing to make life better for opsen? I’m especially interested in getting people who actually run OpenStack in the wild into the room to tell us what is painful for them at the moment.

Reflecting on Essex

This post is kind of long, and a little self-indulgent. However, I really wanted to spend some time thinking about what I did for the Essex release cycle, and what I want to do for the Folsom release. I spent Essex mostly hacking on things in isolation, except for when Padraig Brady and I were hacking in a similar space. I’d like to collaborate more for Folsom, and I’m hoping talking about what I’m interested in doing in public might help with that.

I came relatively late to the Essex development cycle, having never even heard of OpenStack before joining Canonical. We can talk some other time about how I’d worked in the cloud space for six years and yet wasn’t aware of the open source implementations.

My initial introduction to OpenStack was being paged for compute nodes which were continually running out of disk. I googled around a bit and discovered that cached images for instances were never cleaned up (to start an instance, an image is fetched from glance, possibly has its format converted, and is resized; an instance is then started with the resulting image, but none of those cached images were ever cleaned up). I filed bug 904532 as my absolute first interaction with the OpenStack community. Scott Moser kindly pointed me at the blueprint for how to actually fix the problem.

(Remind me, if Phil Day comes to the OpenStack developer summit, that I should sit down with him at some point and see how close what was actually implemented got to what he wrote in that blueprint. I suspect we’ve still got a fair way to go, but I’ll talk more about that later in this post.)

This was a pivotal moment. I’d just spent the last six years writing python code to manage largish cloud clusters, and here was a bug which was hurting me in a python package intended to manage clusters very similar to those I had been running. I should just fix the bug, right?

It turns out that the OpenStack core developers are super easy to work with. I’d say that the code review process certainly feels like it was modelled on Google’s, but in general the code reviewers are nicer with their comments than what I’m used to. This makes it much easier to motivate yourself to go and spend some more time hacking than a deeply negative review would. I think Vish is especially worthy of a shout out as being an amazing person to work with. He’s helpful, patient, and very smart.

In the end I wrote the image cache manager which ships in Essex. It’s not perfect, but it’s a lot better than what came before, and it’s a good basis to build on. There is some remaining tech debt for image cache management which I intend to work on for Folsom. First off, the image cache only works for libvirt instances at the moment. I’d like to pull all the other hypervisors into line as best as possible. There are hooks in the virtualization driver for this, but as far as I am aware no one has started this work. To be completely honest, I’d like to see the image cache manager become common code and have all the hypervisors deal with this in exactly the same manner — that makes it easier to document, and means that on-call operations people don’t need to determine what hypervisor a compute node is running before starting to debug. This is something I very much want to sit down with other nova developers and talk about at the summit.

The next step for image cache management is tracked in a very bare bones blueprint. The original blueprint envisaged that it would be desirable to pre-cache some images on all nodes. For example, a cloud host might want to offer slightly faster startup times for some images by ensuring they are pre-cached. I’ve been thinking about this a lot, and I can see other use cases here as well. For example, if you have mission critical instances and you wanted to tolerate a glance failure, then perhaps you want to pre-cache a class of images that serve those mission critical instances. The intention is to provide an interface and default implementation for the pre-caching logic, and then let users go wild working out their own requirements.
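
To make that concrete, here is the sort of shape such an interface might take. Nothing like this exists in nova today; the class and method names are purely hypothetical.

    # A hypothetical interface for pluggable pre-cache policies.
    class ImagePrecachePolicy(object):
        def images_to_precache(self, host):
            """Return the glance image ids to pre-fetch onto this host."""
            raise NotImplementedError()

    # A possible default implementation: pre-cache a fixed list of images.
    class PopularImagesPolicy(ImagePrecachePolicy):
        def __init__(self, popular_image_ids):
            self.popular_image_ids = popular_image_ids

        def images_to_precache(self, host):
            return list(self.popular_image_ids)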

I suspect the hardest bit of the pre-caching will be reducing the interactions with glance. The current feeling is that calling glance from a periodic task is a bit scary, and it has been actively avoided for Essex. This is especially true if Keystone is enabled, as the periodic task won’t have an admin context unless we pull that from the config file. However, if you’re trying to determine which images are mission critical, then you really need to talk to glance. I guess another option would be to have a table of such things in nova’s database, but that feels wrong to me. We’re going to have to talk about this a bit more.

(It would also be interesting to talk about the relative priority of instances. If a cluster is experiencing outages, then perhaps some customers would pay more to have their instances be the last killed off or something. Or perhaps I have instances which are less critical than others, so I want the cluster to degrade in an understood manner.)

That leads logically onto a scheduler change I would like to see. If I have a set of compute nodes I know already have the image for a given instance, shouldn’t I prefer to start instances on those nodes instead of fetching the image to yet more compute nodes? In fact, if I already have a correctly resized COW base image for an instance on a given node, then it would make sense to run a new instance on that node as well. We need to be careful here, because you wouldn’t want to run all of a given class of instance on a small set of compute nodes, but if the image was something like a default Ubuntu image, then it would make sense. I’d be interested in hearing what other people think of doing something like this.
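
As a toy illustration of the kind of preference I mean (this is not scheduler code, and the cached_images mapping would have to come from something like the image cache manager):

    # Order candidate hosts so those with the image already cached come first.
    # cached_images maps a host name to the set of image ids cached there.
    def rank_hosts(hosts, image_id, cached_images):
        return sorted(hosts,
                      key=lambda host: image_id not in cached_images.get(host, set()))

    # rank_hosts(['node1', 'node2'], 'img-123', {'node2': {'img-123'}})
    # returns ['node2', 'node1']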

Another thing I’ve tried to focus on for Essex is making OpenStack easier for operators to run. That started off relatively simply, by adding an option for log messages to specify what instance a message relates to. This means that when a user queries the state of their instance, the admin can now just grep for the instance UUID, and run from there. It’s not perfect yet, in that not all messages use this functionality, but that’s some tech debt that I will take on in Folsom. If you’re a nova developer, then please pass instance= in your log messages where relevant!

This logging functionality isn’t perfect, because if you only have the instance UUID in the method you’re writing, it won’t work. It expects full instance dicts because of the way the formatting code works. This is kind of ironic, in that the default logging format only includes the UUID. In Folsom I’ll also extend this code so that the right thing happens with bare UUIDs as well.
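
To illustrate the general idea, here is a standalone sketch using Python’s logging module. This is not nova’s formatter code (the real thing is more involved), but it shows how an adapter can prefix messages with the instance UUID so that admins can grep for it. The instance dict here is obviously made up.

    # A self-contained sketch of instance-aware logging, not nova's code.
    import logging

    logging.basicConfig(format='%(message)s', level=logging.DEBUG)

    class InstanceAdapter(logging.LoggerAdapter):
        def process(self, msg, kwargs):
            instance = kwargs.pop('instance', None)
            uuid = instance.get('uuid', 'unknown') if instance else 'no instance'
            return '[instance: %s] %s' % (uuid, msg), kwargs

    LOG = InstanceAdapter(logging.getLogger('example.compute'), {})
    LOG.debug('Rebooting instance',
              instance={'uuid': 'f47ac10b-58cc-4372-a567-0e02b2c3d479'})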

Another simple logging tweak I wrote is that tracebacks now have the time and instance included in them. This makes it much easier for admins to determine the context of a traceback in their logs. It should be noted that both of these changes were relatively trivial, but trivial things can often make life much easier for others.

There are two sessions at the Folsom dev summit talking about how to make OpenStack easier for operators to run. One is from me, and the other is from Duncan McGreggor. Neither has been accepted yet, but if I notice that Duncan’s has been accepted I’ll drop mine. I’m very, very interested in what operations staff feel is currently painful, because having something which is easy to scale and manage is vital to adoption. This is also the core of what I did at Google, and I feel I can make a real contribution here.

I know I’ve come relatively late to the OpenStack party, but there’s heaps more to do here and I’m super enthused to be working on code that I can finally show people again.

Call for papers opens soon

It’s time to start thinking about your talk proposals, because the call for papers is only eight weeks away!

For the 2013 conference, the papers committee are going to be focusing on deep technical content, and things we think are going to really matter in the future — that might range from freedom and privacy, to open source cloud systems, to energy-efficient server farms of the future. However, the conference is to a large extent what the speakers make it — if we receive many excellent submissions on a topic, then it’s sure to be represented at the conference.

The papers committee will be headed by the able combination of Michael Davies and Mary Gardiner, who have done an excellent job in previous years. They’re currently working through the details of the call for papers announcement. I am telling you this now because I want speakers to have plenty of time to prepare for the submissions process, as I think that will produce the highest quality of submissions.

I also wanted to let you know the organising for linux.conf.au 2013 is progressing well. We’re currently in the process of locking in all of our venue arrangements, so we will have some announcements about that soon. We’ve received our first venue contract to sign, which is for the keynote venue. It’s exciting, but at the same time a good reminder that the conference is a big responsibility.

What would you like to see at the conference? I am sure there are things which are topical which I haven’t thought of. Blog or tweet your thoughts (include the hashtag #lca2013 please), or email us at contact@lca2013.linux.org.au.

Wow, qemu-img is fast

I wanted to determine if it’s worth putting ephemeral images into the libvirt cache at all. How expensive are these images to create? They don’t need to come from the image service, so it can’t be too bad, right? It turns out that qemu-img is very, very fast at creating these images, based on the very small data set of my laptop with an ext4 file system…

    mikal@x220:/data/temp$ time qemu-img create -f raw disk 10g
    Formatting 'disk', fmt=raw size=10737418240
    
    real	0m0.315s
    user	0m0.000s
    sys	0m0.004s
    
    mikal@x220:/data/temp$ time qemu-img create -f raw disk 100g
    Formatting 'disk', fmt=raw size=107374182400
    
    real	0m0.004s
    user	0m0.000s
    sys	0m0.000s
    

Perhaps this is because I am using ext4, which does funky extents things when allocating blocks. However, the only ext3 file system I could find at my place is on my off-site backup disks, which are USB3 attached instead of the SATA2 that my laptop uses. Here are the numbers from there:

    $ time qemu-img create -f raw disk 100g
    Formatting 'disk', fmt=raw size=107374182400
    
    real	0m0.055s
    user	0m0.000s
    sys	0m0.004s
    

So still very, very fast. Perhaps it’s the mkfs that’s slow? Here’s a run of creating an ext4 file system inside that 100gb file I just made on my laptop:

    $ time mkfs.ext4 disk
    mke2fs 1.41.14 (22-Dec-2010)
    disk is not a block special device.
    Proceed anyway? (y,n) y
    warning: Unable to get device geometry for disk
    Filesystem label=
    OS type: Linux
    Block size=4096 (log=2)
    Fragment size=4096 (log=2)
    Stride=0 blocks, Stripe width=0 blocks
    6553600 inodes, 26214400 blocks
    1310720 blocks (5.00%) reserved for the super user
    First data block=0
    Maximum filesystem blocks=0
    800 block groups
    32768 blocks per group, 32768 fragments per group
    8192 inodes per group
    Superblock backups stored on blocks:
    	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
    	4096000, 7962624, 11239424, 20480000, 23887872
    
    Writing inode tables: done
    Creating journal (32768 blocks): done
    Writing superblocks and filesystem accounting information: done
    
    This filesystem will be automatically checked every 36 mounts or
    180 days, whichever comes first.  Use tune2fs -c or -i to override.
    
    real	0m4.083s
    user	0m0.096s
    sys	0m0.136s
    

That time includes the time it took me to hit the ‘y’ key, as I couldn’t immediately find a flag to stop prompting.

In conclusion, there is nothing slow here. I don’t see why we’d want to cache ephemeral disks and use copy on write for them at all. It’s very cheap to just create a new one each time, and it makes the code much simpler.
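
For what it’s worth, I believe the reason raw image creation is so cheap is that the file starts out sparse, and one quick way to check is to compare the apparent size with what is actually allocated:

    $ qemu-img create -f raw disk 100g
    $ ls -lh disk    # apparent size: around 100G
    $ du -h disk     # allocated size: close to zero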

Slow git review uploads?

jeblair was kind enough to help me debug my problem with slow “git review” uploads for OpenStack projects just now. It turns out that part of my standard configuration for ssh is to enable ControlMaster and ControlPersist. I mostly do this because the machines I use at Canonical are a very long way away from my home in Australia, and it’s nice to have slightly faster connections when you ssh to a machine. However, gerrit is incompatible with these options as best as we can tell.

So, if your git reviews are taking 10 to 20 minutes to upload like mine were, check that you’re not using persistent connections. Excluding review.openstack.org from that part of my configuration has made a massive difference to the speed of uploads for me.
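
If you want to keep persistent connections for everything else, something along these lines in ~/.ssh/config should do the trick (the ControlPath and timeout values are just examples):

    # Disable connection sharing for gerrit only, keep it everywhere else.
    Host review.openstack.org
        ControlMaster no
        ControlPersist no

    Host *
        ControlMaster auto
        ControlPath ~/.ssh/control-%r@%h-%p
        ControlPersist 10m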