Starting your first instance on Shaken Fist (a video tutorial)

As a bit of an experiment, I’ve made this quick and dirty “vlog” style tutorial video to show you how to install Shaken Fist on a single machine and boot your first instance. I demonstrate how to install, setup your first virtual network, start the instance, inspect events that the instance has experienced, and then log in.

Let me know if you think its useful.

Deciding when to filter out large scale refactorings from code analysis

I want to be able to see the level of change between OpenStack releases. However, there are a relatively small number of changes with simply huge amounts of delta in them — they’re generally large refactors or the delete which happens when part of a repository is spun out into its own project.

I therefore wanted to explore what was a reasonable size for a change in OpenStack so that I could decide what maximum size to filter away as likely to be a refactor. After playing with a couple of approaches, including just randomly picking a number, it seems the logical way to decide is to simply plot a histogram of the various sizes, and then pick a reasonable place on the curve as the cutoff. Due to the large range of values (from zero lines of change to over a million!), I ended up deciding a logarithmic axis was the way to go.

For the projects listed in the OpenStack compute starter kit reference set, that produces the following histogram:A histogram of the sizes of various OpenStack commitsI feel that filtering out commits over 10,000 lines of delta feels justified based on that graph. For reference, the raw histogram buckets are:

Commit sizeCount
< 225747
< 11237436
< 101326314
< 1001148865
< 1000116928
< 1000013277
< 1000001522
< 1000000113

A quick summary of OpenStack release tags

I wanted a quick summary of OpenStack git release tags for a talk I am working on, and it turned out to be way more complicated than I expected. I ended up having to compile a table, and then turn that into a code snippet. In case its useful to anyone else, here it is:

ReleaseRelease dateFinal release tag
AustinOctober 20102010.1
BexarFebruary 20112011.1
CactusApril 20112011.2
DiabloSeptember 20112011.3
EssexApril 20122012.1.3
FolsomSeptember 20122012.2.4
GrizzlyApril 20132013.1.5
HavanaOctober 20132013.2.4
IcehouseApril 20142014.1.5
JunoOctober 20142014.2.4
KiloApril 20152015.1.4
LibertyOctober 2015Glance: 11.0.2
Keystone: 8.1.2
Neutron: 7.2.0
Nova: 12.0.6
MitakaApril 2016Glance: 12.0.0
Keystone: 9.3.0
Neutron: 8.4.0
Nova: 13.1.4
NewtonOctober 2016Glance: 13.0.0
Keystone: 10.0.3
Neutron: 9.4.1
Nova: 14.1.0
OcataFebruary 2017Glance: 14.0.1
Keystone: 11.0.4
Neutron: 10.0.7
Nova: 15.1.5
PikeAugust 2017Glance: 15.0.2
Keystone: 12.0.3
Neutron: 11.0.8
Nova: 16.1.8
QueensFebruary 2018Glance: 16.0.1
Keystone: 13.0.4
Neutron: 12.1.1
Nova: 17.0.13
RockyAugust 2018Glance: 17.0.1
Keystone: 14.2.0
Neutron: 13.0.7
Nova: 18.3.0
SteinApril 2019Glance: 18.0.1
Keystone: 15.0.1
Neutron: 14.4.2
Nova: 19.3.2
TrainOctober 2019Glance: 19.0.4
Keystone: 16.0.1
Neutron: 15.3.0
Nova: 20.4.1
UssuriMay 2020Glance: 20.0.1
Keystone: 17.0.0
Neutron: 16.2.0
Nova: 21.1.1
VictoriaOctober 2020Glance: 21.0.0
Keystone: 18.0.0
Neutron: 17.0.0
Nova: 22.0.1

Or in python form for those so inclined:

RELEASE_TAGS = {
    'austin': {'all': '2010.1'},
    'bexar': {'all': '2011.1'},
    'cactus': {'all': '2011.2'},
    'diablo': {'all': '2011.3'},
    'essex': {'all': '2012.1.3'},
    'folsom': {'all': '2012.2.4'},
    'grizzly': {'all': '2013.1.5'},
    'havana': {'all': '2013.2.4'},
    'icehouse': {'all': '2014.1.5'},
    'juno': {'all': '2014.2.4'},
    'kilo': {'all': '2015.1.4'},
    'liberty': {
        'glance': '11.0.2',
        'keystone': '8.1.2',
        'neutron': '7.2.0',
        'nova': '12.0.6'
    },
    'mitaka': {
        'glance': '12.0.0',
        'keystone': '9.3.0',
        'neutron': '8.4.0',
        'nova': '13.1.4'
    },
    'newton': {
        'glance': '13.0.0',
        'keystone': '10.0.3',
        'neutron': '9.4.1',
        'nova': '14.1.0'
    },
    'ocata': {
        'glance': '14.0.1',
        'keystone': '11.0.4',
        'neutron': '10.0.7',
        'nova': '15.1.5'
    },
    'pike': {
        'glance': '15.0.2',
        'keystone': '12.0.3',
        'neutron': '11.0.8',
        'nova': '16.1.8'
    },
    'queens': {
        'glance': '16.0.1',
        'keystone': '13.0.4',
        'neutron': '12.1.1',
        'nova': '17.0.13'
    },
    'rocky': {
        'glance': '17.0.1',
        'keystone': '14.2.0',
        'neutron': '13.0.7',
        'nova': '18.3.0'
    },
    'stein': {
        'glance': '18.0.1',
        'keystone': '15.0.1',
        'neutron': '14.4.2',
        'nova': '19.3.2'
    },
    'train': {
        'glance': '19.0.4',
        'keystone': '16.0.1',
        'neutron': '15.3.0',
        'nova': '20.4.1'
    },
    'ussuri': {
        'glance': '20.0.1',
        'keystone': '17.0.0',
        'neutron': '16.2.0',
        'nova': '21.1.1'
    },
    'victoria': {
        'glance': '21.0.0',
        'keystone': '18.0.0',
        'neutron': '17.0.0',
        'nova': '22.0.1'
    }
}

Rejected talk proposal: Shaken Fist, thought experiments in simpler IaaS clouds

This proposal was submitted for FOSDEM 2021. Given that acceptances were meant to be sent out on 25 December and its basically a week later I think we can assume that its been rejected. I’ve recently been writing up my rejected proposals, partially because I’ve put in the effort to write them and they might be useful elsewhere, but also because I think its important to demonstrate that its not unusual for experienced speakers to be rejected from these events.


OpenStack today is a complicated beast — not only does it try to perform well for large clusters, but it also embraces a diverse set of possible implementations from hypervisors, storage, networking, and more. This was a deliberate tactical choice made by the OpenStack community years ago, forming a so called “Big Tent” for vendors to collaborate in to build Open Source cloud options. It made a lot of sense at the time to be honest. However, OpenStack today finds itself constrained by the large number of permutations it must support, ten years of software and backwards compatability legacy, and a decreasing investment from those same vendors that OpenStack courted so actively.

Shaken Fist makes a series of simplifying assumptions that allow it to achieve a surprisingly large amount in not a lot of code. For example, it supports only one hypervisor, one hypervisor OS, one networking implementation, and lacks an image service. It tries hard to be respectful of compute resources while idle, and as fast as possible to deploy resources when requested — its entirely possible to deploy a new VM and start it booting in less than a second for example (if the boot image is already held in cache). Shaken Fist is likely a good choice for small deployments such as home labs and telco edge applications. It is unlikely to be a good choice for large scale compute however.

Introducing Shaken Fist

The first public commit to what would become OpenStack Nova was made ten years ago today — at Thu May 27 23:05:26 2010 PDT to be exact. So first off, happy tenth birthday to Nova!

A lot has happened in that time — OpenStack has gone from being two separate Open Source projects to a whole ecosystem, developers have come and gone (and passed away), and OpenStack has weathered the cloud wars of the last decade. OpenStack survived its early growth phase by deliberately offering a “big tent” to the community and associated vendors, with an expansive definition of what should be included. This has resulted in most developers being associated with a corporate sponser, and hence the decrease in the number of developers today as corporate interest wanes — OpenStack has never been great at attracting or retaining hobbist contributors.

My personal involvement with OpenStack started in November 2011, so while I missed the very early days I was around for a lot and made many of the mistakes that I now see in OpenStack.

What do I see as mistakes in OpenStack in hindsight? Well, embracing vendors who later lose interest has been painful, and has increased the complexity of the code base significantly. Nova itself is now nearly 400,000 lines of code, and that’s after splitting off many of the original features of Nova such as block storage and networking. Additionally, a lot of our initial assumptions are no longer true — for example in many cases we had to write code to implement things, where there are now good libraries available from third parties.

That’s not to say that OpenStack is without value — I am a daily user of OpenStack to this day, and use at least three OpenStack public clouds at the moment. That said, OpenStack is a complicated beast with a lot of legacy that makes it hard to maintain and slow to change.

For at least six months I’ve felt the desire for a simpler cloud orchestration layer — both for my own personal uses, and also as a test bed for ideas for what a smaller, simpler cloud might look like. My personal use case involves a relatively small environment which echos what we now think of as edge compute — less than 10 RU of machines with a minimum of orchestration and management overhead.

At the time that I was thinking about these things, the Australian bushfires and COVID-19 came along, and presented me with a lot more spare time than I had expected to have. While I’m still blessed to be employed, all of my social activities have been cancelled, so I find myself at home at a loose end on weekends and evenings at lot more than before.

Thus Shaken Fist was born — named for a Simpson’s meme, Shaken Fist is a deliberately small and highly opinionated cloud implementation aimed at working well in small deployments such as homes, labs, edge compute locations, deployed systems, and so forth.

I’d taken a bit of trouble with each feature in Shaken Fist to think through what the simplest and highest value way of doing something is. For example, instances always get a config drive and there is no metadata server. There is also only one supported type of virtual networking, and one supported hypervisor. That said, this means Shaken Fist is less than 5,000 lines of code, and small enough that new things can be implemented very quickly by a single middle aged developer.

Shaken Fist definitely has feature gaps — API authentication and scheduling are the most obvious at the moment — but I have plans to fill those when the time comes.

I’m not sure if Shaken Fist is useful to others, but you never know. Its apache2 licensed, and available on github if you’re interested.

Configuring load balancing and location headers on Google Cloud

I have a need at the moment to know where my users are in the world. This helps me to identify what compute resources to serve their request with in order to reduce the latency they experience. So how do you do that thing with Google Cloud?

The first step is to setup a series of test backends to send traffic to. I built three regions: Sydney; London; and Los Angeles. It turns out in hindsight that wasn’t actually nessesary though — this would work with a single backend just as well. For my backends I chose a minimal Ubuntu install, running this simple backend HTTP service.

I had some initial trouble finding a single page which walked through the setup of the Google Cloud load balancer to do what I wanted, which is the main reason for writing this post. The steps are:

Create your test instances and configure the backend on them. I ended up with a setup like this:

A list of google cloud VMs

Next setup instance groups to contain these instances. I chose unmanaged instance groups (that is, I don’t want autoscaling). You need to create one per region.

A list of google cloud instance groups

But wait! There’s one more layer of abstraction. We need a backend service. The configuration for these is cunningly hidden on the load balancing page, on a separate tab. Create a service which contains our three instance groups:

A sample backend service

I’ve also added a health check to my service, which just requests “/healthz” from each instance and expects a response of “OK” for healthy backends.

The backend service is also where we configure our extra headers. Click on the “advanced configurations” link, and more options appear:

Additional backend service options

Here I setup the extra HTTP headers the load balancer should insert: X-Region; X-City; and X-Lat-Lon.

And finally we can configure the load balancer. I selected a “HTTP(S) load balancer”, as I only care about incoming HTTP and HTTPS traffic. Obviously you set the load balancer to route traffic from the Internet to your VMs, and you wire the backend of the load balancer to your service. Select your backend service for the backend.

Now we can test! If I go to my load balancer in a web browser, I now get a result like this:

The top part of the page is just the HTTP headers from the request. You can see that we’re now getting helpful location headers. Mission accomplished!

Setting up VXLAN between nested virt VMs on Google Compute Engine

I wanted to play with a VXLAN mesh between VMs on more than one hypervisor node, but the setup for VXLAN ended up being a separate post because it was a bit long. Read that post first if you want to follow the instructions here.

Now that we have a working VXLAN mesh between our two nodes we can move on to installing libvirt (which is called libvirt-daemon-system on Debian, not libvirt-bin as on Ubuntu):

sudo apt-get install -y qemu-kvm libvirt-daemon-system
sudo virsh net-start default
sudo virsh net-autostart --network default

I’m going to use a little python helper to launch my VMs, so I need some other dependancies as well:

sudo apt-get install -y python3-pip pkg-config libvirt-dev git

git clone https://github.com/mikalstill/shakenfist
cd shakenfist
git checkout 6bfac153d249752b27d224ad9d079095b640498e

sudo mkdir /srv/shakenfist
sudo cp template.debian.xml /srv/shakenfist/template.xml
sudo pip3 install -r requirements.txt

Let’s launch a quick test VM to make sure the helper works:

sudo python3 daemon.py
sudo virsh list

You can destroy that VM for now, it was just testing the install.

sudo virsh destroy ...name...

Next we need to tweak the template that shakenfist is using to start instances so that it uses the bridge for networking (that template is the one you copied to /srv/shakenfist/template.xml earlier). Replace the interface section in the template with this on both nodes:

<interface type='bridge'>
  <mac address={{eth0_mac}}/>
  <source bridge='br-vxlan0'/>
  <model type='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>

I know the bridge mentioned here doesn’t exist yet, but we’ll deal with that in a second. Before we start VMs though, we need a way of getting IP addresses to them. shakenfist can configure interfaces using config drive, but I’d prefer to use DHCP because who doesn’t love some additional complexity?

On one of the nodes install docker:


sudo apt-get install apt-transport-https ca-certificates curl gnupg2 software-properties-common
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io

Now we can setup DHCP. Create a place for the configuration file:

sudo mkdir /srv/shakenfist/dhcp

And then create the configuration file at /srv/shakenfist/dhcp/dhcpd.conf with contents like this:

default-lease-time 3600;
max-lease-time 7200;
option domain-name-servers 8.8.8.8;
authoritative;

subnet 192.168.200.0 netmask 255.255.255.0 {
  option routers 192.168.1.1;
  option broadcast-address 192.168.1.255;

  pool {
    range 192.168.200.10 192.168.200.254;
  }
}

Before we can start dhcpd, we need to move the VXLAN device into a bridge so we can add a device for the DHCP server to it. First off remove the vxlan0 device from the last post:

sudo ip link set down dev vxlan0
sudo ip link del vxlan0

And now recreate it with a bridge:

sudo ip link add vxlan0 type vxlan id 42 dev eth0 dstport 0
sudo bridge fdb append to 00:00:00:00:00:00 dst 34.70.161.180 dev vxlan0
sudo ip link add br-vxlan0 type bridge
sudo ip link set vxlan0 master br-vxlan0
sudo ip link set vxlan0 up
sudo ip link set br-vxlan0 up
sudo ip link add dhcp-vxlan0 type veth peer name dhcp-vxlan0p
sudo ip link set dhcp-vxlan0p master br-vxlan0
sudo ip link set dhcp-vxlan0 up
sudo ip link set dhcp-vxlan0p up
sudo ip addr add 192.168.200.1/24 dev dhcp-vxlan0

This block of commands:

  • recreated the vxlan0 interface
  • added it to the mesh with the other node again
  • created a bridge named br-vxlan0
  • moved the vxlan0 interface into it
  • created a veth pair called dhcp-vxlan0 and dhcp-vlan0p
  • moved the peer part of that veth pair into the bridge
  • and then configured an IP on the external half of the veth pair

To make the bridge survive reboots you would need to add it to either /etc/network/interfaces or /etc/netplan/01-netcfg.yml depending on your distribution, but that’s outside the scope of this post.

You should be able to ping again. From the other node give it a try:

$ ping 192.168.200.1
PING 192.168.200.1 (192.168.200.1) 56(84) bytes of data.
64 bytes from 192.168.200.1: icmp_seq=1 ttl=64 time=19.3 ms
64 bytes from 192.168.200.1: icmp_seq=2 ttl=64 time=0.571 ms

We need to do something similar on the other node so it can run VMs as well. It is a tiny bit simpler because there wont be any DHCP there however, and remembering that you need to change 35.223.115.132 to the IP of your first node:

sudo ip link set down dev vxlan0
sudo ip link del vxlan0

sudo ip link add vxlan0 type vxlan id 42 dev eth0 dstport 0
sudo  bridge fdb append to 00:00:00:00:00:00 dst 35.223.115.132 dev vxlan0
sudo ip link add br-vxlan0 type bridge
sudo ip link set vxlan0 master br-vxlan0
sudo ip link set vxlan0 up
sudo ip link set br-vxlan0 up

Note that now we can’t do a ping test because the second VM no longer consumes an IP for the base OS.

Now we can start the docker container with dhcpd listening on dhcp-vxlan0:

sudo docker run -it --rm --init --net host -v /srv/shakenfist/dhcp:/data networkboot/dhcpd dhcp-vxlan0

This runs dhcpd interactively so we can see what happens. Now try starting a VM on the other node:

sudo python3 daemon.py

You can watch the VM booting using the “virsh console” command with the name of the vm from “virsh list“. The dhcpd process should show you something like this:

sudo docker run -it --rm --init --net host -v /srv/shakenfist/dhcp:/data networkboot/dhcpd dhcp-vxlan0
Internet Systems Consortium DHCP Server 4.3.5
Copyright 2004-2016 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
Config file: /data/dhcpd.conf
Database file: /data/dhcpd.leases
PID file: /var/run/dhcpd.pid
Wrote 0 leases to leases file.
Listening on LPF/dhcp-vxlan0/06:ff:bc:7d:11:e3/192.168.200.0/24
Sending on   LPF/dhcp-vxlan0/06:ff:bc:7d:11:e3/192.168.200.0/24
Sending on   Socket/fallback/fallback-net
Server starting service.
DHCPDISCOVER from ee:95:4d:40:ca:a6 via dhcp-vxlan0
DHCPOFFER on 192.168.200.10 to ee:95:4d:40:ca:a6 (foo) via dhcp-vxlan0
DHCPREQUEST for 192.168.200.10 (192.168.200.1) from ee:95:4d:40:ca:a6 (foo) via dhcp-vxlan0
DHCPACK on 192.168.200.10 to ee:95:4d:40:ca:a6 (foo) via dhcp-vxlan0

You can see here that our new VM got the IP 192.168.200.10 from the DHCP server! It is moments like this when you don’t realise that this blog post took me hours to write that I feel really smart.

If we started a VM on the first node (the same command as for the second node), we’d now have two VMs on a virtual network which had working DHCP and could ping each other. I think that’s enough for one evening.

Setting up VXLAN on Google Compute Engine

So my ultimate goal here is to try out VXLAN between some VMs on instances in Google compute engine, but today I’m just going to get VXLAN working because that took a fair bit longer than I expected. First off, boot your instances — because I will need nested virt later I chose two instances on Google Cloud. Please note that you need to do a bit of a dance to turn on nested virt there. I also chose to use Debian for this experiment:

gcloud compute instances create vx-1 --zone us-central1-b --min-cpu-platform "Intel Haswell" --image nested-vm-image

Now do those standard things you do to all new instances:

sudo apt-get update
sudo apt-get dist-upgrade -y

Now let’s setup VXLAN between the two nodes, with a big nod to this web page. First create a VXLAN interface on each machine (if you care about the port your VXLAN traffic is on being to IANA standards, see the postscript at the end of this):

sudo ip link add vxlan0 type vxlan id 42 dev eth0 dstport 0

Now we need to put the two nodes into a mesh, where 34.70.161.180 is the IP of the node we are not running this command on and the IP address for the second command needs to be different on each machine.

sudo bridge fdb append to 00:00:00:00:00:00 dst 34.70.161.180 dev vxlan0
sudo ip addr add 192.168.200.1/24 dev vxlan0
sudo ip link set up dev vxlan0

I am pretty sure that this style of mesh (all nodes connected) wouldn’t scale past non-trivial sizes, but hey baby steps right? Finally, because we’re using Google Cloud we need to add firewall rules to allow our traffic into the instances:

Note that these rules are a source of confusion for me right now. I wanted (and configured) VXLAN. So why do I need to allow OTV for this to work? I suspect Linux has politely ignored my request and used OTV not VXLAN for my traffic.

We should now be able to ping those newly configured IP addresses from each machine:

ping 192.168.200.2 -c 1
PING 192.168.200.2 (192.168.200.2) 56(84) bytes of data.
64 bytes from 192.168.200.2: icmp_seq=1 ttl=64 time=1.76 ms

Which produces traffic like this on the underlay network:

tcpdump -n -i eth0 host 34.70.161.180
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:01:58.159092 IP 10.128.0.9.59341 > 34.70.161.180.8472: OTV, flags [I] (0x08), overlay 0, instance 42
IP 192.168.200.1 > 192.168.200.2: ICMP echo request, id 20119, seq 1, length 64
09:01:58.160786 IP 34.70.161.180.48471 > 10.128.0.9.8472: OTV, flags [I] (0x08), overlay 0, instance 42
IP 192.168.200.2 > 192.168.200.1: ICMP echo reply, id 20119, seq 1, length 64
^C
2 packets captured
2 packets received by filter
0 packets dropped by kernel

Hopefully this is helpful to someone else. Thanks again to Joe Julian for a very helpful post.

Postscript: Dale Shaw pointed out on twitter that I might still be talking VXLAN, just on a weird port. This is supported by this comment I found on the internets: “when VXLAN was first implemented in linux, UDP ports were not specified. Many vendors use 8472, and Linux uses the same port. Later, IANA allocated 4789 as the port. If you need to use the IANA port, you need to specify it with dstport”.

Further thoughts on Azure instance start times

My post from the other day about slow instance starts on Azure caused some commentary (mainly on reddit) that prompted me to think more about all this. In the end, there were a few more experiments I wanted to run to see if I could squeeze more performance out of Azure.

First off, looking at the logs from my initial testing it looks like resource groups are slow. The original terraform creates a resource group as part of the test and then cleans it up at the end. What if instead we had a single permanent resource group and created instances within that?

Here is a series of instance starts and deletes using the terraform from the last post:

You’ll notice that there’s no delete value for the last instance. That’s because terraform crashed and never deleted the instance. You can also see that instance starts are somewhat consistent, except for being slower in the second half of the test than the first, and occasionally spiking out to very very slow. Oh, and deletes are almost always really slow.

What happens if we use a permanent resource group and network? This means that all the “instance start terraform” is doing is creating a network interface and then an instance which uses that network interface. It has to be faster, but does it resolve our issues?

The dashed lines are the graph from above, the solid lines are the new data without resource group creation. You can see that abstracting away the resource group work has made a significant performance improvement. Instance start times are now generally under 100 seconds (which is still three times slower than AWS, and four or five times slower than Google).

So is it just that the Australian Azure zones are slow? I re-ran the new terraform against a US datacenter (East US). Here’s a zoom in of just the instance creates with the resource group extracted to make that clearer, for both data centers:

Interestingly, the Australian data center actually performs better than the US one, which isn’t what I would expect at all. You can also see in this test run that we do still see some unexpectedly slow instance launches, although they feel less frequent and smaller when they happen. That might also just be that I’m testing over a weekend and the data center might be more idle.

Looping back, I think we’ve learnt that resource groups are expensive. The last thing I wanted to dig into was what exactly was happening in those spikes where we had resource groups included. Luckily, they were happening about the point I started logging the terraform trace output of the run.

For example, run azure_1576926569_7_0_apply took 18 minutes and 3 seconds to create the instance. For those 18 minutes, terraform logs that the instance was marked by the Azure API as in provisioningState “Creating”. This correlates with operation id c983b272-fa32-4814-b858-adab3da4d9b1 sitting in state “InProgress”, unfortunately there isn’t a reason logged for why that is. So I guess its not possible as an Azure user to work out why things are sometimes slow.

To summarise some advice for terraform users on Azure — don’t create resource groups if you can avoid it. Create global resource groups and then place new objects into them instead. That said, you’re still going to have slower and less consistent performance than other clouds.

Finally, is instance start time a valid metric for cloud performance? Probably not. That said, it is table stakes to be in the conversation. Slow instance starts affect my overall experience of the cloud, as well as the workability of horizontal scaling techniques. This is especially true for instance start times which vary wildly like Azure’s do — I simply can’t trust that I can grow a horizontal scaling set with any sort of reasonable timeframe.

Why is Azure so slow to start instances?

I’ve been playing with terraform recently, and decided to see how different the terraform for launching a simple Ubuntu instance in various clouds is. There are two big questions there for me — how big is the variation between OpenStack derived clouds; and how painful is it to move between the proprietary clouds? Part of this is because terraform doesn’t present a standardised layer of cloud functionality, it has a provider per cloud.

(Although, I suspect there’s nothing stopping someone from writing a libcloud provider or something like that. It is an interesting idea which requires some additional thought.)

My terraform implementations for each cloud are on github if you’re interested. I don’t want to spend a lot of analysis on the actual terraform, because I think the really interesting thing I found isn’t where I expected it to be (there’s a hint in the title for this post). That said, the OpenStack clouds vary mostly by capabilities. vexxhost for example seems to only offer flavors that require boot-from-volume. The proprietary clouds are complete re-writes, but are generally relatively simple and well documented.

However, that interesting accidental thing — as best as I can tell, Microsoft Azure is really really slow to launch instances. The graph below presents five instance launches on each cloud I tested:

As you can see, Vault, Vexxhost, and AWS are basically all in the same ballpark. Google and Azure are outliers, with Google being crazy fast (but also very slow to delete instances, a metric not presented here), and Azure being more than three times slower than everyone else.

Instance launch time isn’t a great metric to be honest, but it does matter. For example if you were trying to autoscale a web tier or a kubernetes cluster, then waiting over two minutes just for the instance to boot before it can be configured and added to the cluster is probably not ok.

I wonder why Azure is so slow?

I did some further exploring after writing this post and was able to improve performance by changing how I handled resource groups in the terraform. The performance still isn’t great though. You can read more about that in a separate post if you’d like.