Further thoughts on Azure instance start times

My post from the other day about slow instance starts on Azure caused some commentary (mainly on reddit) that prompted me to think more about all this. In the end, there were a few more experiments I wanted to run to see if I could squeeze more performance out of Azure.

First off, the logs from my initial testing suggest that resource groups are slow. The original terraform creates a resource group as part of the test and then cleans it up at the end. What if we instead had a single permanent resource group and created instances within that?

Here is a series of instance starts and deletes using the terraform from the last post:

You’ll notice that there’s no delete value for the last instance. That’s because terraform crashed and never deleted it. You can also see that instance starts are somewhat consistent, apart from being slower in the second half of the test than the first, and occasionally spiking out to very, very slow. Oh, and deletes are almost always really slow.

What happens if we use a permanent resource group and network? This means that all the “instance start terraform” is doing is creating a network interface and then an instance which uses that network interface. It has to be faster, but does it resolve our issues?
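Concretely, the timed terraform ends up looking something like the sketch below. This is illustrative rather than my exact code, and the group and network names are assumptions; the point is that the resource group and subnet come in as data sources, so the apply only creates the NIC and the instance:

```hcl
# Look up the permanent resource group and network instead of
# creating them as part of the timed run.
data "azurerm_resource_group" "permanent" {
  name = "instance-test"
}

data "azurerm_subnet" "permanent" {
  name                 = "default"
  virtual_network_name = "instance-test-net"
  resource_group_name  = data.azurerm_resource_group.permanent.name
}

# The only resources the timed apply now creates: this NIC, and a VM
# (not shown) which references azurerm_network_interface.test.id.
resource "azurerm_network_interface" "test" {
  name                = "test-nic"
  location            = data.azurerm_resource_group.permanent.location
  resource_group_name = data.azurerm_resource_group.permanent.name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = data.azurerm_subnet.permanent.id
    private_ip_address_allocation = "Dynamic"
  }
}
```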

The dashed lines are the data from the graph above, and the solid lines are the new data without resource group creation. You can see that moving the resource group work out of the timed run has made a significant performance improvement. Instance start times are now generally under 100 seconds (which is still three times slower than AWS, and four or five times slower than Google).

So is it just that the Australian Azure zones are slow? I re-ran the new terraform against a US data center (East US). To make the comparison clearer, here’s a zoomed-in view of just the instance creates, with the resource group work factored out, for both data centers:

Interestingly, the Australian data center actually performs better than the US one, which isn’t what I would expect at all. You can also see in this test run that we still get some unexpectedly slow instance launches, although they feel less frequent and smaller when they do happen. That might just be because I’m testing over a weekend, when the data center is presumably more idle.

Looping back, I think we’ve learnt that resource groups are expensive. The last thing I wanted to dig into was what exactly was happening in those spikes where resource groups were included. Luckily, they were happening around the point I started logging the terraform trace output of the runs (terraform’s TF_LOG and TF_LOG_PATH environment variables make that easy).

For example, run azure_1576926569_7_0_apply took 18 minutes and 3 seconds to create the instance. For those 18 minutes, terraform logs that the instance was marked by the Azure API as being in provisioningState “Creating”. This correlates with operation id c983b272-fa32-4814-b858-adab3da4d9b1 sitting in the state “InProgress”; unfortunately there isn’t a reason logged for why that is. So I guess it’s not possible as an Azure user to work out why things are sometimes slow.

To summarise some advice for terraform users on Azure: avoid creating and deleting resource groups as part of your regular runs. Instead, create long-lived resource groups up front and then place new objects into them. That said, you’re still going to have slower and less consistent performance than on other clouds.

Finally, is instance start time a valid metric for cloud performance? Probably not, but it is table stakes for being in the conversation. Slow instance starts affect my overall experience of a cloud, as well as the workability of horizontal scaling techniques. This is especially true when instance start times vary as wildly as Azure’s do: I simply can’t trust that I can grow a horizontally scaled set of instances in any sort of reasonable timeframe.

Why is Azure so slow to start instances?

I’ve been playing with terraform recently, and decided to see how different the terraform for launching a simple Ubuntu instance is in various clouds. There are two big questions there for me: how big is the variation between OpenStack-derived clouds, and how painful is it to move between the proprietary clouds? Part of this is because terraform doesn’t present a standardised layer of cloud functionality; it has a provider per cloud.

(Although, I suspect there’s nothing stopping someone from writing a libcloud provider or something like that. It is an interesting idea which requires some additional thought.)

My terraform implementations for each cloud are on github if you’re interested. I don’t want to spend a lot of analysis on the actual terraform, because I think the really interesting thing I found isn’t where I expected it to be (there’s a hint in the title of this post). That said, the OpenStack clouds vary mostly by capabilities; Vexxhost, for example, seems to only offer flavors that require boot-from-volume. The proprietary clouds are complete rewrites, but are generally relatively simple and well documented.
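To give a flavour of what these look like, here’s a minimal sketch of the AWS variant. This is illustrative rather than the exact code from my repository, and assumes Canonical’s official Ubuntu AMIs:

```hcl
provider "aws" {
  region = "us-east-1"
}

# Find the most recent official Ubuntu 18.04 AMI published by
# Canonical (owner id 099720109477).
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*"]
  }
}

resource "aws_instance" "test" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"
}
```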

However, on to that interesting accidental finding: as best I can tell, Microsoft Azure is really, really slow to launch instances. The graph below presents five instance launches on each cloud I tested:

As you can see, Vault, Vexxhost, and AWS are basically all in the same ballpark. Google and Azure are outliers, with Google being crazy fast (but also very slow to delete instances, a metric not presented here), and Azure being more than three times slower than everyone else.

Instance launch time isn’t a great metric to be honest, but it does matter. For example if you were trying to autoscale a web tier or a kubernetes cluster, then waiting over two minutes just for the instance to boot before it can be configured and added to the cluster is probably not ok.

I wonder why Azure is so slow?

I did some further exploring after writing this post and was able to improve performance by changing how I handled resource groups in the terraform. The performance still isn’t great though. You can read more about that in a separate post if you’d like.

Python effective TLD library update

The effective TLD library is now being used for a couple of projects of mine, but I’ve had some trouble with it being almost unusably slow. I woke up this morning with the revelation that the problem is that I use regexps to match domain names, and a failed match is only detected at the end of the string. That means that the FSA has to scan the entire string before it can decide that it isn’t a match. That’s expensive.

I ran some tests of tweaks to try and fix this. Without any changes, scanning 1,000 semi-random domain names took 6.941666 seconds. I then tweaked the implementation to reverse the strings it was scanning, which halved the run time of the test to 3.212203 seconds. That’s a big improvement, but still way too slow. The next thing I tried was adding buckets of rules on top of those reversed matches. In other words, the code now assumes that anything after the last dot is some form of TLD approximation, and only executes rules which also have that string after their last dot. This was a massive improvement, with 1,000 domains taking only 0.026120 seconds.
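To give a flavour of the approach, here’s a simplified sketch (the rules are made up for illustration; the real library loads the full effective TLD list):

```python
import re
from collections import defaultdict

# Made-up rules for illustration; the real library loads the full
# effective TLD list.
rules = ['com', 'com.au', 'co.uk', 'uk']

# Bucket the rules by their last label, and store each rule as a
# reversed, start-anchored regexp so a mismatch fails on the first
# few characters instead of at the end of the string.
buckets = defaultdict(list)
for rule in rules:
    last_label = rule.rsplit('.', 1)[-1]
    pattern = re.compile(re.escape(rule[::-1]) + r'(\.|$)')
    buckets[last_label].append((rule, pattern))

def effective_tld(domain):
    """Return the longest matching rule for domain, or None."""
    best = None
    # Only try rules whose last label matches the domain's last label.
    for rule, pattern in buckets.get(domain.rsplit('.', 1)[-1], []):
        if pattern.match(domain[::-1]) and (best is None or len(rule) > len(best)):
            best = rule
    return best

print(effective_tld('www.example.com.au'))  # prints: com.au
```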

I’ve updated the code at http://www.stillhq.com/python/etld/etld.py.

MySQL Tech Talks

Three intrepid MySQLers came to Google after the user conference to give internal tech talks. They were kind enough to agree to us hosting them for other people to see. The first two are up, so I’ll mention those now, and put a link to the last one when it’s available…

Click on the thumbnail to be taken to the video.

Jay Pipes is a co-author of the recently published Pro MySQL (Apress, 2005), which covers all of the newest MySQL 5 features, as well as in-depth discussion and analysis of the MySQL server architecture, storage engines, transaction processing, benchmarking, and advanced SQL scenarios. You can also see his name on articles appearing in Linux Magazine and can read more articles about MySQL at his website. Jay Pipes is MySQL’s Community Relations Manager for North America.

Learn where to best focus your attention when tuning the performance of your applications and database servers, and how to effectively find the “low hanging fruit” on the tree of bottlenecks. It’s not rocket science, but with a bit of acquired skill and experience, and of course good habits, you too can do this magic!

Timour Katchaounov

The first part of this talk describes the main principles behind MySQL’s query optimiser and execution engine, how the optimiser transforms queries into executable query plans, what these plans look like, and how they are executed.

The second part of the talk describes the major improvements in the query engine of MySQL 5.0, and how these improvements can benefit users of MySQL 5.0. The “greedy” optimiser reduces compilation time for big queries by orders of magnitude. The “index merge” access method provides a way to use more than one index for the same query. For faster plan execution and to allow better join orders, the 5.0 optimiser transforms most outer joins into inner joins.

The outer joins that cannot be transformed into inner ones are executed in a pipelined manner, so that no intermediate results need to be materialised. Finally, some GROUP BY and DISTINCT queries can be executed much faster thanks to the “loose index scan” technique, which reads only a fraction of an index.
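To make the loose index scan concrete, here’s a small hypothetical example (my own, not from the talk):

```sql
-- With a composite index on (c1, c2), MySQL 5.0 can answer this
-- query by jumping to the first entry of each c1 group rather than
-- reading the whole index; EXPLAIN reports "Using index for
-- group-by" when this optimisation applies.
CREATE TABLE t1 (c1 INT, c2 INT, c3 INT, INDEX idx (c1, c2));

EXPLAIN SELECT c1, MIN(c2) FROM t1 GROUP BY c1;
```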

The talk concludes with the near-future plans for new features coming in the next versions of MySQL.

Stewart Smith works for MySQL AB as a software engineer working on MySQL Cluster. He is an active member of the free and open source software community, especially in Australia. Although Australian, he does not dress like Steve Irwin—although if he wrestled crocodiles he probably would. He is a fan of great coffee, great beer, and is currently 39,000 feet above sea level.

Part 1 – Introduction to MySQL Cluster: The NDB storage engine (MySQL Cluster) is a high-availability storage engine for MySQL. It provides synchronous replication between storage nodes, with many mysql servers having a consistent view of the database. In 4.1 and 5.0 it’s a main-memory database, but in 5.1 non-indexed attributes can be stored on disk. NDB also provides a lot of determinism in system resource usage. I’ll talk a bit about that.

Part 2 – New features in 5.1, including cluster-to-cluster replication, disk-based data, and a bunch of other things. Anybody attending the MySQL users conference may find this eerily familiar.

I can also talk about the latest-and-totally-greatest developments and future stuff we’re working on. I can also take questions and constructive abuse 🙂

You can see a complete list of the MySQL tech talks at Google here.

Update: added Stewart’s talk now that it is online.