The KSM and I


I spent much of yesterday playing with KSM (Kernel Shared Memory, or Kernel Samepage Merging depending on which universe you come from). Unix kernels manage memory in “pages”, each of which is moved in and out of RAM as a single block. On most Linux architectures pages are 4,096 bytes long.

KSM is a Linux kernel feature which scans memory looking for identical pages, and then de-duplicates them. So instead of having two pages, we just have one and have two processes point at that same page. This has obvious advantages if you’re storing lots of repeating data. Why would you be doing such a thing? Well, the traditional answer is virtual machines.

Take my employer’s systems for example. We manage virtual learning environments for students, where every student gets a set of virtual machines to do their learning thing on. So, if we have 50 students in a class, we have 50 sets of the same virtual machine. That’s a lot of duplicated memory. The promise of KSM is that instead of storing the same thing 50 times, we can store it once and therefore fit more virtual machines onto a single physical machine.

For my experiments I used libvirt / KVM on Ubuntu 18.04. To get KSM running, I needed to:

  • Ensure KSM is turned on. /sys/kernel/mm/ksm/run contains “1” when KSM is enabled. If it does not, write “1” to that file to enable it.
  • Ensure libvirt is enabling KSM. The KSM value in /etc/default/qemu-kvm should be set to “AUTO”.
  • Check KSM metrics:
# grep . /sys/kernel/mm/ksm/*
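
If you would rather collect those counters from a script than read the grep output by eye, something like the following Python sketch might do. The function name read_ksm_metrics is made up for illustration, and the sketch assumes every file of interest holds a single integer; anything else is skipped.

```python
from pathlib import Path


def read_ksm_metrics(ksm_dir="/sys/kernel/mm/ksm"):
    """Return {file name: integer value} for the KSM files under ksm_dir.

    Hypothetical helper: it returns an empty dict if the directory is
    missing (for example on a kernel built without KSM).
    """
    directory = Path(ksm_dir)
    if not directory.is_dir():
        return {}

    metrics = {}
    for entry in sorted(directory.iterdir()):
        if not entry.is_file():
            continue
        try:
            metrics[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Skip files we cannot read or that do not hold a plain integer.
            continue
    return metrics
```

Calling read_ksm_metrics() on a machine with KSM available should give a dictionary with keys such as run and pages_to_scan. The kernel documentation describes pages_sharing as the counter that approximates how many pages have been saved.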

My lab machines are currently set up with Shaken Fist, so I just quickly launched a few hundred identical VMs. This first graph is that experiment. It’s a little hard to see here, but on three machines I consumed about 40 GB of RAM with identical VMs and then waited. After three or so hours I had saved about 2,500 pages of memory.

To be honest, that’s a pretty disappointing result. 2,500 4 KB pages is only about 10 MB of RAM, which isn’t very much at all. Also, three hours is a really long time for our workload, where students often fire up their labs for a couple of hours at a time before shutting them down again. If this was as good as KSM gets, it wasn’t for us.
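
The arithmetic behind that figure is simple enough to sketch in Python. The helper name pages_to_megabytes is hypothetical, and it assumes the 4,096 byte page size mentioned earlier and decimal megabytes.

```python
PAGE_SIZE = 4096  # bytes; assumed page size, as on most Linux architectures


def pages_to_megabytes(pages, page_size=PAGE_SIZE):
    """Convert a count of merged pages into decimal megabytes saved."""
    return pages * page_size / 1_000_000


print(pages_to_megabytes(2500))  # 10.24
```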

After some pondering, I realised that KSM is configured by default to not work very well. The default value for pages_to_scan is 100, which means each scan run only inspects about 400 KB of RAM (100 pages of 4,096 bytes each). It would take a very long time to scan a modern machine that way. So I tried setting pages_to_scan to 1,000,000,000 instead. One billion is an unreasonably large number for the real world, but hey. You update this number by writing a new value to /sys/kernel/mm/ksm/pages_to_scan.
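
To put numbers on that, here is a rough Python sketch. The function names are invented for illustration, the 4,096 byte page size is an assumption, and the estimate ignores the pause between scan runs (which is governed by a separate tunable, sleep_millisecs), so treat it as a back-of-the-envelope figure only. Writing to the sysfs file needs root.

```python
PAGE_SIZE = 4096  # bytes; assumed page size


def bytes_per_scan_run(pages_to_scan, page_size=PAGE_SIZE):
    """How much memory a single scan run inspects."""
    return pages_to_scan * page_size


def scan_runs_needed(total_bytes, pages_to_scan, page_size=PAGE_SIZE):
    """Number of scan runs needed to look at total_bytes once (rounded up)."""
    per_run = bytes_per_scan_run(pages_to_scan, page_size)
    return -(-total_bytes // per_run)  # integer ceiling division


def set_pages_to_scan(value, path="/sys/kernel/mm/ksm/pages_to_scan"):
    """Write a new pages_to_scan value; requires root on a real system."""
    with open(path, "w") as handle:
        handle.write(f"{value}\n")


print(bytes_per_scan_run(100))             # 409600
print(scan_runs_needed(40 * 10**9, 100))   # 97657
```

So with the default setting, a single pass over 40 GB of guest memory takes nearly a hundred thousand scan runs, which goes some way to explaining the slow first experiment.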

This time we get a much better result. I launched as many VMs as would fit on each machine, and then sat back and waited (well, went to bed actually). Again the graph is a bit hard to read, but what it is saying is that after 90 minutes KSM had saved me over 300 GB of RAM across the three machines. It’s still a little too slow for our workload, but for workloads where the VMs are relatively static that’s a real saving.

Now it should be noted that setting pages_to_scan to 1,000,000,000 comes at a cost: each of these machines now has one of its 48 cores dedicated to scanning memory and deduplicating. For my workload that’s something I am OK with because my workload is not CPU bound, but it might not work for you.


Image handlers (in Essex)


In the comments on my previous post about loop and nbd devices, George asks an interesting question about the behavior of this code on Essex. I figured the question was worth bringing out into its own post so that it’s more visible. I’ve edited George’s question lightly so that this blog post flows reasonably.

Can you please explain the order (and conditions) in which the three methods are used? In my Essex installation, the “img_handlers” is not defined in nova.conf, so it takes the default value “loop,nbd,guestfs”. However, nova is using nbd as the chosen method.

The handlers will be used in the order specified — with the caveat that loop doesn’t support Copy On Write (COW) images and will therefore be skipped if the libvirt driver is trying to create a COW image. Whether COW images are used is configured with the use_cow_images flag, which defaults to True. So, loop is being skipped because you’re probably using COW images.
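
As an illustration only, the rule described above could be sketched like this in Python. This is not nova’s actual code: the function name usable_handlers is hypothetical, and the sketch captures just the ordering and the “skip loop for COW images” caveat.

```python
def usable_handlers(img_handlers="loop,nbd,guestfs", use_cow_images=True):
    """Return the handlers that would be tried, in order.

    Hypothetical sketch: handlers keep their configured order, and loop
    is dropped when COW images are in use because it cannot handle them.
    """
    handlers = [name.strip() for name in img_handlers.split(",") if name.strip()]
    if use_cow_images:
        handlers = [name for name in handlers if name != "loop"]
    return handlers


print(usable_handlers())                      # ['nbd', 'guestfs']
print(usable_handlers(use_cow_images=False))  # ['loop', 'nbd', 'guestfs']
```

With the defaults George describes, nbd ends up first in the list, which matches what he is seeing.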

My ssh keys are obtained by cloud-init, and still whenever I start a new instance I see in the nova-compute.logs this sequence of events:

qemu-nbd -c /dev/nbd15 /var/lib/nova/instances/instance-0000076d/disk
kpartx -a /dev/nbd15
mount /dev/mapper/nbd15p1 /tmp/tmpxGBdT0
umount /dev/mapper/nbd15p1
kpartx -d /dev/nbd15
qemu-nbd -d /dev/nbd15

I don’t understand why the mount of the first partition is necessary and what happens when the partition is mounted.

This is a bit harder than the first bit of the question. What I think is happening is that there are files being injected, and that’s causing the mount. Just because the admin password isn’t being injected doesn’t mean that other things aren’t being injected still. You’d be able to tell what’s happening by grepping your logs for “Injecting .* into image” and seeing what shows up.
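
If grep isn’t handy, a small Python sketch can do the same search. The function name find_injections is made up for illustration, the pattern is the one quoted above, and you would need to supply the path to your own nova-compute log.

```python
import re

# The same pattern suggested for grep above.
INJECT_RE = re.compile(r"Injecting .* into image")


def find_injections(lines):
    """Return the log lines that mention something being injected."""
    return [line.rstrip("\n") for line in lines if INJECT_RE.search(line)]


def find_injections_in_file(log_path):
    """Convenience wrapper: scan a log file given its path."""
    with open(log_path, errors="replace") as handle:
        return find_injections(handle)
```

Each matching line should name what is being injected, which would tell you why the partition is being mounted.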