Minor questions in Linux file semantics

I’ve known for a long time that if you delete a file on Unix / Linux but that file is open somewhere, the blocks used by the file aren’t freed until that user closes the file (or is terminated), but I was left wondering about some other edge cases.

Shaken Fist has a distributed blob store. It also has a cache of images that virtual machines are using. If the blob store and the image cache are on the same filesystem, sometimes the image cache entry can be a hard link to an entry in the blob store (for example, if the entry in the blob store doesn’t need to be transcoded before use by the virtual machine). However, if they are on different file systems, I instead use a symbolic link.

This raises questions — what happens if you rename a file which is open for writing in a program? What happens if you change a symbolic link to point somewhere else while it is open? I suspect in both cases the right thing happens, but I decided I should test these theories out.

Continue reading “Minor questions in Linux file semantics”

Linux bridges have their MTU overwritten when you add an interface

I discovered last night that network bridges on linux have their Maximum Transmission Unit (MTU) overwritten by whatever is the MTU value of the most recent interface added to the bridge. This is bad. Very bad. Specifically this is bad because MTU matters for accurately describing the capabilities of the network path the packets will travel on, so it shouldn’t be clobbered willy nilly.

Here’s an example of the behaviour:

# ip link add egr-br-ens1f0 mtu 1500 type bridge
# ip link show dev egr-br-ens1f0
3: egr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 7e:33:1b:30:d8:00 brd ff:ff:ff:ff:ff:ff
# ip link add egr-eaa64a-o mtu 8950 type veth peer name egr-eaa64a-i
# ip link show dev egr-br-ens1f0
3: egr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 7e:33:1b:30:d8:00 brd ff:ff:ff:ff:ff:ff
# brctl addif egr-br-ens1f0 egr-eaa64a-o
# ip link show dev egr-br-ens1f0
3: egr-br-ens1f0: <BROADCAST,MULTICAST> mtu 8950 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether da:82:cf:34:13:60 brd ff:ff:ff:ff:ff:ff

So you can see here that the bridge had an MTU of 1,500 bytes. We create a veth pair with an MTU of 8,950 bytes and add it to the bridge. Suddenly the bridge’s MTU is 8,950 bytes!

Perhaps this is my fault — brctl is pretty old school. Let’s use only ip commands to configure the bridge.

# ip link add mgr-br-ens1f0 mtu 1500 type bridge
# ip link show dev mgr-br-ens1f0
6: mgr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 82:d8:df:15:40:01 brd ff:ff:ff:ff:ff:ff
# ip link add mgr-eaa64a-o mtu 8950 type veth peer name mgr-eaa64a-i
# ip link show dev mgr-br-ens1f0
6: mgr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 82:d8:df:15:40:01 brd ff:ff:ff:ff:ff:ff
# ip link set mgr-eaa64a-o master mgr-br-ens1f0
# ip link show dev mgr-br-ens1f0
6: mgr-br-ens1f0: <BROADCAST,MULTICAST> mtu 8950 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 22:55:4a:a8:19:00 brd ff:ff:ff:ff:ff:ff

The same problem occurs. Luckily, you can specify the MTU when you add an interface to a bridge, like this:

# ip link add zgr-br-ens1f0 mtu 1500 type bridge
# ip link show dev zgr-br-ens1f0
9: zgr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 7a:54:2c:04:5f:a8 brd ff:ff:ff:ff:ff:ff
# ip link add zgr-eaa64a-o mtu 8950 type veth peer name zgr-eaa64a-i
# ip link show dev zgr-br-ens1f0
9: zgr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 7a:54:2c:04:5f:a8 brd ff:ff:ff:ff:ff:ff
# ip link set zgr-eaa64a-o master zgr-br-ens1f0 mtu 1500
# ip link show dev zgr-br-ens1f0
9: zgr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ae:59:0b:a6:46:94 brd ff:ff:ff:ff:ff:ff

And that works nicely. In my case, this ended up with me writing code to lookup the MTU of the bridge I was adding the interface to, and then specifying that MTU back when adding the interface. I hope this helps someone else.

The KSM and I

I spent much of yesterday playing with KSM (Kernel Shared Memory, or Kernel Samepage Merging depending on which universe you come from). Unix kernels store memory in “pages” which are moved in and out of memory as a single block. On most Linux architectures pages are 4,096 bytes long.

KSM is a Linux Kernel feature which scans memory looking for identical pages, and then de-duplicating them. So instead of having two pages, we just have one and have two processes point at that same page. This has obvious advantages if you’re storing lots of repeating data. Why would you be doing such a thing? Well the traditional answer is virtual machines.

Take my employer’s systems for example. We manage virtual learning environments for students, where every student gets a set of virtual machines to do their learning thing on. So, if we have 50 students in a class, we have 50 sets of the same virtual machine. That’s a lot of duplicated memory. The promise of KSM is that instead of storing the same thing 50 times, we can store it once and therefore fit more virtual machines onto a single physical machine.

For my experiments I used libvirt / KVM on Ubuntu 18.04. To ensure KSM was turned on, I needed to:

  • Ensure KSM is turned on. /sys/kernel/mm/ksm/run should contain a “1” if it is enabled. If it is not, just write “1” to that file to enable it.
  • Ensure libvirt is enabling KSM. The KSM value in /etc/defaults/qemu-kvm should be set to “AUTO”.
  • Check KSM metrics:
# grep . /sys/kernel/mm/ksm/*
/sys/kernel/mm/ksm/full_scans:891
/sys/kernel/mm/ksm/max_page_sharing:256
/sys/kernel/mm/ksm/merge_across_nodes:1
/sys/kernel/mm/ksm/pages_shared:0
/sys/kernel/mm/ksm/pages_sharing:0
/sys/kernel/mm/ksm/pages_to_scan:100
/sys/kernel/mm/ksm/pages_unshared:0
/sys/kernel/mm/ksm/pages_volatile:0
/sys/kernel/mm/ksm/run:1
/sys/kernel/mm/ksm/sleep_millisecs:200
/sys/kernel/mm/ksm/stable_node_chains:49
/sys/kernel/mm/ksm/stable_node_chains_prune_millisecs:2000
/sys/kernel/mm/ksm/stable_node_dups:1055
/sys/kernel/mm/ksm/use_zero_pages:0

My lab machines are currently setup with Shaken Fist, so I just quickly launched a few hundred identical VMs. This first graph is that experiment. Its a little hard to see here but on three machines I consumed about about 40gb of RAM with indentical VMs and then waited. After three or so hours I had saved about 2,500 pages of memory.

To be honest, that’s a pretty disappointing result. 2,5000 4kb pages is only about 10mb of RAM, which isn’t very much at all. Also, three hours is a really long time for our workload, where students often fire up their labs for a couple of hours at a time before shutting them down again. If this was as good as KSM gets, it wasn’t for us.

After some pondering, I realised that KSM is configured by default to not work very well. The default value for pages_to_scan is 100, which means each scan run only inspects about half a megabyte of RAM. It would take a very very long time to scan a modern machine that way. So I tried setting pages_to_scan to 1,000,000,000 instead. One billion is an unreasonably large number for the real world, but hey. You update this number by writing a new value to /sys/kernel/mm/ksm/pages_to_scan.

This time we get a much better result — I launched as many VMs as would fit on each machine, and the sat back and waited (well, went to bed acutally). Again the graph is a bit hard to read, but what it is saying is that after 90 minutes KSM had saved me over 300gb of RAM across the three machines. Its still a little too slow for our workload, but for workloads where the VMs are relatively static that’s a real saving.

Now it should be noted that setting pages_to_scan to 1,000,000,000 comes at a cost — each of these machines now has one of its 48 cores dedicated to scanning memory and deduplicating. For my workload that’s something I am ok with because my workload is not CPU bound, but it might not work for you.

Mirror traffic during the last day of LCA 2007

It seems obvious to me that videos of LCA 2007 are good. Specifically:

 IPTraf
 # Statistics for eth0 ##########################################################
 #                                                                              #
 #               Total      Total    Incoming   Incoming    Outgoing   Outgoing #
 #             Packets      Bytes     Packets      Bytes     Packets      Bytes #
 # Total:       241091    228940K       96646   18025370      144445    210915K #
 # IP:          241091    225548K       96646   16655328      144445    208892K #
 # TCP:         241086    225547K       96643   16655034      144443    208892K #
 # UDP:              4        412           2        266           2        146 #
 # ICMP:             0          0           0          0           0          0 #
 # Other IP:         1         28           1         28           0          0 #
 # Non-IP:           0          0           0          0           0          0 #
 #                                                                              #
 #                                                                              #
 # Total rates:      49188.4 kbits/sec        Broadcast packets:            0   #
 #                    6592.2 packets/sec      Broadcast bytes:              0   #
 #                                                                              #
 # Incoming rates:    3814.2 kbits/sec                                          #
 #                    2714.4 packets/sec                                        #
 #                                            IP checksum errors:           0   #
 # Outgoing rates:   45374.2 kbits/sec                                          #
 #                    3877.8 packets/sec                                        #
 # Elapsed time:   0:00 #########################################################
  X-exit

Yay for LCA 2007 videos.