Minor questions in Linux file semantics

I’ve known for a long time that if you delete a file on Unix / Linux but that file is open somewhere, the blocks used by the file aren’t freed until that user closes the file (or is terminated), but I was left wondering about some other edge cases.

Shaken Fist has a distributed blob store. It also has a cache of images that virtual machines are using. If the blob store and the image cache are on the same filesystem, sometimes the image cache entry can be a hard link to an entry in the blob store (for example, if the entry in the blob store doesn’t need to be transcoded before use by the virtual machine). However, if they are on different file systems, I instead use a symbolic link.

This raises questions — what happens if you rename a file which is open for writing in a program? What happens if you change a symbolic link to point somewhere else while it is open? I suspect in both cases the right thing happens, but I decided I should test these theories out.
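Both questions can be checked from a shell without any special tooling. The following is a minimal sketch, assuming a typical local Linux filesystem and a POSIX shell. The function names and the temporary directory are only for illustration and are not taken from Shaken Fist.

```shell
# Writes through an open file descriptor follow the inode, not the name.
demo_rename_while_open() {
    dir=$(mktemp -d) || return 1
    exec 3>"$dir/before"            # open the file for writing on fd 3
    echo "first" >&3
    mv "$dir/before" "$dir/after"   # rename while the descriptor is open
    echo "second" >&3               # still lands in the same file
    exec 3>&-                       # close fd 3
    cat "$dir/after"
    rm -rf "$dir"
}

# A symlink is resolved once, at open time. Retargeting it later does
# not move a descriptor that is already open.
demo_retarget_symlink_while_open() {
    dir=$(mktemp -d) || return 1
    : > "$dir/a"
    : > "$dir/b"
    ln -s "$dir/a" "$dir/link"
    exec 3>"$dir/link"              # resolves to file "a"
    ln -sf "$dir/b" "$dir/link"     # point the link at "b" instead
    echo "hello" >&3                # written to "a", not "b"
    exec 3>&-
    echo "a=$(cat "$dir/a") b=$(cat "$dir/b")"
    rm -rf "$dir"
}

demo_rename_while_open              # prints "first" then "second"
demo_retarget_symlink_while_open    # prints "a=hello b="
```

In both cases the already-open descriptor keeps pointing at the original inode, which is the behaviour I suspected.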


Linux bridges have their MTU overwritten when you add an interface

I discovered last night that network bridges on Linux have their Maximum Transmission Unit (MTU) overwritten by the MTU of the most recent interface added to the bridge. This is bad. Very bad. Specifically, this is bad because the MTU describes the capabilities of the network path the packets will travel on, so it shouldn’t be clobbered willy-nilly.

Here’s an example of the behaviour:

# ip link add egr-br-ens1f0 mtu 1500 type bridge
# ip link show dev egr-br-ens1f0
3: egr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 7e:33:1b:30:d8:00 brd ff:ff:ff:ff:ff:ff
# ip link add egr-eaa64a-o mtu 8950 type veth peer name egr-eaa64a-i
# ip link show dev egr-br-ens1f0
3: egr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 7e:33:1b:30:d8:00 brd ff:ff:ff:ff:ff:ff
# brctl addif egr-br-ens1f0 egr-eaa64a-o
# ip link show dev egr-br-ens1f0
3: egr-br-ens1f0: <BROADCAST,MULTICAST> mtu 8950 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether da:82:cf:34:13:60 brd ff:ff:ff:ff:ff:ff

So you can see here that the bridge had an MTU of 1,500 bytes. We create a veth pair with an MTU of 8,950 bytes and add it to the bridge. Suddenly the bridge’s MTU is 8,950 bytes!

Perhaps this is my fault — brctl is pretty old school. Let’s use only ip commands to configure the bridge.

# ip link add mgr-br-ens1f0 mtu 1500 type bridge
# ip link show dev mgr-br-ens1f0
6: mgr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 82:d8:df:15:40:01 brd ff:ff:ff:ff:ff:ff
# ip link add mgr-eaa64a-o mtu 8950 type veth peer name mgr-eaa64a-i
# ip link show dev mgr-br-ens1f0
6: mgr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 82:d8:df:15:40:01 brd ff:ff:ff:ff:ff:ff
# ip link set mgr-eaa64a-o master mgr-br-ens1f0
# ip link show dev mgr-br-ens1f0
6: mgr-br-ens1f0: <BROADCAST,MULTICAST> mtu 8950 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 22:55:4a:a8:19:00 brd ff:ff:ff:ff:ff:ff

The same problem occurs. Luckily, you can specify the MTU when you add an interface to a bridge, like this:

# ip link add zgr-br-ens1f0 mtu 1500 type bridge
# ip link show dev zgr-br-ens1f0
9: zgr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 7a:54:2c:04:5f:a8 brd ff:ff:ff:ff:ff:ff
# ip link add zgr-eaa64a-o mtu 8950 type veth peer name zgr-eaa64a-i
# ip link show dev zgr-br-ens1f0
9: zgr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 7a:54:2c:04:5f:a8 brd ff:ff:ff:ff:ff:ff
# ip link set zgr-eaa64a-o master zgr-br-ens1f0 mtu 1500
# ip link show dev zgr-br-ens1f0
9: zgr-br-ens1f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ae:59:0b:a6:46:94 brd ff:ff:ff:ff:ff:ff

And that works nicely. In my case, this ended up with me writing code to look up the MTU of the bridge I was adding the interface to, and then specifying that MTU when adding the interface. I hope this helps someone else.
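A rough shell sketch of that lookup-then-add step might look like the following. It is not the code I actually wrote. It assumes the bridge's MTU is readable from /sys/class/net/<bridge>/mtu, and the SYSFS_NET and IP_CMD variables are hypothetical overrides that exist only so the sketch can be dry-run without root.

```shell
# Hypothetical overrides, useful for a dry run; the defaults are the real paths.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}
IP_CMD=${IP_CMD:-ip}

# Print the current MTU of a network device, as reported by sysfs.
bridge_mtu() {
    cat "$SYSFS_NET/$1/mtu"
}

# Add an interface to a bridge, pinning the interface to the bridge's
# existing MTU so that the bridge's MTU is left unchanged.
add_to_bridge() {
    iface=$1
    bridge=$2
    mtu=$(bridge_mtu "$bridge") || return 1
    $IP_CMD link set "$iface" master "$bridge" mtu "$mtu"
}

# Example (needs root): add_to_bridge zgr-eaa64a-o zgr-br-ens1f0
```

Setting IP_CMD=echo prints the command that would be run instead of running it, which is a cheap way to check what the function will do.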

The KSM and I

I spent much of yesterday playing with KSM (Kernel Shared Memory, or Kernel Samepage Merging depending on which universe you come from). Unix kernels manage memory in “pages”, each of which is moved in and out of memory as a single block. On most Linux architectures pages are 4,096 bytes long.

KSM is a Linux kernel feature which scans memory looking for identical pages, and then de-duplicates them. So instead of having two pages, we just have one and have two processes point at that same page. This has obvious advantages if you’re storing lots of repeating data. Why would you be doing such a thing? Well, the traditional answer is virtual machines.

Take my employer’s systems for example. We manage virtual learning environments for students, where every student gets a set of virtual machines to do their learning thing on. So, if we have 50 students in a class, we have 50 sets of the same virtual machine. That’s a lot of duplicated memory. The promise of KSM is that instead of storing the same thing 50 times, we can store it once and therefore fit more virtual machines onto a single physical machine.

For my experiments I used libvirt / KVM on Ubuntu 18.04. To get KSM working, I needed to:

  • Ensure KSM is turned on. /sys/kernel/mm/ksm/run should contain a “1” if it is enabled. If it is not, just write “1” to that file to enable it.
  • Ensure libvirt is enabling KSM. The KSM value in /etc/default/qemu-kvm should be set to “AUTO”.
  • Check KSM metrics:
# grep . /sys/kernel/mm/ksm/*
/sys/kernel/mm/ksm/full_scans:891
/sys/kernel/mm/ksm/max_page_sharing:256
/sys/kernel/mm/ksm/merge_across_nodes:1
/sys/kernel/mm/ksm/pages_shared:0
/sys/kernel/mm/ksm/pages_sharing:0
/sys/kernel/mm/ksm/pages_to_scan:100
/sys/kernel/mm/ksm/pages_unshared:0
/sys/kernel/mm/ksm/pages_volatile:0
/sys/kernel/mm/ksm/run:1
/sys/kernel/mm/ksm/sleep_millisecs:200
/sys/kernel/mm/ksm/stable_node_chains:49
/sys/kernel/mm/ksm/stable_node_chains_prune_millisecs:2000
/sys/kernel/mm/ksm/stable_node_dups:1055
/sys/kernel/mm/ksm/use_zero_pages:0

My lab machines are currently set up with Shaken Fist, so I just quickly launched a few hundred identical VMs. This first graph is that experiment. It’s a little hard to see here, but on three machines I consumed about 40 GB of RAM with identical VMs and then waited. After three or so hours I had saved about 2,500 pages of memory.

To be honest, that’s a pretty disappointing result. 2,500 4 KB pages is only about 10 MB of RAM, which isn’t very much at all. Also, three hours is a really long time for our workload, where students often fire up their labs for a couple of hours at a time before shutting them down again. If this was as good as KSM gets, it wasn’t for us.
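The arithmetic is just pages multiplied by page size: 2,500 × 4,096 bytes is 10,240,000 bytes. As a small sketch, the same number can be pulled from the live counters. This assumes pages_sharing is the right counter to read as "pages saved", which is how the kernel documentation describes it. The KSM_DIR variable and the function name are hypothetical and exist only to make the sketch easy to try.

```shell
# Hypothetical override; the default is the real sysfs location.
KSM_DIR=${KSM_DIR:-/sys/kernel/mm/ksm}

# Estimate the bytes saved by KSM: each page counted in pages_sharing is
# a mapping that would otherwise have needed its own physical page.
ksm_saved_bytes() {
    pages=$(cat "$KSM_DIR/pages_sharing") || return 1
    page_size=$(getconf PAGESIZE)
    echo $((pages * page_size))
}

# Example: ksm_saved_bytes
```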

After some pondering, I realised that KSM is configured by default to not work very well. The default value for pages_to_scan is 100, which means each scan run only inspects about 400 KB of RAM (100 pages of 4,096 bytes). It would take a very very long time to scan a modern machine that way. So I tried setting pages_to_scan to 1,000,000,000 instead. One billion is an unreasonably large number for the real world, but hey. You update this number by writing a new value to /sys/kernel/mm/ksm/pages_to_scan.
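To put a rough number on "a very very long time", here is a back-of-the-envelope sketch. It assumes KSM scans pages_to_scan pages and then sleeps for sleep_millisecs, and it counts only the sleeping, so the result is a lower bound on one full pass. The function name and its arguments are made up for this example.

```shell
# Lower bound, in seconds, on one full KSM pass over total_bytes of
# mergeable memory. Only the sleep between batches is counted.
ksm_full_scan_seconds() {
    total_bytes=$1
    pages_to_scan=$2
    sleep_ms=$3
    page_size=${4:-4096}
    pages=$((total_bytes / page_size))
    # Round up to a whole number of scan batches.
    batches=$(( (pages + pages_to_scan - 1) / pages_to_scan ))
    echo $((batches * sleep_ms / 1000))
}

# 40 GiB with the defaults shown above (100 pages, 200 ms sleep):
ksm_full_scan_seconds $((40 * 1024 * 1024 * 1024)) 100 200   # prints 20971
```

That is nearly six hours for a single pass over 40 GiB, which is consistent with how little got merged in three hours.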

This time we get a much better result. I launched as many VMs as would fit on each machine, and then sat back and waited (well, went to bed actually). Again the graph is a bit hard to read, but what it is saying is that after 90 minutes KSM had saved me over 300 GB of RAM across the three machines. It’s still a little too slow for our workload, but for workloads where the VMs are relatively static that’s a real saving.

Now it should be noted that setting pages_to_scan to 1,000,000,000 comes at a cost — each of these machines now has one of its 48 cores dedicated to scanning memory and deduplicating. For my workload that’s something I am ok with because my workload is not CPU bound, but it might not work for you.

The last week for linux.conf.au 2019 proposals!

Dear humans of the Internet — there is ONE WEEK LEFT to propose talks for linux.conf.au 2019. LCA is one of the world’s best open source conferences, and we’d love to hear you speak!
 
Unsure what to propose? Not sure if your talk is what the conference would normally take? Just want a chat? You’re welcome to reach out to papers-chair@linux.org.au to talk things through.
 
https://linux.conf.au/call-for-papers/

Giving serial devices meaningful names

This is a hack I’ve been using for ages, but I thought it deserved a write up.

I have USB serial devices. Lots of them. I use them for home automation things, as well as for talking to devices such as the console ports on switches and so forth. For the permanently installed serial devices one of the challenges is having them show up in predictable places so that the scripts which know how to drive each device are talking to the right place.

For the trivial case, this is pretty easy with udev:

$ cat /etc/udev/rules.d/60-local.rules
KERNEL=="ttyUSB*", \
    ATTRS{idVendor}=="0403", ATTRS{idProduct}=="6001", \
    ATTRS{serial}=="A8003Ye7", \
    SYMLINK+="radish"

This says that for any USB serial device that is discovered (either inserted post boot, or at boot), if the USB vendor ID, product ID, and serial number match the relevant values, udev should symlink the device to “/dev/radish”.

You find out the vendor and product ID from lsusb like this:

$ lsusb
Bus 003 Device 003: ID 0624:0201 Avocent Corp.
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 007 Device 002: ID 0665:5161 Cypress Semiconductor USB to Serial
Bus 007 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 006 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 002: ID 0403:6001 Future Technology Devices International, Ltd FT232 Serial (UART) IC
Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 009 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 008 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

You can play with inserting and removing the device to determine which of these entries is the device you care about.
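lsusb does not show the serial number used in the rule above by default. One way to see it is udevadm's attribute walk, which prints the attributes of the device and of each of its parents. The filter below is a small sketch that trims that output down to the three keys used in the rule. The function name is made up, and /dev/ttyUSB0 is only an example path.

```shell
# Keep only the idVendor, idProduct and serial attributes from an
# attribute walk read on stdin, with leading indentation stripped.
usb_match_keys() {
    grep -E 'ATTRS\{(idVendor|idProduct|serial)\}' | sed 's/^[[:space:]]*//'
}

# Example (needs a real device):
#   udevadm info --attribute-walk --name=/dev/ttyUSB0 | usb_match_keys
```

The walk also lists parent devices such as the root hub, so expect more than one set of values. All the ATTRS keys in a rule have to match on the same parent, so take all three values from the same block of the walk.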

So that’s great, until you have more than one device with the same USB vendor and product ID. Then things are a bit more… difficult.

It turns out that you can have udev execute a command on device insert to help you determine what symlink to create. So for example, I have this entry in the rules on one of my machines:

KERNEL=="ttyUSB*", \
    ATTRS{idVendor}=="067b", ATTRS{idProduct}=="2303", \
    PROGRAM="/usr/bin/usbtest /dev/%k", \
    SYMLINK+="%c"

This results in /usr/bin/usbtest being run with the path of the device file on its command line for every device detection (of a matching device). The stdout of that program is then used as the name of a symlink in /dev.

So, that script attempts to talk to the device and determine what it is — in my case either a currentcost or a solar panel inverter.
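For illustration, a probe script of that general shape could look something like the sketch below. This is not my actual usbtest. The classification is an assumption: it treats a line containing “<msg>” as coming from the currentcost and anything else as the inverter, and it leaves out serial port setup such as baud rate, which a real script would need to do with stty.

```shell
# Hypothetical stand-in for /usr/bin/usbtest. The matching logic here
# is an assumption made for the example, not the real protocol check.

# Decide on a symlink name from one line of device output.
classify() {
    case "$1" in
        *"<msg>"*) echo "currentcost" ;;
        *)         echo "inverter" ;;
    esac
}

# Read a single line from the device, giving up after five seconds so
# that udev is not left waiting on a silent device.
probe() {
    dev=$1
    line=$(timeout 5 head -n 1 "$dev" 2>/dev/null)
    classify "$line"
}

# When installed as the udev PROGRAM, the script would end with:
#   probe "$1"
```

Whatever the script prints on stdout becomes %c in the rule, so it should print exactly one name and nothing else.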