Writing my own KVM client in python

I was talking to a friend the other day about our mutual appreciation of virtio-vsock, and it made me wonder: how do virtual machines on Linux actually work? I know it involves qemu and the kernel’s KVM virtual machine implementation, but exactly how do they interact? How does the kernel get qemu to do emulation tasks as required?

qemu is several things hanging out together in a trench coat, but one of those things is software which can configure Linux’s built-in KVM virtual machine functionality to run a virtual machine, and then handle emulation of the devices attached to that virtual machine which cannot be represented with actual physical hardware. This part of qemu is called a “KVM client” in the Linux kernel documentation. It’s called that because, if we ignore the emulation part for now, it is literally just a client calling established Linux kernel APIs.

I was mildly surprised to find that this topic is reasonably well documented and actually not particularly hard to implement once you know what to look for. I should also admit that I have been playing with AI models recently, and Anthropic’s Sonnet 4.5 was actually quite helpful. I don’t know how other people use these models, but I’ve fallen into two patterns of usage. In this case I used Sonnet like a search engine, with prompts like:

Please help me come up with a good project name which is a pun on python and kvm.

It came up with “pypervisor”, which I think is actually a pretty good name even though it does not include KVM in the pun at all. Or this example:

What do I pass to libc.ioctl when there is no data to pass?

I would then use the results of those “searches” to write my code.

The other method of usage is more trusting, where I have very recently started using Claude Code to generate code for me. So far I’ve only done this for private projects I will never release publicly, and it’s definitely a great way to churn out lots of code quickly, although I have had problems with that code being incorrect or calling hallucinated APIs. This mode of operation is also noticeably worse as a way of learning, as you’re just presented with a finished product without any of the iterative process to get there.

I have also rapidly found that my biggest concern with these models is that they don’t provide supporting references for their statements, which is something we expect middle school kids to figure out. I would be much more comfortable if they could provide links to the sources they used, both so I could validate what they said and so the underlying authors could get more credit and incentive to keep writing. Google’s “AI Mode” sort of seems to do this, so it doesn’t seem completely impossible.

Instead, I’ve been using prompts like this:

Can you recommend any web pages which demonstrate using python to create a KVM virtual machine?

Or, from another session the other day on an unrelated topic:

Please provide me with some links to web pages which further discuss this topic.

So having said all that, the following references were quite helpful:

  • This YouTube video entitled “How to write your own KVM client from scratch – Murilo Opsfelder Araújo” is a conference talk from linuxdev-br and kicked off this whole tangent, so deserves credit.
  • That said, he heavily references this LWN article from 2015, which is actually super helpful.
  • This 11 minute conference talk is shorter, and is given by the author of the LWN article. It’s quite interesting, but skips a lot of the lower level details I was specifically interested in. It also discusses why you might want to do something like this, apart from mere curiosity. The answer basically comes down to being able to run code in a tightly constrained environment.
  • This seven year old git repository implements pretty much what I wrote, but much better. It’s even in python too. That said, because it’s a lot more competent, it’s also a lot longer, which makes it harder for newcomers to comprehend.
  • This blog post also covers the material reasonably well.

In the end, it turns out the kernel API for running a virtual machine with KVM really isn’t that bad, especially once you’ve accepted that the API doesn’t look at all like what a userspace developer would expect. Instead of making calls to a library like glibc, you need to open /dev/kvm, and then use a series of ioctl calls to configure your virtual machine. Objects you create along the way are tracked as file descriptors, which makes sense given ioctls want to happen on files.

I don’t have a strong history of providing tutorial content, so I am not entirely sure how to present my example while also producing a single block of usable code. While the entire project is on github, I am including the main body of the code here for ease of reference. For that main body I have opted to just put in a lot of comments. Feedback on this approach is welcome.

#!/usr/bin/python3

# A not very good KVM client / virtual machine manager written in python.
# Based heavily on the excellent https://lwn.net/Articles/658511/.
# Development was assisted by Claude Sonnet 4.5, and US Intellectual
# Property law.


import ctypes
import fcntl
import mmap
import sys

from displayhelpers import *
from exitcodes import *
from ioctls import *
from structs import *


# A single 4kb page
MEM_SIZE = 0x1000


def main():
    # Open the KVM device file. This gives us a top level reference to the
    # KVM API which we can then make global calls against.
    with open('/dev/kvm', 'rb+', buffering=0) as kvm:
        try:
            # Check that the API is a supported version. This should
            # basically never fail on a modern kernel.
            api_version = fcntl.ioctl(kvm, KVM_GET_API_VERSION)
            print(f'KVM API version: {api_version}')
            if api_version != 12:
                print(f'KVM API version {api_version} was unexpected')
                sys.exit(1)
        except OSError as e:
            print(
                f'Failed to lookup KVM API version: {e.errno} - {e.strerror}'
                )
            sys.exit(1)

        # Create a VM file descriptor. This is the "object" which tracks the
        # virtual machine we are creating.
        print()
        try:
            vm = fcntl.ioctl(kvm, KVM_CREATE_VM)
            print(f'VM file descriptor: {vm}')
        except OSError as e:
            print(f'Failed to create a VM: {e.errno} - {e.strerror}')
            sys.exit(1)

        # mmap memory for the VM to use. Sonnet 4.5 alleges that we need to
        # use mmap here instead of just allocating a largeish byte array in
        # native python for a few reasons: the allocation needs to be
        # page-aligned; mmap'ed memory can be zero-copied into the virtual
        # machine, a python array cannot; MAP_SHARED means our python process
        # can inspect the state of the virtual machine's memory; python
        # memory allocations are not at a stable location -- python might
        # rearrange things. So yeah, those seem like reasons to me.
        mem = mmap.mmap(
            -1,
            MEM_SIZE,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
            flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS,
            offset=0
        )
        mem_buf = (ctypes.c_char * len(mem)).from_buffer(mem)
        mem_addr = ctypes.addressof(mem_buf)
        print(f'VM memory page is at 0x{mem_addr:x}')

        # This is the data structure we're going to pass to the kernel to
        # tell it about all this memory we have allocated.
        region_s = kvm_userspace_memory_region_t()
        region_s.slot = 0
        region_s.flags = 0
        region_s.guest_phys_addr = 0
        region_s.memory_size = MEM_SIZE
        region_s.userspace_addr = mem_addr

        try:
            # This dance copies the data structure into a bytes object.
            # fcntl.ioctl then hands the kernel a pointer to a copy of those
            # bytes, which is what the kernel is expecting.
            region_bytes = ctypes.string_at(
                ctypes.addressof(region_s), ctypes.sizeof(region_s))
            fcntl.ioctl(vm, KVM_SET_USER_MEMORY_REGION, region_bytes)
        except OSError as e:
            print(f'Failed to map memory into VM: {e.errno} - {e.strerror}')
            sys.exit(1)

        # Add a vCPU to the VM. The vCPU is another object we can do things
        # to later.
        try:
            # The zero here is the index of the vCPU, this one being of
            # course our first.
            vcpu = fcntl.ioctl(vm, KVM_CREATE_VCPU, 0)
            print(f'vCPU file descriptor: {vcpu}')
        except OSError as e:
            print(f'Failed to create a vCPU: {e.errno} - {e.strerror}')
            sys.exit(1)

        # mmap the CPU state structure from the kernel to userspace. We need
        # to lookup the size of the structure, and the LWN article notes:
        # "Note that the mmap size typically exceeds that of the kvm_run
        # structure, as the kernel will also use that space to store other
        # transient structures that kvm_run may point to".
        try:
            kvm_run_size = fcntl.ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE)
        except OSError as e:
            print(
                f'Failed to lookup kvm_run struct size: {e.errno} - '
                f'{e.strerror}'
            )
            sys.exit(1)

        print()
        print(f'The KVM run structure is {kvm_run_size} bytes')

        kvm_run = mmap.mmap(
            vcpu,
            kvm_run_size,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
            flags=mmap.MAP_SHARED,
            offset=0
        )
        kvm_run_s = kvm_run_t.from_buffer(kvm_run)
        kvm_run_addr = ctypes.addressof(kvm_run_s)
        print(f'vCPU KVM run structure is at 0x{kvm_run_addr:x}')
        print()
        print(pretty_print_struct(kvm_run_s))
        print()

        # Read the initial state of the vCPU special registers
        sregs = kvm_sregs_t()
        fcntl.ioctl(vcpu, KVM_GET_SREGS, sregs)
        print('Initial vCPU special registers state')
        print()
        print(pretty_print_sregs(sregs))
        print()

        # Setup sregs per the LWN article. cs by default points to the
        # reset vector at 16 bytes below the top of memory. We want to start
        # at the beginning of memory instead.
        sregs.cs.base = 0
        sregs.cs.selector = 0
        fcntl.ioctl(vcpu, KVM_SET_SREGS, sregs)

        # Read back to validate the change
        sregs = kvm_sregs_t()
        fcntl.ioctl(vcpu, KVM_GET_SREGS, sregs)
        print('CS updated vCPU special registers state')
        print()
        print(pretty_print_sregs(sregs))
        print()

        # Read the initial state of the vCPU standard registers
        regs = kvm_regs_t()
        fcntl.ioctl(vcpu, KVM_GET_REGS, regs)
        print('Initial vCPU standard registers state')
        print()
        print(pretty_print_struct(regs))
        print()

        # Setup regs per the LWN article. We set the instruction pointer (IP)
        # to 0x0 relative to the CS at 0, set RAX and RBX to 2 each as our
        # initial inputs to our program, and set the flags to 0x2 as this is
        # documented as the start state of the CPU. Note that the LWN article
        # originally had the code at 0x1000, which is super confusing because
        # that's outside the 4kb of memory we actually allocated.
        regs.rip = 0x0
        regs.rax = 2
        regs.rbx = 2
        regs.rflags = 0x2
        fcntl.ioctl(vcpu, KVM_SET_REGS, regs)

        # Read back to validate the change
        regs = kvm_regs_t()
        fcntl.ioctl(vcpu, KVM_GET_REGS, regs)
        print('Updated vCPU standard registers state')
        print()
        print(pretty_print_struct(regs))
        print()

        # Set the memory to contain our simple demo program, which is from
        # the LWN article again. It's important to note that the memory we
        # mapped earlier is accessible to _both_ this userspace program and
        # the vCPU, so we can totally poke around in it if we want.
        program = bytes([
            0xba,       # mov $0x3f8, %dx
            0xf8,
            0x03,
            0x00,       # add %bl, %al
            0xd8,
            0x04,       # add $'0', %al
            ord('0'),
            0xee,       # out %al, (%dx)
            0xb0,       # mov $'\n', %al
            ord('\n'),
            0xee,       # out %al, (%dx)
            0xf4,       # hlt
        ])
        mem[0:len(program)] = program

        # And we now enter into the VMM main loop, which is where we sit for
        # the lifetime of the virtual machine. Each return from the ioctl is
        # called a "VM Exit" and indicates that a protection violation in
        # the vCPU has signalled a request for us to do something.
        while True:
            print('Running...')
            fcntl.ioctl(vcpu, KVM_RUN)
            kvm_run_s = kvm_run_t.from_buffer(kvm_run)
            exit_reason = VM_EXIT_CODES.get(
                kvm_run_s.exit_reason,
                f'Unknown exit reason: {kvm_run_s.exit_reason}'
                )
            print(f'VM exit: {exit_reason}')
            print()

            match exit_reason:
                case 'KVM_EXIT_HLT':
                    print('Program complete (halted)')
                    sys.exit(0)

                # Claude had the direction values backwards. qemu definitely
                # agrees with this though.
                # https://github.com/qemu/qemu/blob/master/linux-headers/linux/kvm.h#L245
                case 'KVM_EXIT_IO':
                    io = kvm_run_s.exit_reasons.io
                    print(pretty_print_struct(io))
                    print()

                    # We only handle output to ioport 0x03f8 for now
                    if io.direction == KVM_EXIT_IO_OUT and io.port == 0x3f8:
                        data_ptr = ctypes.addressof(kvm_run_s) + io.data_offset
                        data = ctypes.string_at(data_ptr, io.size)
                        try:
                            data_str = data.decode('ascii')
                        except UnicodeDecodeError:
                            data_str = 'failed to decode ASCII'
                        print(f'Output: {data}, {data_str}')

                    else:
                        print('Not yet implemented...')
                        sys.exit(1)

                case 'KVM_EXIT_SHUTDOWN':
                    print('VM shutdown')
                    sys.exit(0)

                case 'KVM_EXIT_INTERNAL_ERROR':
                    print('Internal errors are probably bad?')
                    sys.exit(1)

                case _:
                    print(f'Unhandled VM exit: {exit_reason}')
                    sys.exit(1)


if __name__ == '__main__':
    main()
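The listing also imports its ctypes structures from a structs module which is not shown above. As a rough sketch only, and not the project's actual definitions, two of those structures might look like the following. The field layouts are assumed to mirror struct kvm_userspace_memory_region and the x86-64 struct kvm_regs from the kernel's kvm.h header.

```python
import ctypes


# Assumed to mirror struct kvm_userspace_memory_region in linux/kvm.h.
class kvm_userspace_memory_region_t(ctypes.Structure):
    _fields_ = [
        ('slot', ctypes.c_uint32),
        ('flags', ctypes.c_uint32),
        ('guest_phys_addr', ctypes.c_uint64),
        ('memory_size', ctypes.c_uint64),
        ('userspace_addr', ctypes.c_uint64),
    ]


# Assumed to mirror the x86-64 struct kvm_regs: sixteen general purpose
# registers, followed by the instruction pointer and the flags register.
class kvm_regs_t(ctypes.Structure):
    _fields_ = [
        (name, ctypes.c_uint64) for name in (
            'rax', 'rbx', 'rcx', 'rdx', 'rsi', 'rdi', 'rsp', 'rbp',
            'r8', 'r9', 'r10', 'r11', 'r12', 'r13', 'r14', 'r15',
            'rip', 'rflags',
        )
    ]


# Fill in a region the same way the main listing does. The userspace
# address is left as zero because this sketch has no real mapping.
region_s = kvm_userspace_memory_region_t()
region_s.slot = 0
region_s.flags = 0
region_s.guest_phys_addr = 0
region_s.memory_size = 0x1000
region_s.userspace_addr = 0

# ctypes structures support the buffer protocol, so bytes() copies out
# the raw in-memory representation of the structure.
region_bytes = bytes(region_s)

print(f'memory region struct: {ctypes.sizeof(region_s)} bytes')
print(f'regs struct: {ctypes.sizeof(kvm_regs_t)} bytes')
```

The structure sizes matter because they are encoded into the ioctl request numbers, so a mismatched layout would most likely show up as the kernel rejecting the call.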

I hope someone else finds this interesting too.
