Quick hack: extracting the contents of a Docker image to disk

Hello! Please note I’ve written a little python tool called Occy Strap which makes this a bit easier, and can do some fancy things around importing and exporting multiple images. You might want to read about it?

For various reasons, I wanted to inspect the contents of a Docker image without starting a container. Docker makes it easy to get an image as a tar file, like this:

docker save -o foo.tar image

But if you extract that tar file you’ll find a configuration file and manifest as JSON files, and then a series of tar files, one per image layer. You use the manifest to determine in what order you extract the tar files to build the container filesystem.

That’s fiddly and annoying. So I wrote this quick python hack to extract an image tarball into a directory on disk that I could inspect:


# Call me like this:
#  docker-image-extract tarfile.tar extracted

import tarfile
import json
import os
import sys

image_path = sys.argv[1]
extracted_path = sys.argv[2]

image = tarfile.open(image_path)
manifest = json.loads(image.extractfile('manifest.json').read())

for layer in manifest[0]['Layers']:
    print('Found layer: %s' % layer)
    layer_tar = tarfile.open(fileobj=image.extractfile(layer))

    for tarinfo in layer_tar:
        print('  ... %s' % tarinfo.name)
        if tarinfo.isdev():
            print('  --> skip device files')
            continue

        dest = os.path.join(extracted_path, tarinfo.name)
        if not tarinfo.isdir() and os.path.exists(dest):
            # A later layer replaces this file, so drop the earlier copy first
            print('  --> remove old version of file')
            os.unlink(dest)

        layer_tar.extract(tarinfo, path=extracted_path)

Hopefully that’s useful to someone else (or future me).

Learning from the mistakes that even big projects make

The following is a blog post version of a talk presented at pyconau 2018. Slides for the presentation can be found here (as Microsoft powerpoint, or as PDF), and a video of the talk (thanks NextDayVideo!) is below:


OpenStack is an orchestration system for setting up virtual machines and other associated virtual resources, such as networks and storage, on clusters of computers. At a high level, OpenStack is just configuring existing facilities of the host operating system. There isn’t really a lot of difference between OpenStack and a room full of system admins frantically resolving tickets requesting that virtual machines be set up. The only real difference is scale and predictability.

To do its job, OpenStack needs to be able to manipulate parts of the operating system which are normally reserved for administrative users. This talk is the story of how OpenStack has done that thing over time, what we learnt along the way, and what I’d do differently if I had my time again. Lots of systems need to do these things, so even if you never use OpenStack hopefully there are things to be learnt here.


I think I found a bug in python’s unittest.mock library

Mocking is a pretty common thing to do in unit tests covering OpenStack Nova code. Over the years we’ve used various mock libraries to do that, with the flavor du jour being unittest.mock. I must say that I strongly prefer unittest.mock to the old mox code we used to write, but I think I just accidentally found a fairly big bug.

The problem is that python mocks are magical. A mock is an object on which you can call any method name, and the mock will happily pretend it has that method and return another mock. You can then later ask what “methods” were called on the mock.

However, you use the same mock object later to make assertions about what was called. Herein is the problem: the mock object doesn’t know if you’re the code under test, or the code that’s making assertions. So, if you fat finger the assertion in your test code, the assertion will just quietly map to a non-existent method which returns another mock, and your test will pass.

Here’s an example:


from unittest import mock

class foo(object):
    def dummy(a, b):
        return a + b

@mock.patch.object(foo, 'dummy')
def call_dummy(mock_dummy):
    f = foo()
    f.dummy(1, 2)

    print('Asserting a call should work if the call was made')
    mock_dummy.assert_has_calls([mock.call(1, 2)])
    print('Assertion for expected call passed')

    print()
    print('Asserting a call should raise an exception if the call wasn\'t made')
    mock_worked = False
    try:
        mock_dummy.assert_has_calls([mock.call(3, 4)])
    except AssertionError as e:
        mock_worked = True
        print('Expected failure, %s' % e)

    if not mock_worked:
        print('*** Assertion should have failed ***')

    print()
    print('Asserting a call where the assertion has a typo should fail, but '
          'doesn\'t')
    mock_worked = False
    try:
        mock_dummy.typo_assert_has_calls([mock.call(3, 4)])
    except AssertionError as e:
        mock_worked = True
        print('Expected failure, %s' % e)

    if not mock_worked:
        print('*** Assertion should have failed ***')
        # Show what the mock recorded, including the mistyped "assertion"
        print(mock_dummy.mock_calls)

if __name__ == '__main__':
    call_dummy()

If I run that code, I get this:

$ python3 mock_assert_errors.py
Asserting a call should work if the call was made
Assertion for expected call passed

Asserting a call should raise an exception if the call wasn't made
Expected failure, Calls not found.
Expected: [call(3, 4)]
Actual: [call(1, 2)]

Asserting a call where the assertion has a typo should fail, but doesn't
*** Assertion should have failed ***
[call(1, 2), call.typo_assert_has_calls([call(3, 4)])]

So, we should have been told that typo_assert_has_calls isn’t a thing, but we weren’t, because the mock silently accepted the call. I discovered this when I noticed an assertion with a (smaller than this) typo in its call in a code review yesterday.

I don’t really have a solution to this right now (I’m home sick and not thinking straight), but it would be interesting to see what other people think.
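One mitigation that may help is asking unittest.mock to build the mock from the real method with autospec=True, so that attributes the mock doesn't know about raise AttributeError instead of being invented on the fly. This is only a sketch under assumptions: the class Foo and the helper typo_is_caught are made up for illustration, and it doesn't help with mocks created without a spec.

```python
from unittest import mock


class Foo(object):
    def dummy(self, a, b):
        return a + b


def typo_is_caught():
    """Return True if a mistyped assertion raises instead of passing."""
    # autospec=True builds the mock from the real method, so unknown
    # attributes are no longer created on demand.
    with mock.patch.object(Foo, 'dummy', autospec=True) as mock_dummy:
        f = Foo()
        f.dummy(1, 2)

        # With autospec, the instance is recorded as the first argument.
        mock_dummy.assert_has_calls([mock.call(f, 1, 2)])

        try:
            mock_dummy.typo_assert_has_calls([mock.call(3, 4)])
        except AttributeError:
            return True
    return False


if __name__ == '__main__':
    print('Typo caught: %s' % typo_is_caught())
```

The trade-off is that every patch has to opt in, so a typo against a plain Mock() would still slip through.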

Python3 venvs for people who are old and grumpy

I’ve been using virtualenvwrapper to make venvs for python2 for probably six or so years. I know it, and understand it. Now some bad man (hi Ramon!) is making me do python3, and virtualenvwrapper just isn’t a thing over there as best as I can tell.

So how do I make a venv? It’s really not too bad…

First, install the dependencies:

    git clone https://github.com/yyuu/pyenv.git ~/.pyenv
    echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
    echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
    echo 'eval "$(pyenv init -)"' >> ~/.bashrc
    git clone https://github.com/yyuu/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv
    source ~/.bashrc

Now to make a venv, do something like this (in this case, infrasot is the name of the venv):

    mkdir -p ~/.virtualenvs/pyenv-infrasot
    cd ~/.virtualenvs/pyenv-infrasot
    pyenv virtualenv system infrasot

You can see your installed venvs like this:

    $ pyenv versions
    * system (set by /home/user/.pyenv/version)

Where system is the system installed python, and not a venv. To activate and deactivate the venv, do this:

    $ pyenv activate infrasot
    $ ... stuff you're doing ...
    $ pyenv deactivate

I’ll probably write wrappers at some point so that this looks like virtualenvwrapper, but it’s good enough for now.

Multiple file support with scp

Paramiko doesn’t provide a scp implementation, so I’ve been using my own for a while.

http://blogs.sun.com/janp/entry/how_the_scp_protocol_works (link now unfortunately dead) provides good documentation about the scp protocol, but it missed out on one detail I needed — how to send more than one file in a given session. In the end I implemented a simple scp logger to see what the protocol was doing during the copying of files. My logger said this:

>>> New command invocation: /usr/bin/scp -d -t /tmp
O: \0
I: C0644 21 a\n
O: \0
I: file a file a file a\n\0
O: \0
I: C0644 21 b\n
O: \0
I: file b file b file b\n\0
O: \0
>>> stdin closed
>>> stdout closed
>>> stderr closed

It turns out it’s important to wait for those zeros, by the way. So, here’s my implementation of the protocol to send more than one file. Turning this into paramiko code is left as an exercise for the reader.


import fcntl
import os
import select
import string
import subprocess
import sys
import traceback

def printable(s):
  # Render raw protocol bytes so that control characters show up in the log
  out = ''

  for c in s.decode('latin-1'):
    if c == '\n':
      out += '\\n'
    elif c in string.printable:
      out += c
    else:
      out += '\\%d' % ord(c)

  return out

try:
  # One header and one data message per file, sent as bytes
  dialog = [b'C0644 21 c\n',
            b'file c file c file c\n\0',
            b'C0644 21 d\n',
            b'file d file d file d\n\0']

  proc = subprocess.Popen(['scp', '-v', '-d', '-t', '/tmp'],
                          stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE)

  r = [proc.stdout, proc.stderr]
  w = []
  e = [proc.stdout, proc.stderr]

  fl = fcntl.fcntl(proc.stdout, fcntl.F_GETFL)
  fcntl.fcntl(proc.stdout, fcntl.F_SETFL, fl | os.O_NONBLOCK)
  fl = fcntl.fcntl(proc.stderr, fcntl.F_GETFL)
  fcntl.fcntl(proc.stderr, fcntl.F_SETFL, fl | os.O_NONBLOCK)

  stdin_closed = False
  while proc.returncode is None:
    (readable, _, errorable) = select.select(r, w, e)

    for flo in readable:
      if flo == proc.stdout:
        d = os.read(proc.stdout.fileno(), 1024)
        if len(d) > 0:
          sys.stdout.write('O: %s\n' % printable(d))

          # Only send the next message once scp has acknowledged the last one
          if len(dialog) > 0:
            sys.stdout.write('I: %s\n' % printable(dialog[0]))
            os.write(proc.stdin.fileno(), dialog[0])
            dialog = dialog[1:]

          if len(dialog) == 0 and not stdin_closed:
            proc.stdin.close()
            sys.stdout.write('>>> stdin closed\n')
            stdin_closed = True

        else:
          sys.stdout.write('>>> stdout closed\n')
          r.remove(proc.stdout)
          e.remove(proc.stdout)

      elif flo == proc.stderr:
        d = os.read(proc.stderr.fileno(), 1024)
        if len(d) > 0:
          sys.stdout.write('E: %s\n' % printable(d))
        else:
          sys.stdout.write('>>> stderr closed\n')
          r.remove(proc.stderr)
          e.remove(proc.stderr)

      else:
        sys.stdout.write('>>> Unknown readable: %s: %s\n'
                         %(repr(flo), flo.read()))

    for flo in errorable:
      sys.stdout.write('>>> Error on %s\n' % repr(flo))

    # Once both pipes have closed, wait for scp to exit; otherwise just poll
    if not r:
      proc.wait()
    else:
      proc.poll()

  print('#: %s' % proc.returncode)

except Exception:
  exc = sys.exc_info()
  for tb in traceback.format_exception(exc[0], exc[1], exc[2]):
    print(tb)
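The hard-coded dialog above follows the same two-message pattern for every file: a header of the form "C<mode> <length> <name>\n", then the file contents followed by a zero byte. As a rough sketch, assuming that reading of the logged session is right, the messages could be generated like this. The helper name scp_file_messages is mine, not part of scp or paramiko, and it only covers plain files.

```python
def scp_file_messages(name, data, mode=0o644):
    """Build the two messages scp expects for one file in sink (-t) mode.

    The caller still has to wait for a zero byte from scp before sending
    each message, as the logged session shows.
    """
    header = ('C%04o %d %s\n' % (mode, len(data), name)).encode('ascii')
    return [header, data + b'\0']


if __name__ == '__main__':
    dialog = []
    for name, data in [('c', b'file c file c file c\n'),
                       ('d', b'file d file d file d\n')]:
        dialog.extend(scp_file_messages(name, data))
    print(dialog)
```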

Python effective TLD library bug fix

Some cool people commented on bugs in the etld library in the previous post about it. I’ve taken the opportunity to fix the bug, and a new release is now available at http://www.stillhq.com/python/etld/etld.py. If you’ve got specific examples of domains which either didn’t work previously, or don’t work now, let me know. I want to add unit tests to this code ASAP.

Python effective TLD library update

The effective TLD library is now being used for a couple of projects of mine, but I’ve had some trouble with it being almost unusably slow. I ended up waking up this morning with the revelation that the problem is that I use regexps to match domain names, but the failure of a match occurs at the end of a string. That means that the FSA has to scan the entire string before it gets to decide that it isn’t a match. That’s expensive.

I ran some tests on tweaks to try and fix this. Without any changes, scanning 1,000 semi-random domain names took 6.941666 seconds. I then tweaked the implementation to reverse the strings it was scanning, and that more than halved the run time of the test to 3.212203 seconds. That’s a big improvement, but still way too slow. The next thing I tried was adding buckets of rules on top of those reverse matches. In other words, the code now assumes that anything after the last dot is some form of TLD approximation, and only executes rules which also have that string after the last dot. This was a massive improvement, with 1,000 domains taking only 0.026120 seconds.
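To make the two tweaks concrete, here is a minimal sketch of reversed matching combined with per-TLD buckets. It is not the code from etld.py: the rule list is a tiny made-up subset, the names build_buckets and match_rule are hypothetical, and it ignores the wildcard and exception rules in the real Mozilla list.

```python
import re
from collections import defaultdict

# Illustrative subset only, not the real rule list
RULES = ['com', 'com.au', 'edu.au', 'co.uk']


def build_buckets(rules):
    """Group reversed-rule regexps by the text after the last dot."""
    buckets = defaultdict(list)
    # Longest rules first, so com.au is tried before a shorter rule
    for rule in sorted(rules, key=len, reverse=True):
        tld = rule.rsplit('.', 1)[-1]
        # Reversing the rule means a mismatch shows up at the start of
        # the string rather than after scanning all of it.
        pattern = re.compile(re.escape(rule[::-1]) + r'(\.|$)')
        buckets[tld].append((rule, pattern))
    return buckets


def match_rule(hostname, buckets):
    """Return the first matching rule, checking only the relevant bucket."""
    tld = hostname.rsplit('.', 1)[-1]
    reversed_host = hostname[::-1]
    for rule, pattern in buckets.get(tld, []):
        if pattern.match(reversed_host):
            return rule
    return None


if __name__ == '__main__':
    buckets = build_buckets(RULES)
    for host in ['www.example.com.au', 'example.com', 'example.org']:
        print(host, match_rule(host, buckets))
```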

I’ve updated the code at http://www.stillhq.com/python/etld/etld.py.

Python effective TLD library

I had a need recently for a library which would take a host name and return the domain-specific portion of the name, and the effective TLD being used. “Effective TLD” is a term coined by the Mozilla project for something which acts like a TLD. For example, .com is a TLD and has domains allocated under it. However, .au is a TLD with no domains under it. The effective TLDs for the .au domain are things like .com.au and .edu.au. Whilst there are libraries for other languages, I couldn’t find anything for python.

I therefore wrote one. It’s very simple, and not optimal. For example, I could do most of the processing with a single regexp if python supported more than 100 match groups in a regexp, but it doesn’t. I’m sure I’ll end up revisiting this code sometime in the future. Additionally, the code ended up being much easier to write than I expected, mainly because the Mozilla project has gone to the trouble of building a list of rules to determine the effective TLD of a host name. This is awesome, because it saved me heaps and heaps of work.

The code is at http://www.stillhq.com/python/etld/etld.py if you’re interested.
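As an illustration of the kind of answer such a library gives, here is a small sketch that splits a host name into the part before the effective TLD and the effective TLD itself. The function name split_hostname and the hand-written set of effective TLDs are assumptions for this example and are not the interface of etld.py.

```python
def split_hostname(hostname, effective_tlds):
    """Return (name portion, effective TLD), or (hostname, None) if unknown."""
    labels = hostname.lower().split('.')

    # Try the longest candidate suffix first, so com.au wins over au
    for i in range(1, len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in effective_tlds:
            return '.'.join(labels[:i]), candidate

    return hostname, None


if __name__ == '__main__':
    tlds = {'com', 'com.au', 'edu.au'}
    print(split_hostname('www.example.com.au', tlds))
    print(split_hostname('example.com', tlds))
```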

Calculating a SSH host key with paramiko

I needed to compare a host key from something other than a known_hosts file with what paramiko reports as part of the SSH connection today. If you must know, the host keys for these machines are retrieved via an XMLRPC API… It turned out to be a lot easier than I thought. Here’s how I produced the host key entry as it appears in that API (as well as in the known_hosts file):

    # A host key calculation example for Paramiko.
    # Args:
    #   1: hostname
    import socket
    import sys

    import paramiko

    # Socket connection to remote host
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((sys.argv[1], 22))

    # Build a SSH transport and negotiate, so that the server key is available
    t = paramiko.Transport(sock)
    t.start_client()
    key = t.get_remote_server_key()

    # get_base64() gives the key in the base64 form that known_hosts stores
    print('%s %s' % (key.get_name(), key.get_base64()))
    t.close()

Note that I could also have constructed a paramiko key object based on the output of the XMLRPC API and then compared those two objects, but I prefer the human readable strings.
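If you stick with strings, the comparison itself can be as small as the sketch below. It assumes both entries are in the "keytype base64blob" form printed above; the helper name same_host_key is made up, and the blobs in the example are dummy bytes rather than real keys.

```python
import base64


def same_host_key(entry_a, entry_b):
    """Compare two 'keytype base64blob' host key entries."""
    type_a, blob_a = entry_a.split()[:2]
    type_b, blob_b = entry_b.split()[:2]

    # Decode before comparing so that stray whitespace doesn't matter
    return (type_a == type_b and
            base64.b64decode(blob_a) == base64.b64decode(blob_b))


if __name__ == '__main__':
    # Dummy blobs for illustration only
    blob = base64.b64encode(b'not a real key').decode('ascii')
    other = base64.b64encode(b'another fake key').decode('ascii')
    print(same_host_key('ssh-rsa ' + blob, 'ssh-rsa ' + blob + '\n'))
    print(same_host_key('ssh-rsa ' + blob, 'ssh-rsa ' + other))
```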

Killing a blocking thread in python?

It seems that there is no way of killing a blocking thread in python? The standard way of implementing thread death seems to be to implement an exit() method on the class which is the thread, and then call that when you want the thread to die. However, if the run() method of the thread class is blocking when you call exit(), then the thread doesn’t get killed. I can’t find a way of killing these threads cleanly on Linux — does anyone have any hints?
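For reference, the exit() pattern described above looks roughly like the sketch below. It does not answer the question: it only works because the blocking call is given a timeout, so the thread gets a chance to notice the stop request. The Worker class and its attribute names are made up for the example, and a call that cannot time out would still hang. Marking the thread as a daemon at least stops it from keeping the process alive at interpreter exit.

```python
import queue
import threading


class Worker(threading.Thread):
    def __init__(self):
        # daemon=True means a stuck thread will not block interpreter exit
        super().__init__(daemon=True)
        self._stop_requested = threading.Event()
        self.work = queue.Queue()
        self.processed = []

    def run(self):
        while not self._stop_requested.is_set():
            try:
                # Bounded wait: without the timeout this get() would block
                # forever and exit() would never be noticed.
                item = self.work.get(timeout=0.1)
            except queue.Empty:
                continue
            self.processed.append(item)

    def exit(self):
        self._stop_requested.set()


if __name__ == '__main__':
    w = Worker()
    w.start()
    w.work.put('job')
    w.exit()
    w.join(2.0)
    print('Still alive: %s' % w.is_alive())
```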