Dealing with remote HTTP servers with buggy chunking implementations

HTTP 1.1 uses chunked transfer encoding to let a server send a response as a series of chunks, each prefixed with its size in hexadecimal, so the client knows how much data is coming without the server having to declare the total length up front. That in turn lets more than one response share a single HTTP connection. Unfortunately for me, the site I was trying to access has a buggy chunking implementation, and that causes the somewhat fragile Python urllib2 code to throw an exception:

Traceback (most recent call last):
  File "./mythingie.py", line 55, in ?
    xml = remote.readlines()
  File "/usr/lib/python2.4/socket.py", line 382, in readlines
    line = self.readline()
  File "/usr/lib/python2.4/socket.py", line 332, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/lib/python2.4/httplib.py", line 460, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.4/httplib.py", line 499, in _read_chunked
    chunk_left = int(line, 16)
ValueError: invalid literal for int():
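
To make that last frame less mysterious, here is a rough sketch of what a chunked decoder does. It is written for Python 3 rather than the 2.4 in the traceback, and decode_chunked is a made-up helper for illustration, not httplib's actual code. I don't know what the server really sent; an empty chunk-size line is just one input that would produce this kind of error.

```python
import io


def decode_chunked(stream):
    """Decode an HTTP/1.1 chunked body from a binary file-like object.

    Hypothetical helper for illustration only; trailers are ignored.
    """
    body = b""
    while True:
        line = stream.readline()
        # Drop any chunk extension after ';' plus the trailing CRLF.
        size_field = line.split(b";", 1)[0].strip()
        # The chunk size is hexadecimal; a malformed line raises ValueError.
        chunk_size = int(size_field, 16)
        if chunk_size == 0:
            break  # a zero-size chunk marks the end of the body
        body += stream.read(chunk_size)
        stream.readline()  # consume the CRLF that follows the chunk data
    return body


# A well-formed body: two chunks, then the terminating zero-size chunk.
good = io.BytesIO(b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n")
print(decode_chunked(good))

# A body with an empty line where a chunk size should be.
bad = io.BytesIO(b"5\r\nhello\r\n\r\n0\r\n\r\n")
try:
    decode_chunked(bad)
except ValueError as exc:
    print("bad chunk header:", exc)
```

The second call fails inside int(size_field, 16), the same operation as the chunk_left = int(line, 16) line in the traceback.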

I muttered about this earlier today, including finding the bug tracking the problem in pythonistan. However, finding that the bug had been marked “will not fix” wasn’t satisfying enough…

It turns out you can just have urllib2 lie to the server about which HTTP version it speaks. HTTP 1.0 has no chunking, so the server has to send the response unchunked. Here’s my sample code for how to do that:

import httplib
import urllib2

class HTTP10Connection(httplib.HTTPConnection):
  """HTTP10Connection -- an HTTP connection which is forced to ask for HTTP
     1.0
  """

  # Private httplib attribute: the version string sent on the request line
  _http_vsn_str = 'HTTP/1.0'

class HTTP10Handler(urllib2.HTTPHandler):
  """HTTP10Handler -- don't use HTTP 1.1"""

  def http_open(self, req):
    return self.do_open(HTTP10Connection, req)

# ...

  request = urllib2.Request(feed)
  request.add_header('User-Agent', 'mythingie')
  opener = urllib2.build_opener(HTTP10Handler())

  remote = opener.open(request)
  content = remote.readlines()
  remote.close()
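
For anyone on Python 3, where httplib and urllib2 became http.client and urllib.request, the same trick might look roughly like the sketch below. This is an untested-in-the-wild port, not part of the original hack: it assumes the private _http_vsn_str attribute keeps behaving the way it did in httplib, and fetch_lines is a hypothetical wrapper added for illustration.

```python
import http.client
import urllib.request


class HTTP10Connection(http.client.HTTPConnection):
    """An HTTP connection that announces itself as HTTP/1.0."""

    # Private attribute (assumption: still used to build the request line).
    _http_vsn_str = 'HTTP/1.0'


class HTTP10Handler(urllib.request.HTTPHandler):
    """Open http:// URLs through HTTP10Connection instead of the default."""

    def http_open(self, req):
        return self.do_open(HTTP10Connection, req)


def fetch_lines(url, user_agent='mythingie'):
    """Hypothetical helper: fetch url over "HTTP/1.0" and return its lines."""
    request = urllib.request.Request(url)
    request.add_header('User-Agent', user_agent)
    opener = urllib.request.build_opener(HTTP10Handler())
    remote = opener.open(request)
    try:
        return remote.readlines()
    finally:
        remote.close()
```

Calling fetch_lines(feed) would then stand in for the opener code above. Only plain http:// URLs go through this handler; https:// would need a matching HTTPSHandler subclass.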

I hereby declare myself Michael Still, bringer of the gross python hacks.