Collisions in MD5 sums

This is kind of a big deal. Cryptographic has functions are used in a lot of computer science circles to take a large document and turn it into a relatively small description of the document. The transformation has a couple of interesting properties:

  • It’s one way — which means that I can know that I have your document, without checking the contents. There are secure file systems out there that when you give it a file give you back the ID for the file, and that’s how you access it in the future. Don’t know the ID? You can’t possibly have seen the file.
  • They’re meant to be unique — you can’t possibly have no overlap between bazillions of documents and the comparatively few IDs available, but it’s meant to be very hard to get two documents with the same ID. This is commonly used for CD downloads for instance where people want to be sure that you got the file intended completely, or to make sure that you’re not storing information twice. EMC for instance has an email storage system which only saves an email if the MD5 ID is new, otherwise it must be a duplicate.

It turns out that reuse of IDs (called collisions) might be a lot easier to find that people thought. There’s an interesting paper which shows a trivial example of a collision. Bruce Schneier has some thoughts on the issue.