Some cool people commented on bugs in the etld library in the previous post about it. I’ve taken the opportunity to fix the bug, and a new release is now available at http://www.stillhq.com/python/etld/etld.py. If you’ve got specific examples of domains which either didn’t work previously, or don’t work now, let me know. I want to add unit tests to this code ASAP.
The effective TLD library is now being used for a couple of projects of mine, but I’ve had some trouble with it being almost unusably slow. I woke up this morning with the revelation that the problem is that I use regexps to match domain names, but the failure of a match occurs at the end of the string. That means that the FSA has to scan the entire string before it gets to decide that it isn’t a match. That’s expensive.
I ran some tests on tweaks to try and fix this. Without any changes, scanning 1,000 semi-random domain names took 6.941666 seconds. I then tweaked the implementation to reverse the strings it was scanning, and that more than halved the run time of the test, to 3.212203 seconds. That’s a big improvement, but still way too slow. The next thing I tried was adding buckets of rules on top of those reverse matches. In other words, the code now assumes that anything after the last dot is some form of TLD approximation, and only executes rules which also have that string after the last dot. This was a massive improvement, with 1,000 domains taking only 0.026120 seconds.
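To make the two tweaks concrete, here is a minimal sketch of the idea, not the actual code in etld.py. It assumes a tiny hand-written rule list (SAMPLE_RULES) in place of the real Mozilla rules, and the names build_buckets and split_host are made up for illustration. Each rule is reversed so that a mismatch shows up at the start of the string, and rules are grouped by whatever follows their last dot so that only a handful are ever tried for a given host.

```python
import re
from collections import defaultdict

# Hypothetical sample rules; the real list comes from the Mozilla rule file.
SAMPLE_RULES = ["com", "au", "com.au", "edu.au", "uk", "co.uk"]


def build_buckets(rules):
    """Group reversed-rule regexps by the text after the rule's last dot."""
    buckets = defaultdict(list)
    for rule in rules:
        last_label = rule.rsplit(".", 1)[-1]
        # Reversing the rule means a non-matching host fails on the
        # first few characters instead of at the end of the string.
        pattern = re.compile(re.escape(rule[::-1]) + r"\.([^.]+)")
        buckets[last_label].append((rule, pattern))
    for entries in buckets.values():
        # Try rules with more labels first so "com.au" wins over "au".
        entries.sort(key=lambda entry: entry[0].count("."), reverse=True)
    return buckets


def split_host(host, buckets):
    """Return (registered domain, effective TLD), or None if nothing matches."""
    host = host.lower().rstrip(".")
    last_label = host.rsplit(".", 1)[-1]
    reversed_host = host[::-1]
    # Only rules sharing the host's final label are ever executed.
    for rule, pattern in buckets.get(last_label, []):
        if rule == host:
            # The host is itself an effective TLD, so there is no domain part.
            return None
        match = pattern.match(reversed_host)
        if match:
            label = match.group(1)[::-1]
            return label + "." + rule, rule
    return None


buckets = build_buckets(SAMPLE_RULES)
print(split_host("www.example.com.au", buckets))
print(split_host("example.com", buckets))
```

With this layout a host ending in .com never touches the .au or .uk rules at all, which is where most of the saving comes from in the sketch.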
I’ve updated the code at http://www.stillhq.com/python/etld/etld.py.
I had a need recently for a library which would take a host name and return the domain-specific portion of the name, and the effective TLD being used. “Effective TLD” is a term coined by the Mozilla project for something which acts like a TLD. For example, .com is a TLD and has domains allocated under it. However, .au is a TLD with no domains directly under it. The effective TLDs for the .au domain are things like .com.au and .edu.au. Whilst there are libraries for other languages, I couldn’t find anything for Python.
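As a rough illustration of what “effective TLD” means in practice, the sketch below splits a host name using a three-entry sample set. It is not the etld.py interface: the function name split_hostname and the SAMPLE_ETLDS set are assumptions made up for this example, and it uses plain set lookups rather than the regexps the library relies on.

```python
# Hypothetical sample data; the real rules come from the Mozilla list.
# Note that "au" is deliberately absent, matching the description above.
SAMPLE_ETLDS = {"com", "com.au", "edu.au"}


def split_hostname(host, etlds=SAMPLE_ETLDS):
    """Return (registered domain, effective TLD), or None if no rule applies."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest, so that
    # "com.au" is found before a shorter suffix could be considered.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in etlds:
            # The domain is the effective TLD plus one label to its left.
            return ".".join(labels[i - 1:]), candidate
    return None


print(split_hostname("www.example.com.au"))
print(split_hostname("www.example.com"))
print(split_hostname("example.au"))
```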
I therefore wrote one. It’s very simple, and not optimal. For example, I could do most of the processing with a single regexp if Python supported more than 100 match groups in a regexp, but it doesn’t. I’m sure I’ll end up revisiting this code sometime in the future. Additionally, the code ended up being much easier to write than I expected, mainly because the Mozilla project has gone to the trouble of building a list of rules to determine the effective TLD of a host name. This is awesome, because it saved me heaps and heaps of work.
The code is at http://www.stillhq.com/python/etld/etld.py if you’re interested.