Python effective TLD library update

The effective TLD library is now being used for a couple of projects of mine, but I’ve had some trouble with it being almost unusably slow. I woke up this morning with the revelation that the problem is that I use regexps to match domain names, and a failed match is only detected at the end of the string. That means that the FSA has to scan the entire string before it can decide that it isn’t a match. That’s expensive.

I ran some tests on tweaks to try to fix this. Without any changes, scanning 1,000 semi-random domain names took 6.941666 seconds. I then tweaked the implementation to reverse the strings it was scanning, and that more than halved the run time of the test, to 3.212203 seconds. That’s a big improvement, but still way too slow. The next thing I tried was adding buckets of rules on top of those reverse matches. In other words, the code now assumes that anything after the last dot is some form of TLD approximation, and only executes rules which also have that string after the last dot. This was a massive improvement, with 1,000 domains taking only 0.026120 seconds.
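
The two tweaks can be sketched roughly as follows. This is a minimal illustration and not the code in etld.py: the rule list, `build_buckets` and `effective_tld` are hypothetical names made up for the example, and wildcard and exception rules are ignored.

```python
import re
from collections import defaultdict

# Hypothetical sample rules; a real effective TLD list is far longer.
RULES = ["com", "au", "com.au", "uk", "co.uk"]


def build_buckets(rules):
    """Group compiled rules by whatever follows their last dot."""
    buckets = defaultdict(list)
    for rule in rules:
        last_label = rule.rsplit(".", 1)[-1]
        # Reverse the rule so it is matched at the start of the reversed
        # domain; a mismatch is then detected after a few characters.
        pattern = re.compile(re.escape(rule[::-1]) + r"(\.|$)")
        buckets[last_label].append((rule, pattern))
    return buckets


def effective_tld(domain, buckets):
    """Return the longest matching rule for domain, or None."""
    last_label = domain.rsplit(".", 1)[-1]
    reversed_domain = domain[::-1]
    best = None
    # Only rules in the bucket for this last label are ever executed.
    for rule, pattern in buckets.get(last_label, []):
        if pattern.match(reversed_domain):
            if best is None or len(rule) > len(best):
                best = rule
    return best


buckets = build_buckets(RULES)
print(effective_tld("www.example.com.au", buckets))
```

With this layout a lookup for a `.com.au` name only ever touches the rules ending in `au`, rather than every rule in the list, which is where most of the saving would come from.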

I’ve updated the code at http://www.stillhq.com/python/etld/etld.py.

Is there any way to access the match text in MySQL rlike selects?

Hi. I am doing a select like this in MySQL 5:

    select * from foo where bar rlike '(.*),(.*)';
    

The specific example here is made up. Anyway, I’d like to be able to get at the text matched from bar, like I can with various languages’ regexp libraries. Is this functionality exposed at all in MySQL? I’ve looked at the docs and can’t see any indication that it is, so this might just be wishful thinking.
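
For comparison, this is the sort of access to the matched text meant here, sketched with Python’s `re` module and the same made-up pattern. The sample value is invented for the example, and `split_on_comma` is a hypothetical helper name.

```python
import re

# Same made-up pattern as the select above.
PATTERN = re.compile(r"(.*),(.*)")


def split_on_comma(value):
    """Return the two captured groups, or None if there is no match."""
    match = PATTERN.search(value)
    if match is None:
        return None
    # group(1) and group(2) hold the text matched by each parenthesised part.
    return match.group(1), match.group(2)


print(split_on_comma("left,right"))
```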
