python domain name split name and extension

Wow, there are a lot of bad answers here. You can only do this reliably if you know what's on the public suffix list. If you are using split, a regex, or anything else that doesn't consult that list, you're doing this wrong.

Luckily, this is Python, and there's a library for this: https://pypi.python.org/pypi/tldextract

From their readme:

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

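ExtractResult is a namedtuple in the version that readme describes, so the pieces are available by name through the .subdomain, .domain and .suffix attributes. Makes it pretty easy.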

The advantage of using a library like this is that they will keep up with the additions to the public suffix list so you don't have to.


Depending on your application, be a little wary of simply taking the part following the last '.'. That works fine for .com, .net, .org, etc., but will likely fall over for many Country Code TLDs, e.g. bit.ly or google.co.uk.

(By which I mean 'bit.ly' would probably prefer to be identified including the .ly TLD, whereas Google probably doesn't want to be identified with a spurious .co remainder. Whether that's important will obviously depend on what you're doing.)
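
To make that concrete, here is a minimal sketch of the "take the part after the last dot" approach. The helper name naive_split is made up for this illustration; it only shows where the naive split goes wrong, not how to fix it.

```python
def naive_split(host):
    """Split a host name at its last dot only (deliberately naive)."""
    name, _, ext = host.rpartition(".")
    return name, ext


for host in ("example.com", "bit.ly", "google.co.uk"):
    print(host, "->", naive_split(host))

# example.com -> ('example', 'com')
# bit.ly -> ('bit', 'ly')
# google.co.uk -> ('google.co', 'uk')    <- the spurious '.co' remainder
```

The first result is what you'd expect, the second drops the .ly that is arguably part of the brand, and the third leaves 'co' glued to the name.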

In those complicated cases ... well, you've got your work cut out I suspect!

A robust answer will probably depend on how you're gathering / storing your domains and what you really want back as the 'name'.

For example, if you've got a set of domain names, with no subdomain information, then you could do the opposite of what's suggested above and simply take the first part off:

>>> "stackoverflow.com".split('.')[0]
'stackoverflow'
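
A quick sketch of where that shortcut holds and where it stops holding. The sample list is invented for illustration, and the last entry deliberately breaks the "no subdomain information" assumption.

```python
# Sample input, made up for this illustration.
domains = ["stackoverflow.com", "google.co.uk", "www.stackoverflow.com"]

# Take the first label of each name.
first_labels = [d.split(".")[0] for d in domains]
print(first_labels)  # ['stackoverflow', 'google', 'www']
```

The first two come out right, but as soon as a subdomain sneaks in you get 'www' back instead of the name.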

In general, it's not easy to work out where the user-registered bit ends and the registry bit begins. For example: a.com, b.co.uk, c.us, d.ca.us, e.uk.com, f.pvt.k12.wy.us...

The nice people at Mozilla have a project dedicated to listing domain suffixes under which the public can register domains: http://publicsuffix.org/
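
Roughly, a suffix list gets used by finding the longest suffix of the host name that appears in the list. The sketch below assumes a tiny hand-written set of suffixes, not the real list, and the names SUFFIXES and split_domain are hypothetical. It also ignores the wildcard and exception rules the real list contains, so treat it as an illustration of the idea rather than something to deploy.

```python
# Hand-picked entries for illustration only; NOT the real public suffix list.
SUFFIXES = {"com", "uk", "co.uk", "us", "ca.us", "uk.com", "ly"}


def split_domain(host):
    """Return (subdomain, name, suffix) using the longest known suffix."""
    labels = host.lower().strip(".").split(".")
    # Try the longest candidate suffix first, always leaving
    # at least one label over to act as the name.
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in SUFFIXES:
            return ".".join(labels[:i - 1]), labels[i - 1], suffix
    # Nothing matched: treat the last label as the suffix, if there is one.
    if len(labels) >= 2:
        return ".".join(labels[:-2]), labels[-2], labels[-1]
    return "", labels[0], ""


print(split_domain("forums.news.cnn.com"))  # ('forums.news', 'cnn', 'com')
print(split_domain("google.co.uk"))         # ('', 'google', 'co.uk')
print(split_domain("e.uk.com"))             # ('', 'e', 'uk.com')
print(split_domain("www.bit.ly"))           # ('www', 'bit', 'ly')
```

Checking the longest candidate first is what lets e.uk.com come back as 'e' under 'uk.com' rather than 'uk' under 'com'.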

Tags: Python, String