Fully qualified domain name validation

It's harder nowadays, with internationalized domain names and well over a thousand (!) new TLDs.

The easy part is that you can still split the components on ".".

You need a list of registerable TLDs. There's a site for that:

https://publicsuffix.org/list/effective_tld_names.dat

You only need to check the ICANN-recognized ones. Note that a registerable TLD can have more than one component, such as "co.uk".
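
As a rough illustration of that lookup, the sketch below parses text in the list's format, keeps only the section between the ICANN markers, and returns the longest listed suffix a name ends with. The `SAMPLE` string is a tiny hand-written excerpt standing in for the real file, the function names are my own, and wildcard (`*.`) and exception (`!`) rules are deliberately skipped, so treat it as a starting point only.

```python
# Hand-written stand-in for the downloaded list (assumption: same layout).
SAMPLE = """\
// ===BEGIN ICANN DOMAINS===
com
uk
co.uk
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
github.io
// ===END PRIVATE DOMAINS===
"""


def load_icann_suffixes(text):
    """Collect plain suffix rules from the ICANN section only."""
    suffixes = set()
    in_icann = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("// ===BEGIN ICANN DOMAINS==="):
            in_icann = True
        elif line.startswith("// ===END ICANN DOMAINS==="):
            in_icann = False
        elif in_icann and line and not line.startswith("//"):
            # Simplification: wildcard (*.) and exception (!) rules are ignored.
            if not line.startswith(("*", "!")):
                suffixes.add(line.lower())
    return suffixes


def longest_suffix(domain, suffixes):
    """Return the longest listed suffix that the domain ends with, or None."""
    labels = domain.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return candidate
    return None


suffixes = load_icann_suffixes(SAMPLE)
print(longest_suffix("www.example.co.uk", suffixes))  # co.uk
print(longest_suffix("example.github.io", suffixes))  # None (private section)
```

Trying the longest candidate first is what makes "co.uk" win over "uk".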

Then there's IDN and punycode. Domains are Unicode now. For example, "xn--nnx388a" is equivalent to "臺灣". Both of those are valid TLDs, incidentally.

For punycode conversion code, see "http://golang.org/src/pkg/net/http/cookiejar/punycode.go".
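
For a quick experiment without porting that Go code, Python's standard library ships `punycode` and `idna` codecs. This is only a sketch: the built-in `idna` codec implements the older IDNA 2003 rules rather than RFC 5890, and the label `bücher` is just an illustrative example of mine.

```python
# Convert one Unicode label to its ASCII form and back.
label = "bücher"

# The "punycode" codec yields the bare encoding, without the "xn--" prefix.
bare = label.encode("punycode").decode("ascii")
print(bare)  # bcher-kva

# The "idna" codec works on whole names and adds "xn--" per label as needed.
ascii_name = "bücher.example".encode("idna").decode("ascii")
print(ascii_name)  # xn--bcher-kva.example

# Decoding reverses the conversion.
round_trip = ascii_name.encode("ascii").decode("idna")
print(round_trip)  # bücher.example
```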

Checking the syntax of each domain component has new rules, too. See RFC5890 at https://www.rfc-editor.org/rfc/rfc5890

Components can be either ASCII labels or Unicode labels (U-labels). ASCII labels either follow the old letter-digit-hyphen syntax, or begin "xn--", in which case they are A-labels: the punycode version of a Unicode string.

The rules for Unicode are very complex, and are given in RFC5890 and its companion documents (RFC 5891 through RFC 5893). The rules are designed to prevent such things as mixing characters from left-to-right and right-to-left sets.
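
A first-pass triage of a single component might look like the sketch below. The function name and the category strings are my own, and Unicode labels are only flagged, not validated: the real checks for them are far more involved than anything shown here.

```python
import re

# Old-style syntax: letters, digits, hyphens; 1 to 63 chars; no edge hyphens.
LDH = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def classify_label(label):
    """Roughly sort one domain component into a category (sketch only)."""
    if not label.isascii():
        # Unicode label: would need the full IDNA rules, which are not applied.
        return "unicode"
    if not LDH.fullmatch(label):
        return "invalid"
    if label.lower().startswith("xn--"):
        try:
            decoded = label[4:].encode("ascii").decode("punycode")
        except UnicodeError:
            return "invalid"
        # Only checks that the punycode decodes, not that the result is allowed.
        return "a-label" if decoded else "invalid"
    return "ldh"


for sample in ["example", "xn--bcher-kva", "bücher", "-bad", "my_host"]:
    print(sample, classify_label(sample))
```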

Sorry there's no easy answer.


This regex is what you want:

(?=^.{1,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)

It matches your example domains (groupa-zone1appserver.example.com, cod.eu, etc.).

I'll try to explain:

(?=^.{1,254}$) matches domain names (which can begin with any character) that are between 1 and 254 characters long; it could also be 5,254 if we assume co.uk is the minimum length.

(^ opens the capturing group and anchors the match at the start of the string

(?: opens a non-capturing group

(?!\d+\.) a label must not consist only of digits, so 1234.co.uk and abc.123.uk are rejected, while 1a.ko.uk is accepted.

[a-zA-Z0-9_\-] each label may contain only the characters a-zA-Z0-9_-

{1,63} each label is at most 63 characters long (it could be 2,63)

+ repeats the preceding group one or more times, and

(?:[a-zA-Z]{2,})$) the final part of the domain name must not be followed by anything else and must consist of at least 2 characters from a-zA-Z
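
To see how this pattern behaves in practice, here is a minimal Python sketch that runs it over the names mentioned in this answer, plus two extra samples of my own (`my_host.com` and `-a.com`). It assumes Python's `re` engine treats the pattern the same way as the engine the answer was written for.

```python
import re

PATTERN = re.compile(
    r"(?=^.{1,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)"
)

# name -> whether the pattern is expected to accept it
SAMPLES = {
    "groupa-zone1appserver.example.com": True,
    "cod.eu": True,
    "1a.ko.uk": True,
    "1234.co.uk": False,   # first label is all digits
    "abc.123.uk": False,   # middle label is all digits
    "my_host.com": True,   # underscores are in the character class
    "-a.com": True,        # leading hyphens are not excluded
}

results = {name: PATTERN.match(name) is not None for name in SAMPLES}
for name, accepted in results.items():
    print(f"{name}: {'match' if accepted else 'no match'}")
```

Note that the last two samples are accepted: the character class allows underscores, and nothing stops a label from starting with a hyphen.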


We use this regex to validate domains which occur in the wild. It covers all practical use cases I know of. New ones are welcome. According to our guidelines it avoids capturing groups and greedy matching.

^(?!.*?_.*?)(?!(?:[\w]+?\.)?\-[\w\.\-]*?)(?![\w]+?\-\.(?:[\w\.\-]+?))(?=[\w])(?=[\w\.\-]*?\.+[\w\.\-]*?)(?![\w\.\-]{254})(?!(?:\.?[\w\-\.]*?[\w\-]{64,}\.)+?)[\w\.\-]+?(?<![\w\-\.]*?\.[\d]+?)(?<=[\w\-]{2,})(?<![\w\-]{25})$

Proof and explanation: https://regex101.com/r/FLA9Bv/40

There are two approaches to choose from when validating domains.

By-the-books FQDN matching (theoretical definition, rarely encountered in practice):

  • at most 253 characters long (as per RFC-1035/3.1, RFC-2181/11)
  • at most 63 characters long per label (as per RFC-1035/3.1, RFC-2181/11)
  • any characters are allowed (as per RFC-2181/11)
  • TLDs cannot be all-numeric (as per RFC-3696/2)
  • FQDNs can be written in a complete form, which includes the root zone (the trailing dot)

Practical / conservative FQDN matching (practical definition, expected and supported in practice):

  • by-the-books matching with the following exceptions/additions
  • valid characters: [a-zA-Z0-9.-]
  • labels cannot start or end with hyphens (as per RFC-952 and RFC-1123/2.1)
  • TLD min length is 2 characters, max length is 24 characters, as per currently existing records
  • don't match trailing dot

The regex above contains both by-the-books and practical rules.
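
The pattern relies on variable-length lookbehind, which not every engine accepts (Python's built-in `re` module, for one, rejects it). For such environments, the practical rules listed above can be checked procedurally. The function below is a sketch under my own naming, not a translation of the regex; requiring at least one dot is my reading of the pattern rather than one of the listed rules.

```python
import re

LABEL_CHARS = re.compile(r"[a-zA-Z0-9-]+")


def is_practical_fqdn(name):
    """Apply the practical/conservative rules from the list above."""
    # Overall length limit; the trailing (root) dot is not accepted.
    if len(name) > 253 or name.endswith("."):
        return False
    labels = name.split(".")
    # Assumption: at least two labels, i.e. the name contains a dot.
    if len(labels) < 2:
        return False
    for label in labels:
        if not 1 <= len(label) <= 63:
            return False
        if not LABEL_CHARS.fullmatch(label):
            return False
        if label.startswith("-") or label.endswith("-"):
            return False
    tld = labels[-1]
    # TLD: 2 to 24 characters and not all-numeric.
    return 2 <= len(tld) <= 24 and not tld.isdigit()


for sample in ["example.com", "example.com.", "my_host.com", "example.123"]:
    print(sample, is_practical_fqdn(sample))
```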


(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}$)

Regex is always going to be at best an approximation for things like this, and rules change over time. The above regex was written with the following in mind and is specific to hostnames:

Hostnames are composed of a series of labels concatenated with dots. Each label is 1 to 63 characters long, and may contain:

  • the ASCII letters a-z (in a case insensitive manner),
  • the digits 0-9,
  • and the hyphen ('-').

Additionally:

  • labels cannot start or end with hyphens (RFC 952)
  • labels can start with numbers (RFC 1123)
  • max length of an ASCII hostname including dots is 253 characters (not counting trailing dot) (http://blogs.msdn.com/b/oldnewthing/archive/2012/04/12/10292868.aspx)
  • underscores are not allowed in hostnames (but are allowed in other DNS types)

some assumptions:

  • TLD is at least 2 characters and only a-z
  • we want at least 1 level above TLD

results: valid / invalid

  • 911.gov - valid
  • 911 - invalid (no TLD)
  • a-.com - invalid
  • -a.com - invalid
  • a.com - valid
  • a.66 - invalid
  • my_host.com - invalid (underscore)
  • typical-hostname33.whatever.co.uk - valid
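
The results above can be reproduced with a short script. This sketch assumes Python's `re` module, which supports the fixed-width lookbehind the pattern uses; the `CASES` table simply restates the list.

```python
import re

HOSTNAME = re.compile(
    r"(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}$)"
)

# Expected outcomes copied from the results list above.
CASES = {
    "911.gov": True,
    "911": False,
    "a-.com": False,
    "-a.com": False,
    "a.com": True,
    "a.66": False,
    "my_host.com": False,
    "typical-hostname33.whatever.co.uk": True,
}

outcomes = {name: HOSTNAME.match(name) is not None for name in CASES}
for name, ok in outcomes.items():
    print(f"{name} - {'valid' if ok else 'invalid'}")
```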

EDIT: John Rix provided an alternative hack of the regex to make the specification of a TLD optional:

(?=^.{1,253}$)(^(((?!-)[a-zA-Z0-9-]{1,63}(?<!-))|((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63})$)
  • 911 - valid
  • 911.gov - valid

EDIT 2: someone asked for a version that works in JS. The reason it doesn't work there is that older JavaScript engines do not support regex lookbehind (it was only added in ES2018), specifically the (?<!-) construct, which specifies that the previous character cannot be a hyphen.

anyway, here it is rewritten without the lookbehind - a little uglier but not much

(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{0,62}[a-zA-Z0-9]\.)+[a-zA-Z]{2,63}$)

you could likewise make a similar replacement on John Rix's version.
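
As a sketch of what that replacement might look like, the snippet below builds both lookbehind-free patterns from one shared label fragment. The optional-TLD variant is my own adaptation, not something John Rix posted, and the checks run in Python only to exercise the patterns.

```python
import re

# One label without lookbehind: it must end in a letter or digit.
LABEL = r"(?!-)[a-zA-Z0-9-]{0,62}[a-zA-Z0-9]"

# Same pattern as the lookbehind-free version above, rebuilt from LABEL.
STRICT = re.compile(r"(?=^.{4,253}$)(^(" + LABEL + r"\.)+[a-zA-Z]{2,63}$)")

# Assumption: optional-TLD variant with the same substitution applied.
OPTIONAL_TLD = re.compile(
    r"(?=^.{1,253}$)(^((" + LABEL + r")|(" + LABEL + r"\.)+[a-zA-Z]{2,63})$)"
)

for name in ["911", "911.gov", "a-.com", "-a.com", "a-"]:
    strict = STRICT.match(name) is not None
    optional = OPTIONAL_TLD.match(name) is not None
    print(f"{name}: strict={strict} optional_tld={optional}")
```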

EDIT 3: if you want to allow trailing dots - which is technically allowed:

(?=^.{4,253}\.?$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}\.?$)

I wasn't familiar with trailing dot syntax till @ChaimKut pointed it out and I did some research:

  • http://dns-sd.org./TrailingDotsInDomainNames.html
  • https://jdebp.eu./FGA/web-fully-qualified-domain-name.html

Using trailing dots, however, seems to cause somewhat unpredictable results in the various tools I played with, so I would advise some caution.
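
Given that, one cautious option is to accept a single trailing dot on input but strip it before validating and storing the name. The helper below is a hypothetical sketch of that idea (the name `normalize_fqdn` is mine), reusing the hostname regex without the trailing-dot extension.

```python
import re

HOSTNAME = re.compile(
    r"(?=^.{4,253}$)(^((?!-)[a-zA-Z0-9-]{1,63}(?<!-)\.)+[a-zA-Z]{2,63}$)"
)


def normalize_fqdn(name):
    """Return the name without its root dot, or None if it is not valid."""
    # Remove at most one trailing dot; "example.com.." stays invalid.
    stripped = name[:-1] if name.endswith(".") else name
    return stripped if HOSTNAME.match(stripped) else None


print(normalize_fqdn("example.com."))   # example.com
print(normalize_fqdn("example.com"))    # example.com
print(normalize_fqdn("example.com.."))  # None
```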

Tags: Regex, Bash, Fqdn