What is an RFC-compliant, working regular expression to check if a string is a valid URL?

Well, if you look at the specification, a URL is broken down into "chunks". I'd suggest building the regex the same way, so that it's easier to read, maintain and understand. The parts of the regex are (optional ones are italicized):

  1. Scheme
  2. *Username/Password*
  3. Domain Or IP Address
  4. *Port*
  5. *Path*
  6. *Query*
  7. *Anchor*

So, we need to build a regex sub-part for each.

  1. Scheme:

    $scheme = "[a-z][a-z0-9+.-]*";
    
  2. Username/Password:

    // One or more characters for the username, optionally followed by :password
    $username = "([^:@/]+(:[^:@/]+)?@)?";
    
  3. Domain or IP Address:

    Now, we need to build up the 3 possible hosts:

    1. Domain Name
    2. IPv4
    3. IPv6

    Domain Name:

    // A label starts with a letter and may not end with a hyphen; single-letter labels are allowed
    $segment = "([a-z]([a-z0-9-]*[a-z0-9])?)";
    $domain = "({$segment}\.)*{$segment}";
    

    IPv4:

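    // One octet, 0-255 (the original [0|1] class also matched a literal | and forced three digits)
    $segment = "(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])";
    $ipv4 = "({$segment}\.{$segment}\.{$segment}\.{$segment})";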
    

    IPv6:

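    $block = "([a-f0-9]{0,4})";
    // Loose match: 3 to 8 colon-separated blocks, with no trailing colon required
    $rawIpv6 = "(({$block}:){2,7}{$block})";
    $ipv4sub = "(::ffff:{$ipv4})";
    // The literal brackets must be escaped, otherwise [ starts a character class
    $ipv6 = "(\[({$rawIpv6}|{$ipv4sub})\])";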
    

    Finally:

    $host = "($domain|$ipv4|$ipv6)";
    
  4. Port:

    $port = "(:[\d]{1,5})?";
    
  5. Path:

    $path = "([^?;\#]*)?";
    
  6. Query:

    $query = "(\?[^\#;]*)?";
    
  7. Anchor:

    $anchor = "(\#.*)?";
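
Since each sub-part is a separate string, it can be checked on its own before everything is glued together. Here is a minimal sketch of that idea. `matchesWhole()` is a hypothetical helper of mine, and the `$octet`, `$ipv4` and `$port` fragments are assumptions restated with non-capturing groups rather than the exact strings above; only `$scheme` is taken verbatim.

```php
<?php
// Hypothetical helper: anchors a fragment and tests it against the whole input.
function matchesWhole(string $fragment, string $input): bool
{
    return preg_match('#^(?:' . $fragment . ')$#i', $input) === 1;
}

// Scheme fragment, taken verbatim from the list above.
$scheme = '[a-z][a-z0-9+.-]*';

// Assumed fragments: one octet in the range 0-255, four of them, and an optional port.
$octet = '(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])';
$ipv4  = "{$octet}(?:\\.{$octet}){3}";
$port  = '(?::\d{1,5})?';

var_dump(matchesWhole($scheme, 'svn+ssh'));    // true
var_dump(matchesWhole($scheme, '1http'));      // false: must start with a letter
var_dump(matchesWhole($ipv4, '192.168.0.1'));  // true
var_dump(matchesWhole($ipv4, '256.1.1.1'));    // false: 256 is out of range
var_dump(matchesWhole($port, ':8080'));        // true
var_dump(matchesWhole($port, ':123456'));      // false: more than five digits
```

Testing fragments like this makes it much easier to see which part is to blame when the combined regex rejects a URL you expected to pass.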
    

And the final regex:

    $regex = "#^{$scheme}://{$username}{$host}{$port}(/{$path}{$query}{$anchor}|)$#i";

Note that the leading / is part of the final regex rather than of the path sub-part, since the path can be empty.
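
To see the whole thing run end to end, here is a self-contained sketch that assembles the parts into one pattern. The function names `buildUrlRegex()` and `isValidUrl()` are hypothetical, and the fragments are my restatement under a few assumptions: non-capturing groups, one or more characters for username and password, an octet range of 0-255, and a deliberately loose IPv6 match. Treat it as a starting point, not a verified RFC 3986 implementation.

```php
<?php
// Hypothetical helper: builds the combined pattern from its sub-parts.
function buildUrlRegex(): string
{
    $scheme   = '[a-z][a-z0-9+.-]*';
    $userinfo = '(?:[^:@/]+(?::[^:@/]+)?@)?';   // assumption: non-empty user and password

    // Domain name: dot-separated labels that start with a letter.
    $label  = '[a-z](?:[a-z0-9-]*[a-z0-9])?';
    $domain = "(?:{$label}\\.)*{$label}";

    // IPv4: four octets in the range 0-255.
    $octet = '(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])';
    $ipv4  = "{$octet}(?:\\.{$octet}){3}";

    // IPv6 in literal brackets: a loose match, not a full grammar.
    $block = '[a-f0-9]{0,4}';
    $ipv6  = "\\[(?:(?:{$block}:){2,7}{$block}|::ffff:{$ipv4})\\]";

    $host   = "(?:{$domain}|{$ipv4}|{$ipv6})";
    $port   = '(?::\d{1,5})?';
    $path   = '[^?;\#]*';
    $query  = '(?:\?[^\#;]*)?';
    $anchor = '(?:\#.*)?';

    // The slash lives here, not in $path, because the path may be empty.
    return "#^{$scheme}://{$userinfo}{$host}{$port}(?:/{$path}{$query}{$anchor})?\$#i";
}

// Hypothetical wrapper around preg_match().
function isValidUrl(string $url): bool
{
    return preg_match(buildUrlRegex(), $url) === 1;
}

var_dump(isValidUrl('https://user:pass@example.com:8080/path/to?x=1#top')); // true
var_dump(isValidUrl('http://[::1]:8080/'));                                 // true
var_dump(isValidUrl('http://256.1.1.1'));                                   // false
var_dump(isValidUrl('example.com'));                                        // false: no scheme
```

One consequence of keeping the slash in the final regex is that a query or anchor is only accepted after a `/`, so `http://example.com?x=1` is rejected by this sketch.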

Also note that I have not tested this. It should work, but each part needs to be checked against the URLs you expect to accept.

Finally, note that this is only one way of doing it. You could instead use a tool that doesn't rely on a regexp, or a library or framework, which will be easier to maintain in the long run.
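
As one example of the non-regexp route, PHP ships with `filter_var()` and `parse_url()`. The sketch below assumes the filter extension is available (it is enabled by default); `isUrlByFilter()` is a hypothetical wrapper name. As far as I know from the PHP manual, `FILTER_VALIDATE_URL` follows the older RFC 2396 and only accepts ASCII URLs, so it is not a drop-in equivalent of the regex above.

```php
<?php
// Hypothetical wrapper: filter_var() returns the URL on success and false on failure.
function isUrlByFilter(string $url): bool
{
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(isUrlByFilter('http://example.com')); // true
var_dump(isUrlByFilter('not a url'));          // false
var_dump(isUrlByFilter('example.com'));        // false: a scheme is required

// parse_url() does not validate, but it splits a URL into the same parts listed above.
$parts = parse_url('https://user:pass@example.com:8080/path?x=1#top');
echo $parts['scheme'], ' ', $parts['host'], ' ', $parts['port'], "\n";
```

A common combination is to validate with `filter_var()` first and then inspect individual components with `parse_url()`.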

Best of luck


After reading RFC 3986, I have to say I was wrong. As far as I know, that regexp is fully working. My first mistake was the syntax of IPv6 addresses: they are enclosed in []. The second was about example.org: (note the trailing colon). As the RFC says, a scheme can contain dots, so that is valid as well.

So that's a valid, RFC-based way to do it, but people will usually (as I will) need to modify it to accept only certain schemes.
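
One way to restrict the accepted schemes without touching the regex is a separate allow-list check. This is only a sketch: `hasAllowedScheme()` and its default list are hypothetical, and it relies on `parse_url()` to extract the scheme.

```php
<?php
// Hypothetical helper: accepts a URL only if its scheme is in the allow-list.
function hasAllowedScheme(string $url, array $allowed = ['http', 'https']): bool
{
    // parse_url() returns null when the component is missing and false on badly malformed input.
    $scheme = parse_url($url, PHP_URL_SCHEME);

    // Schemes are case-insensitive, so compare in lower case.
    return is_string($scheme) && in_array(strtolower($scheme), $allowed, true);
}

var_dump(hasAllowedScheme('HTTP://example.com'));                      // true
var_dump(hasAllowedScheme('ftp://example.com/file'));                  // false
var_dump(hasAllowedScheme('ftp://example.com/file', ['ftp', 'sftp'])); // true
var_dump(hasAllowedScheme('example.com'));                             // false: no scheme
```

Alternatively, the scheme sub-part of the regex itself can be replaced by a fixed alternation such as `(https?|ftp)`.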
