How can I check if a URL exists with Django’s validators?

Edit: Please note, this is no longer valid for Django 1.5 or later, where the verify_exists argument was removed.

I assume you want to check whether the file behind the URL actually exists, not merely whether an object is set (which is just a simple if statement).

First, I recommend always looking through Django's source code, because you will find some great code that you can reuse :)

I assume you want to do this within a template. There is no built-in template tag to validate a URL, but you could use the URLValidator class inside a custom template tag to test it. Simply:

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

validate = URLValidator(verify_exists=True)
try:
    validate('http://www.somelink.com/to/my.pdf')
except ValidationError as e:
    print(e)

The URLValidator class raises a ValidationError when it can't open the link. It uses urllib2 to actually make the request, so it is not just doing basic regex checking (although it does that too).

You can put this into a custom template tag (the Django docs explain how to create one), and off you go.

Hope that is a start for you.


Problem

URLValidator (from django.core.validators) says that www.google.ro is invalid. In my view that is wrong, or at least not good enough.

How to solve it?

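The clue is to look at the source code for models.URLField: you will see that it relies on forms.URLField as its form field, which does more than the URLValidator above. In particular, the form field adds a default scheme when none is given before it validates the value.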

Solution

If I want to validate a URL like http://www.google.com or like www.google.ro, I would do the following:

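from django.core.exceptions import ValidationError
from django.forms import URLField

def validate_url(url):
    # The form field normalizes the value (e.g. adds a missing scheme) before validating it
    url_form_field = URLField()
    try:
        url_form_field.clean(url)
    except ValidationError:
        return False
    return True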

I found this useful. Maybe it helps someone else.
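
To illustrate why the form field accepts www.google.ro while the bare validator rejects it, here is a rough, standard-library-only approximation of the normalization step. This is a sketch, not Django's actual code: the helper name add_default_scheme and the choice of "http" as the default scheme are my own assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def add_default_scheme(value, default_scheme="http"):
    """Return value with a scheme prepended when none was given (hypothetical helper)."""
    parts = list(urlsplit(value.strip()))
    if not parts[0]:
        # No scheme was given, so assume the default one.
        parts[0] = default_scheme
    if not parts[1]:
        # "www.google.ro" is parsed as a path, so move it into the host slot.
        parts[1], parts[2] = parts[2], ""
        # Re-split so that any path following the host is separated out again.
        parts = list(urlsplit(urlunsplit(parts)))
    return urlunsplit(parts)


print(add_default_scheme("www.google.ro"))
print(add_default_scheme("http://www.google.com"))
```

Once the value carries a scheme, a strict validator has a complete URL to check, which is why the two approaches can disagree on the same input.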


Anything based on the verify_exists parameter to django.core.validators.URLValidator will stop working with Django 1.5. The documentation helpfully says nothing about this, but the source code reveals that using that mechanism in 1.4 (the latest stable version) leads to a DeprecationWarning, and you'll see it has been removed completely in the development version:

if self.verify_exists:
    import warnings
    warnings.warn(
        "The URLField verify_exists argument has intractable security "
        "and performance issues. Accordingly, it has been deprecated.",
        DeprecationWarning
        )

There are also some odd quirks with this method related to the fact that it uses a HEAD request to check URLs — bandwidth-efficient, sure, but some sites (like Amazon) respond with an error (to HEAD, where the equivalent GET would have been fine), and this leads to false negative results from the validator.
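
If you do write your own existence check, one way around the HEAD quirk is to fall back to GET when HEAD is answered with an error. The following is only a sketch using Python 3's urllib.request (the successor to urllib2); the function name url_exists and the 5-second default timeout are my own choices, not anything Django provides.

```python
import urllib.error
import urllib.request


def url_exists(url, timeout=5):
    """Return True if url answers HEAD or, failing that, GET without an HTTP error."""
    for method in ("HEAD", "GET"):
        try:
            request = urllib.request.Request(url, method=method)
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.status < 400
        except urllib.error.HTTPError:
            # The server answered with an error status; try the next method.
            continue
        except (urllib.error.URLError, OSError, ValueError):
            # DNS failure, refused connection, timeout or malformed URL.
            return False
    return False
```

The explicit timeout matters just as much as the fallback: without it, a slow DNS lookup or an unresponsive server can hold up the caller for a long time.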

I would also (a lot has changed in two years) recommend against doing anything with urllib2 in a template — this is completely the wrong part of the request/response cycle to be triggering potentially long-running operations: consider what happens if the URL does exist, but a DNS problem causes urllib2 to take 10 seconds to work that out. BAM! Instant 10 extra seconds on your page load.

I would say the current best practice for making possibly-long-running tasks like this asynchronous (and thus not blocking page load) is using django-celery; there's a basic tutorial which covers using pycurl to check a website, or you could look into how Simon Willison implemented celery tasks (slides 32-41) for a similar purpose on Lanyrd.
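
To show the general shape of "check in the background, read the result later", here is a minimal sketch that uses a standard-library thread pool as a stand-in. It is not Celery and not a replacement for a real task queue; the names results, executor and schedule_check are hypothetical, and a real project would store results in the database or a cache rather than a module-level dict.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory result store, used here only for illustration.
results = {}
executor = ThreadPoolExecutor(max_workers=4)


def schedule_check(url, check):
    """Queue check(url) to run in the background and return immediately."""

    def store(future):
        # Treat any exception raised by the check as "does not exist".
        results[url] = future.exception() is None and bool(future.result())

    future = executor.submit(check, url)
    future.add_done_callback(store)
    return future
```

The view that triggers the check returns straight away, and the template only ever reads the stored result, so a slow remote server never adds to the page load.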