Using canonical links to keep a site out of Google search results

Use noindex to keep pages out of Google’s index

The only correct way to keep results out of Google’s index is to use noindex.

At the risk of being pedantic, Google’s (or any search engine’s) search results are composed of items that have been indexed. Googlebot honors a couple of ways to instruct it to omit a page from its index. If you don’t use these methods, don’t be surprised if your page ends up in the search results.

So the short answer is yes, use noindex to keep things out of the index. Or better yet, use the X-Robots-Tag HTTP header (see below).
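
To make the two forms of the instruction concrete, here is a minimal Python sketch that checks whether a response carries noindex, either in an X-Robots-Tag header or in a robots meta tag. The names `RobotsMetaParser` and `has_noindex` are made up for illustration, and the check is deliberately simplified: it only handles plain comma-separated directives, not user-agent-prefixed header values or the `none` shorthand.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers, html):
    """Return True if the X-Robots-Tag header or a robots meta tag says noindex.

    headers: dict of response header names to values (hypothetical input shape).
    html: the response body as a string.
    """
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            if "noindex" in [d.strip().lower() for d in value.split(",")]:
                return True
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

For example, `has_noindex({"X-Robots-Tag": "noindex, nofollow"}, "")` and `has_noindex({}, '<meta name="robots" content="noindex">')` both report that the page asks to be left out of the index.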

Don’t use robots.txt for this

robots.txt prevents pages from being spidered, which is a related but distinct concept from indexing. Many non-spidered pages that have strong backlinks can and do rank well in the Google search results.

You may have seen some; they look like the example at the bottom of this Moz.com article.

Google explains:

robots.txt Disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant. If you wish to explicitly block a page from being indexed, you should instead use the noindex robots meta tag or X-Robots-Tag HTTP header. In this case, you should not disallow the page in robots.txt, because the page must be crawled in order for the tag to be seen and obeyed.
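
The last point in that quote is easy to get wrong, so here is a small sketch using Python’s standard `urllib.robotparser` to check whether Googlebot is allowed to crawl a URL at all. If it is not, any noindex on that page will never be seen. The helper name `googlebot_can_crawl` and the `dev.example.com` URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def googlebot_can_crawl(robots_txt, url):
    """Return True if the given robots.txt text lets Googlebot fetch the URL.

    A page that is disallowed here cannot be crawled, so a noindex
    meta tag or X-Robots-Tag header on it would go unread.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


# A blanket Disallow hides the page (and its noindex) from the crawler.
blocked = googlebot_can_crawl(
    "User-agent: *\nDisallow: /\n",
    "https://dev.example.com/page.html",
)

# Without a matching Disallow, the page can be crawled and noindex obeyed.
crawlable = googlebot_can_crawl(
    "User-agent: *\nDisallow: /private/\n",
    "https://dev.example.com/page.html",
)
```

Here `blocked` comes out false and `crawlable` true, which is the situation you want for a page that relies on noindex.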

Canonical URLs don’t exclude anything from Google’s index

Canonical URLs tell Google that the referring and referred pages represent the same content, for “consolidating link signals for the duplicate or similar content”; that is, they help with SEO.

But to really move traffic away from one particular page, Google suggests:

It's a good idea to pick one of those URLs as your preferred (canonical) destination, and use 301 redirects to send traffic from the other URLs to your preferred URL. A server-side 301 redirect is the best way to ensure that users and search engines are directed to the correct page. The 301 status code means that a page has permanently moved to a new location.

But this 301 solution won’t help you, because you need users to be able to see the dev site.

A note on canonical and alternative URLs

Note that it is perfectly reasonable for Google to send traffic to non-canonical URLs: different presentations of the same content can be appropriate in different contexts. Consider content you share at both your regular “www.” site and a mobile “m.” site that is highly optimized for phones. As another example, Google might present a non-canonical PDF version if the user included “PDF” in their search phrase.

But why does Google like your “dev.” site anyway?

Google’s algorithm doesn’t care that your dev site might have unapproved content, and your users probably don’t either. (It also doesn’t much care how you or your bosses feel about this.)

Here are a few things Google does care about:

  1. Google rewards freshness of content. If your dev site changes much more often (it does, doesn’t it?), that may be a positive SEO signal.

  2. People on the web might have discovered your dev site and be linking to it for one reason or another.

  3. If your dev site has significant technical upgrades, or gets less traffic than your production site, it might be faster — and Google rewards speed.

Why an HTTP header solution would be better for you than a meta tag

If you use the X-Robots-Tag HTTP header to return the noindex instruction, it can be configured on the web server rather than in your HTML files or other content files. So you won’t need to change anything when you promote the files to your production site.
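
In practice you would probably set this in your web server’s configuration. To illustrate the idea in code, here is a sketch of a Python WSGI wrapper that adds the header based on the requested hostname, leaving the content files untouched. The function name `noindex_on_dev_hosts` and the assumption that dev hostnames start with `dev.` are mine, not something your setup necessarily matches.

```python
def noindex_on_dev_hosts(app, dev_prefix="dev."):
    """Wrap a WSGI app so responses served from a dev host carry X-Robots-Tag: noindex.

    dev_prefix is an assumed naming convention (e.g. dev.example.com).
    """

    def wrapped(environ, start_response):
        host = environ.get("HTTP_HOST", "").lower()

        def patched_start_response(status, headers, exc_info=None):
            # Only the dev host gets the header; production responses pass through unchanged.
            if host.startswith(dev_prefix):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped
```

Because the decision depends only on the hostname of the request, the same files can be promoted to production as-is, and only the dev copy is marked noindex.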