Prevent my site from being copied

No, there's no way to do it. Short of imposing connection limits, there isn't even a way to make it particularly difficult. If a legitimate user can access your website, they can copy its contents, and if they can do it normally with a browser, then they can script it.

You might set up User-Agent restrictions, cookie validation, maximum connection limits, and many other techniques, but none of them will stop somebody determined to copy your website.
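
For illustration, here is a minimal sketch of such a User-Agent filter, assuming a hypothetical Express application and a made-up blocklist of scraper User-Agent substrings; it also shows why the technique is weak, since a scraper only has to send a browser-like User-Agent string to walk straight past it.

```typescript
// Minimal sketch, assuming a hypothetical Express app and blocklist.
import express from "express";

const app = express();

// Naive blocklist of User-Agent substrings commonly sent by off-the-shelf tools.
const blockedAgents = ["httrack", "wget", "curl", "python-requests"];

app.use((req, res, next) => {
  const ua = (req.get("User-Agent") ?? "").toLowerCase();
  if (blockedAgents.some((fragment) => ua.includes(fragment))) {
    // Refuse obviously automated clients...
    res.status(403).send("Forbidden");
    return;
  }
  // ...but any client that spoofs a browser User-Agent is waved straight through.
  next();
});

app.get("/", (_req, res) => res.send("Hello"));
app.listen(3000);
```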


Protect the part of the site you want to protect with a username and password. Then only assign a username and password to people who sign an NDA (or similar agreement) saying they won't extract or copy information from your site.
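
A minimal sketch of gating part of a site this way with HTTP Basic Auth, assuming a hypothetical Express application; the /private path and the hard-coded credentials are placeholders only, and in practice you would check hashed credentials and serve everything over HTTPS.

```typescript
// Minimal sketch, assuming a hypothetical Express app; credentials and path are placeholders.
import express from "express";
import { Buffer } from "node:buffer";

const app = express();

app.use("/private", (req, res, next) => {
  const header = req.get("Authorization") ?? "";
  const [scheme, encoded] = header.split(" ");
  if (scheme === "Basic" && encoded) {
    const [user, pass] = Buffer.from(encoded, "base64").toString("utf8").split(":");
    if (user === "ndauser" && pass === "s3cret") {
      return next(); // credentials accepted, hand the request on
    }
  }
  // Anyone without valid credentials is asked to authenticate.
  res.set("WWW-Authenticate", 'Basic realm="NDA-protected area"');
  res.status(401).send("Authentication required");
});

app.get("/private/report", (_req, res) => res.send("Protected content"));
app.listen(3000);
```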

Another trick is to make all your content load via AJAX, and make the AJAX data URL load from paths that change (such as ~/todays-date), keeping that in sync with JavaScript. Then even if someone were to download your content, the data would be out of date within 24 hours.
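
A minimal client-side sketch of that idea, assuming a hypothetical endpoint scheme /data/<YYYY-MM-DD>/articles.json that the server rotates daily; the path layout and the element id are illustrative assumptions, not a specific API.

```typescript
// Minimal sketch: fetch content from a date-based path the server rotates daily.
async function loadContent(): Promise<void> {
  // Derive today's path segment on the client, in sync with the server's rotation.
  const today = new Date().toISOString().slice(0, 10); // e.g. "2024-05-01"
  const response = await fetch(`/data/${today}/articles.json`);
  if (!response.ok) {
    throw new Error(`Content endpoint responded with ${response.status}`);
  }
  const articles = await response.json();

  // Render into the page; a saved static copy has no server to answer
  // tomorrow's URL, so an offline snapshot goes stale within a day.
  const container = document.getElementById("content");
  if (container) {
    container.textContent = JSON.stringify(articles, null, 2);
  }
}

loadContent().catch(console.error);
```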

Even then, nothing will prevent a determined, skilled attacker from getting an offline copy; you can just make it harder so it's not worthwhile.


As @Adnan has already pointed out in his answer, there is really no way of stopping a determined person from copying snapshots of your website. I used the word snapshots here because that's what such content scrapers (or harvesters) are really copying. They don't (or at least shouldn't) have access to your backend where your website contents are actually generated and displayed to the end user, so the best they can do is copy its output, output that you can generate in such a way that it changes over time or is adjusted to its intended recipient (DRM schemes, watermarking, ...), as @makerofthings7 pointed out in his answer.

So much for what's already been answered. But there is one thing about this threat that I feel hasn't yet been well covered in the answers mentioned. Namely, most of this content scraping is done by opportunistic, automated web crawlers; targeted attacks are seen a lot more rarely. Well, at least in numbers, so bear with me.

These automated crawlers can actually be blacklisted quite effectively through the use of various WAFs (some might even use honeypots to identify threats heuristically) that keep an updated database of blacklisted domains (CBLs or Community Ban Lists, DBLs or Domain Block Lists, DNSBLs or DNS-based Blackhole Lists, ...) from which such automated content scrapers operate. These WAFs will deny or grant access to your content-serving web application based on three main approaches:

  • Deterministic blacklisting: These are detections based on characteristics of the web requests that content scrapers make. Some of them are: the originating IP address, the reverse-DNS-resolved remote hostname, the forward-confirmed reverse DNS lookup (see the explanation in one of my questions here), the User-Agent string, the request URL (your web application could, for example, hide a honeytrap URL that a content scraper might follow in one of its responses, after it determines the request didn't come from a whitelisted address such as legitimate search engine crawlers / spiders), ... and other fingerprint information associated with automated web requests (a rough sketch of some of these checks follows this list).

  • Heuristic blacklisting: This is a way to identify a threat either by weighting the parameters of a single web request described in the deterministic approach (anti-spam filters use a similar approach based on calculating Bayesian probability), or by analyzing multiple web requests, looking at things such as: request rate, request order, number of illegal requests, ... which might help determine whether the requests come from a real, intended user or from some automated crawler.

  • External DNSBL/CBL/DBLs: I've already mentioned relying on external DNSBL/CBL/DBLs (e.g. Project Honey Pot, Spamhaus, UCEPROTECT, ...), most of which are actually a lot more useful than merely keeping track of spammers and spambot-infected hosts, and also record the type of offense (e.g. forum spammer, crawl-rate abuse, ...) on top of the IP addresses, hostnames, CIDR ranges, ... in the blacklists they publish. Some WAFs come with the ability to connect to these databases, saving you the trouble of being targeted by the same actor that might have already been blacklisted for the same detected activity on another web server.
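
To make the first two approaches more tangible, here is a rough sketch, assuming a hypothetical Express application: it combines a forward-confirmed reverse DNS check for clients claiming to be Googlebot, a honeytrap URL, and a crude per-IP request-rate heuristic. The paths, thresholds, and in-memory blacklist are illustrative assumptions only, not any particular WAF's API.

```typescript
// Rough sketch of deterministic and heuristic checks, assuming a hypothetical Express app.
import express from "express";
import { promises as dns } from "node:dns";

const app = express();
const blacklist = new Set<string>();

// Deterministic: forward-confirmed reverse DNS. Reverse-resolve the client IP,
// check the hostname belongs to Google, then resolve that hostname forward and
// confirm it maps back to the same IP.
async function isVerifiedGooglebot(ip: string): Promise<boolean> {
  try {
    const host = (await dns.reverse(ip))[0] ?? "";
    if (!host.endsWith(".googlebot.com") && !host.endsWith(".google.com")) {
      return false;
    }
    const forward = await dns.resolve4(host);
    return forward.includes(ip);
  } catch {
    return false; // no PTR record or lookup failure: treat the claim as unverified
  }
}

// Heuristic: a very crude request-rate signal; real WAFs weigh many more parameters.
const hits = new Map<string, number[]>();
const WINDOW_MS = 10_000;
const MAX_REQUESTS_PER_WINDOW = 50;

function exceedsRequestRate(ip: string): boolean {
  const now = Date.now();
  const recent = (hits.get(ip) ?? []).filter((t) => now - t < WINDOW_MS);
  recent.push(now);
  hits.set(ip, recent);
  return recent.length > MAX_REQUESTS_PER_WINDOW;
}

app.use(async (req, res, next) => {
  const ip = req.ip ?? "";
  const ua = req.get("User-Agent") ?? "";

  if (blacklist.has(ip)) {
    res.status(403).send("Forbidden");
    return;
  }
  if (exceedsRequestRate(ip)) {
    blacklist.add(ip);
    res.status(429).send("Too Many Requests");
    return;
  }
  // A client claiming to be Googlebot that fails FCrDNS is a spoofed crawler.
  if (ua.includes("Googlebot") && !(await isVerifiedGooglebot(ip))) {
    res.status(403).send("Forbidden");
    return;
  }
  next();
});

// Deterministic honeytrap: a URL never shown to human visitors (and disallowed
// in robots.txt); anything that requests it is assumed to be an automated scraper.
app.get("/trap/do-not-follow", (req, res) => {
  blacklist.add(req.ip ?? "");
  res.status(404).send("Not Found");
});

app.listen(3000);
```

In practice you would persist the blacklist, whitelist verified crawlers instead of re-checking them on every request, and feed confirmed offenders into (or pull them from) the external DNSBL/CBL/DBL services mentioned above.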

Now, one thing needs to be said quite clearly: none of these methods can be considered bulletproof! They will remove the majority of offending web requests, which is valuable on its own and will let you focus better on those harder-to-detect offenders that somehow bypassed your protections.

There are, of course, countless techniques for detecting automated crawlers / content scrapers (and their own countermeasures, i.e. detection-avoidance techniques) that I won't describe here, nor will I list all possible WAFs and their capabilities, not wanting to test your patience or stretch the purpose of this Q&A. If you'd like to read more about what techniques can be employed to thwart such unwanted visitors, then I recommend reading through the documentation of the OWASP Stinger and OWASP AppSensor projects.


Edit to add: Suggestions from the HTTrack authors can be read in the HTTrack Website Copier FAQ: How to limit network abuse - Abuse FAQ for webmasters. The reasons why a single deterministic method of detection won't work (short of blacklisting offending IP addresses after the fact, or through the experience of other honeynets) become rather apparent from a glance through the HTTrack Users Guide, if the adversary is set on obfuscating the spider's user agent string by setting it to any of the many user agent strings of real and legitimate web browsers, and on disrespecting robots.txt directives. To save you the bother of reading it: HTTrack includes simple configuration options and command-line flags that make it work in stealth mode and appear just as benign as any other legitimate user to simpler detection techniques.