Bad robots

One of the realities of running on a VPS is that bandwidth is limited. One of the things you have to watch for is bots, crawlers, and scrapers coming along and stealing your content and bandwidth.

Some of these bots are good and helpful, like the Google, Yahoo, and Bing crawlers. They index your site so it will appear in the search engines. Others, like the Yandex bot, crawl and index your pages for a Russian search engine. If you have an English-only site targeting US visitors, you might want to consider blocking the Yandex bot.

In my searches I also came across the Dotbot, which seems to crawl your pages just to get your response codes. I’m not sure what they do with the data, but in my opinion it’s better to block them.

So how does one block these bots? The Robots Exclusion Protocol states that a file, called robots.txt, can be put in your DocumentRoot with directives for bots to follow. For example, if your domain is example.com, your robots.txt should be at the following URL:

http://example.com/robots.txt

The robots.txt directives can tell bots which files they are allowed to index and which they are not. Well-behaved web robots will look at this file before attempting to crawl your site and obey the directives within. The directives are matched against each bot's User-Agent string. A couple of examples:

Block the Dotbot robot from crawling any pages:

User-agent: dotbot
Disallow: /

Block all robots from crawling anything under the /foo/ directory:

User-agent: *
Disallow: /foo/
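
Both rules can live in the same robots.txt. Each group starts with its own User-agent line, and groups are separated by a blank line:

User-agent: dotbot
Disallow: /

User-agent: *
Disallow: /foo/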

Google Webmaster Tools has an excellent tool for checking your robots.txt file; you'll need a Google account to use it.
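
If you'd rather do a quick check locally, Python's standard library ships with a robots.txt parser. Here's a minimal sketch that feeds it the rules from the examples above and asks what different bots are allowed to fetch (the example.com URLs and bot names are just placeholders):

from urllib.robotparser import RobotFileParser

# The same rules as above, fed to the parser as a list of lines
rules = """\
User-agent: dotbot
Disallow: /

User-agent: *
Disallow: /foo/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Ask whether a given bot may fetch a given URL
print(rp.can_fetch("dotbot", "http://example.com/"))            # False: dotbot is blocked everywhere
print(rp.can_fetch("Googlebot", "http://example.com/foo/bar"))  # False: /foo/ is off limits to all bots
print(rp.can_fetch("Googlebot", "http://example.com/"))         # True: everything else is allowed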

However, not all bots obey (or even look at) the robots.txt file. Those that don't will need special treatment in the .htaccess file, which I'll describe in another post.
