
Instructing web spiders with a robots.txt file

29 APR 2011

Most sites have a few pages you want to keep out of the reach of search engines. For example, there is no need to clutter Google’s results with your login page or other private pages. You can easily “tell” spiders which pages to stay away from with a robots.txt file.

A basic robots.txt

When a crawler visits your site, it first looks for a robots.txt file placed in the root of your domain that instructs it on which pages to ignore. Such a file is made of one or more records, and each record must contain a line addressing a certain user agent, followed by one or more Disallow lines. The syntax is therefore trivial – you don’t need to learn more than these two directives. For example, a robots.txt file made of

User-agent: googlebot
Disallow: /login.php
Disallow: /admin

would tell Googlebot to crawl neither http://yourdomain.com/login.php nor http://yourdomain.com/admin.

Wildcards – not so wild

User agents may be matched by a wildcard. Instead of having a separate User-agent: spidername section for each crawler, you can instruct them all to follow the subsequent Disallow lines by using User-agent: *. The star symbol matches any number of characters – so “spider*” can stand for “spiderA”, “spiderFromSomeSite” or “spiderFromThatOtherSite”. A question mark matches exactly one character, so “spider?” will work for “spiderA” or “spiderB”, but not for “spider X”.
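For example, assuming you simply want every crawler to skip a hypothetical /private directory, a single record with the wildcard user agent is enough:

User-agent: *
Disallow: /private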

Follow the syntax strictly

Robots.txt can be picky about syntax, so make sure you follow the structure:

  • Don’t mix and match. User-agent comes before Disallow; it won’t work if you put them the other way around.
  • Don’t use more than one URL in a Disallow line. “Disallow: /path1 /path2 /path3” won’t work; put each path on its own Disallow line.
  • Keep case sensitivity in mind. “/path1” and “/Path1” are two different URLs.
  • Inline comments don’t work. You might be used to placing a comment after a “#” sign at the end of a line in Perl, PHP or shell scripting. In robots.txt, each comment has to go on its own line. For example, “Disallow: /admin # don’t index /admin” might confuse some spiders, which will go looking for a “/admin#don’t” folder.
  • There is no “Allow” directive. As outlined in the beginning, the syntax of a robots.txt file is as simple as it gets. To allow spiders to index every file, simply place an empty Disallow line (see the example after this list).
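Putting these rules together, a file following the structure above might look like the sketch below (the googlebot record and the paths are only placeholders):

# keep Googlebot out of the private pages (comment on its own line)
User-agent: googlebot
Disallow: /login.php
Disallow: /admin

# every other spider may index everything
User-agent: *
Disallow: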

Beware of sensitive information

As a final note, remember that you shouldn’t use robots.txt to keep spiders away from truly sensitive information. Just because a standards-compliant web spider won’t access it doesn’t mean a malicious one (or even a human user) can’t or won’t. A setting like this is a no-no:

User-agent: *
Disallow: /admin/passwords.txt

Sensitive information such as usernames and passwords should be stored outside your htdocs path. Just because Google won’t crawl your passwords.txt file doesn’t mean a malicious user won’t open http://yourdomain.com/admin/passwords.txt in a browser and read your “hidden” content.
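As a rough illustration, assuming a typical setup where /var/www/htdocs is the public web root, the credentials file could live one level above it, where the web server cannot serve it to visitors:

/var/www/secrets/passwords.txt    (no URL maps to this file)
/var/www/htdocs/index.php         (publicly reachable)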
