The robots.txt file

When a web robot first visits a domain, it looks for a “robots.txt” file in the web root directory, where the public HTML pages are placed. This file must sit immediately after your domain name and not in another folder, i.e. it must be addressed as “www.yoursite.com/robots.txt”.
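As a small illustration of where the file lives, the Python sketch below derives the robots.txt address from any page URL on a site. The helper name robots_url and the example page URL are assumptions made for this example only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site hosting page_url.

    Hypothetical helper written for this example.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.yoursite.com/articles/page.html"))
# prints http://www.yoursite.com/robots.txt
```

Whatever folder the page sits in, the result always points at the web root.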

This is the standard protocol: if such a file exists, the robot reads its instructions on which areas of the domain it can or cannot visit. If there is no “robots.txt” file within a domain, the robot assumes that it can traverse all areas within the domain.
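The "no file means no restrictions" behaviour can be sketched in Python with the standard library's urllib.robotparser module. The rules_for helper, its status_code and body arguments, and the robot name AnyBot are assumptions standing in for a real HTTP fetch; this is one plausible way a robot might behave, not a description of any particular crawler.

```python
from urllib.robotparser import RobotFileParser


def rules_for(status_code, body=""):
    """Build a parser from an already-fetched robots.txt response.

    status_code and body are hypothetical inputs; no network access happens here.
    """
    parser = RobotFileParser()
    if status_code == 404:
        # No robots.txt on the site: parse nothing, so no rule restricts the robot.
        parser.parse([])
    else:
        parser.parse(body.splitlines())
    return parser


missing = rules_for(404)
present = rules_for(200, "User-agent: *\nDisallow: /cgi-bin/\n")

print(missing.can_fetch("AnyBot", "/cgi-bin/form.cgi"))  # True
print(present.can_fetch("AnyBot", "/cgi-bin/form.cgi"))  # False
```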

Generally, a robots.txt record starts with a ‘User-agent’ line that names a particular robot, or uses ‘*’ to address all robots, followed by one or more ‘Disallow’ lines listing the sections of the web site that robot must not crawl. An empty ‘Disallow’ value permits the robot to crawl all web pages.

To exclude a robot from visiting certain areas within your domain, like the /cgi-bin/ for example, simply include those details in your file.

User-agent: *
Disallow: /cgi-bin/

To exclude all robots from the entire site, disallow the root:

User-agent: *
Disallow: /

To allow robots to traverse your site, the robots.txt file should look like this:

User-agent: *
Disallow:

or just create an empty robots.txt file.
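To see how a well-behaved robot might read the rule sets above, here is a minimal Python sketch using the standard library's urllib.robotparser module. The can_visit helper, the robot name AnyBot, and the sample paths are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_visit(robots_txt, agent, path):
    """Return True if agent may fetch path under the given robots.txt text.

    Hypothetical helper written for this example.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


block_cgi = "User-agent: *\nDisallow: /cgi-bin/\n"  # one folder is off limits
block_all = "User-agent: *\nDisallow: /\n"          # the whole site is off limits
allow_all = "User-agent: *\nDisallow:\n"            # empty value, nothing is off limits
empty_file = ""                                     # an empty robots.txt file

print(can_visit(block_cgi, "AnyBot", "/cgi-bin/form.cgi"))  # False
print(can_visit(block_cgi, "AnyBot", "/index.html"))        # True
print(can_visit(block_all, "AnyBot", "/index.html"))        # False
print(can_visit(allow_all, "AnyBot", "/index.html"))        # True
print(can_visit(empty_file, "AnyBot", "/index.html"))       # True
```

Note that each field name is followed by a colon. A line without one is not a valid field, and Python's parser skips it.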

You can also vary the rules per robot, for example to exclude one particular robot while allowing all others:

User-agent: LousyBot
Disallow: /
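The per-robot rule above can be checked the same way. In this Python sketch, which again uses urllib.robotparser from the standard library, GoodBot is a made-up name standing in for any robot other than LousyBot, and /index.html is an arbitrary sample page.

```python
from urllib.robotparser import RobotFileParser

robots_txt = "User-agent: LousyBot\nDisallow: /\n"

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Only the named robot is excluded; no record applies to anyone else.
print(parser.can_fetch("LousyBot", "/index.html"))  # False
print(parser.can_fetch("GoodBot", "/index.html"))   # True
```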

Nothing forces robots to follow this protocol, though most do. You will, however, notice from your logs that some unscrupulous bots visit pages they are not supposed to.

Be aware that a robots.txt file does not guard your web site against robots crawling your web pages, so don’t keep any material you consider confidential in the public area of your site and rely on robots.txt to hide it.

For further details and frequently asked questions on the robots.txt file, visit robotstxt.org.
