Monday, March 5, 2007

Introduction to robots.txt

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
Robots can be used for a number of purposes:

* Indexing
* HTML validation
* Link validation
* "What's New" monitoring
* Mirroring

Search engine robots check for a special plain-text file called robots.txt in the root of each server. robots.txt implements the Robots Exclusion Protocol, which lets the web site administrator define which parts of the site are off-limits to robots with specific user-agent names. For example, web administrators can disallow access to cgi, private, and temporary directories because they do not want pages in those areas indexed.
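
Because the file always lives in the server root, a robot can work out where to look from any page address. The following is a minimal Python sketch of that step; the function name robots_url and the example.com addresses are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the server that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path
    # with /robots.txt and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/item.html?id=7"))
# http://www.example.com/robots.txt
```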

To exclude robots from the whole website, use the following robots.txt file:
User-agent: *
Disallow: /

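The asterisk (*) in the User-agent field is shorthand for "all robots". Because "Disallow: /" matches every path on the server, the whole site is off-limits. An empty Disallow line would have the opposite effect: nothing is disallowed, so everything is allowed.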
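
One way to check how such a file is read is Python's standard urllib.robotparser module. The sketch below compares "Disallow: /" with an empty Disallow line; the helper parse_rules, the robot name AnyBot, and the example.com address are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# "Disallow: /" blocks every path for every robot.
block_all = parse_rules("User-agent: *\nDisallow: /\n")
# An empty Disallow value blocks nothing.
allow_all = parse_rules("User-agent: *\nDisallow:\n")

page = "http://www.example.com/index.html"
print(block_all.can_fetch("AnyBot", page))  # False
print(allow_all.can_fetch("AnyBot", page))  # True
```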

Remove an image from Google Image Search

User-agent: Googlebot-Image
Disallow: /images/pigs.jpg
This removes the image named pigs.jpg.
To remove all the images on your site from Google's index, place the following robots.txt file in your server root:
User-agent: Googlebot-Image
Disallow: /
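
Rules placed under a named user agent apply only to that robot. As a rough illustration, the Python sketch below feeds the single-image rule to the standard urllib.robotparser module; the example.com addresses and the second robot name are placeholders.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot-Image
Disallow: /images/pigs.jpg
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

pigs = "http://www.example.com/images/pigs.jpg"
cows = "http://www.example.com/images/cows.jpg"

# The named robot is kept away from the listed image only.
print(parser.can_fetch("Googlebot-Image", pigs))  # False
print(parser.can_fetch("Googlebot-Image", cows))  # True
# A robot with a different name is not covered by this record.
print(parser.can_fetch("SomeOtherBot", pigs))     # True
```
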
Googlebot also understands simple patterns: "*" matches any sequence of characters, and a pattern may end in "$" to indicate the end of a name. To remove all files of a specific file type (for example, to remove .gif images while keeping .jpg images), you'd use the following robots.txt entry:
User-agent: Googlebot-Image
Disallow: /*.gif$
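
Python's standard urllib.robotparser does not interpret "*" or "$", so the sketch below translates a pattern into a regular expression by hand to show how the matching works. It is an illustration under that assumption, not Googlebot's actual implementation, and the names pattern_to_regex and is_blocked are made up.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each "*" match any characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # A trailing "$" pins the match to the end of the path.
    return re.compile(body + (r"\Z" if anchored else ""))


def is_blocked(pattern: str, path: str) -> bool:
    """Return True if path matches the Disallow pattern from its start."""
    return pattern_to_regex(pattern).match(path) is not None


print(is_blocked("/*.gif$", "/images/pigs.gif"))      # True
print(is_blocked("/*.gif$", "/images/pigs.jpg"))      # False
print(is_blocked("/*.gif$", "/images/pigs.gif?s=1"))  # False
print(is_blocked("/*?", "/page.php?id=3"))            # True
```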


A longer sample robots.txt that combines these rules:

# disallow all files in these directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /comments/
Disallow: /z/j/
Disallow: /z/c/
Disallow: /about/legal-notice/
Disallow: /about/copyright-policy/
Disallow: /about/terms-and-conditions/
Disallow: /about/feed/
Disallow: /about/trackback/
Disallow: /contact/
Disallow: /stats*
Disallow: /tag
Disallow: /category/uncategorized*


# disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.txt$

# disallow all files in /wp- directories
Disallow: /wp-*/

# disallow all URLs that contain a ?
Disallow: /*?

Robots.txt Checker Tool
http://tool.motoricerca.info/robots-checker.phtml
