Control search engine indexing with robots.txt

If you want to keep search engine robots from indexing all or part of your website, you can use a robots.txt file.

For it to work properly, it should be a plain ASCII text file named exactly “robots.txt” and it should be placed in the domain's root directory. A well-behaved robot will look at this location for instructions before indexing anything on the website.

You will also need a separate robots.txt in the root directory of every subdomain you have. A robots.txt file in any other location, such as a subdirectory, will be ignored.
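As a rough illustration of the location rule, the sketch below derives the robots.txt address a robot would request for a given page. The function name robots_url and the example.com addresses are made up for this example; only Python's standard urllib.parse module is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain), drop the path,
    # query and fragment, and point at /robots.txt in the root directory.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/sample_directory/sample_file.htm"))
print(robots_url("http://blog.example.com/2011/post.htm?page=2"))
```

Each subdomain maps to its own file, which is why www.example.com and blog.example.com need separate copies.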

The basic syntax involves two lines.

User-agent: the robot the following rule applies to
Disallow: the pages you want to block

Here is a robots.txt that will block an entire site. The asterisk means the rule applies to all robots.

User-agent: *
Disallow: /

This will allow robots to index the entire site. Removing the robots.txt file has the same effect.

User-agent: *
Disallow:
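If you want to check rules like these before uploading them, one option is Python's standard urllib.robotparser module. The sketch below is only a local test, not how any particular search engine evaluates the file; the helper name allowed, the agent name examplebot and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots_txt and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

page = "http://example.com/sample_file.htm"
print(allowed(BLOCK_ALL, "examplebot", page))  # False: everything is blocked
print(allowed(ALLOW_ALL, "examplebot", page))  # True: empty Disallow blocks nothing
```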

You can block a specific robot.

User-agent: googlebot
Disallow: /

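Block a specific directory. Make sure you include the trailing forward slash; without it, the rule also matches any path that merely begins with the same characters, such as /sample_directory.htm.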

User-agent: googlebot
Disallow: /sample_directory/
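The effect of the trailing slash can be seen with a small test using Python's standard urllib.robotparser, which matches Disallow values as path prefixes. This is a sketch under that parser's behaviour; the helper name allowed, the example.com URLs and the agent name examplebot are invented for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots_txt and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


WITH_SLASH = "User-agent: googlebot\nDisallow: /sample_directory/\n"
WITHOUT_SLASH = "User-agent: googlebot\nDisallow: /sample_directory\n"

inside = "http://example.com/sample_directory/page.htm"
lookalike = "http://example.com/sample_directory.htm"

print(allowed(WITH_SLASH, "googlebot", inside))        # False: inside the directory
print(allowed(WITH_SLASH, "googlebot", lookalike))     # True: only shares a prefix
print(allowed(WITHOUT_SLASH, "googlebot", lookalike))  # False: prefix now matches
print(allowed(WITH_SLASH, "examplebot", inside))       # True: rule names googlebot only
```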

Block a specific file.

User-agent: googlebot
Disallow: /sample_file.htm

Block multiple directories and files.

User-agent: *
Disallow: /sample_directory1/
Disallow: /sample_directory2/
Disallow: /sample_file1.htm
Disallow: /sample_file2.htm

Block everything for every robot except Googlebot, which can index everything.

User-agent: *
Disallow: /

User-agent: googlebot
Disallow:
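To confirm that the more specific googlebot group takes priority over the catch-all group, the file above can be run through Python's standard urllib.robotparser. This is a local sanity check under that parser's rules; examplebot and the example.com URL are placeholders standing in for any other robot and page.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://example.com/sample_file.htm"
googlebot_ok = parser.can_fetch("googlebot", url)
other_ok = parser.can_fetch("examplebot", url)

print(googlebot_ok)  # True: the googlebot group allows everything
print(other_ok)      # False: every other robot falls back to the * group
```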
