Do you have files listed in the search results that you do not really want there? Is a statistics site crawling your site and listing information about it that you would rather keep private? This can be fixed using a file in the root of your website or blog called robots.txt. This plain text file follows the Standard for Robot Exclusion, which defines instructions for telling bots which parts of your site they should not access.
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion.
Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or—especially in the FOAF community—Web scutters.
Web crawler – Wikipedia
What is a robots.txt File?
A robots.txt file is a text file placed in the root of your website or blog that contains instructions for the bots (spiders, search engine bots).
When a bot first arrives at the site it looks for the robots.txt file. If it does not find one, it will look for and gather information about all the files on your site. For example, if Google’s bot (Googlebot) visits your site and there is no robots.txt file, or no instruction in your robots.txt file limiting what it is allowed to look at, it will have a look at everything it can find, and eventually everything it found will be in its search results. That is not a good thing if there is some stuff you want kept out of the search results.
Even if you have a robots.txt file, there are some not so polite bots that will ignore your instructions. Anything you want kept away from these bots is best put in a password-protected area.
You should be aware that anyone can access your robots.txt file. Try it yourself. Type www.domainname.com/robots.txt in the browser address bar and you will see the contents of the site’s robots.txt file. This is why, even if polite bots obey your instructions, it is best to keep private stuff in password-protected folders. Any snoopy person or bot who sees a folder marked "keep out please" could go and have a look at what you are trying to hide.
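Because the file always sits in the root of the site, its address can be worked out from any page on that site. Here is a minimal Python sketch of that idea. The function name robots_url and the sample page address are made up for illustration, and the sketch assumes the page address includes the http:// or https:// part.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url.

    Assumes page_url includes a scheme such as http:// or https://.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.domainname.com/blog/some-post.html"))
# prints http://www.domainname.com/robots.txt
```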
Purpose of the robots.txt File
The purpose of the robots.txt file is to tell the nice bots which areas of the site you do not want included in their search index.
What Can You Do with a robots.txt File?
Each instruction, and each part of an instruction, needs to be on its own line.
Do not leave blank lines between the parts of an instruction. A blank line indicates the end of the instruction.
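To make the blank line rule concrete, here is a rough Python sketch that splits the text of a robots.txt file into separate instructions wherever a blank line appears. The function name split_groups, the sample file contents and the bot name BadBot are all made up for illustration. This is not how any particular search engine reads the file.

```python
def split_groups(text):
    """Split robots.txt text into instructions, using blank lines as separators."""
    groups, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line:
            current.append(line)       # still inside the same instruction
        elif current:
            groups.append(current)     # a blank line ends the instruction
            current = []
    if current:
        groups.append(current)         # the last instruction has no blank line after it
    return groups


# Sample contents; "BadBot" is a hypothetical bot name.
sample = "User-agent: *\nDisallow: /images/\n\nUser-agent: BadBot\nDisallow: /\n"
for group in split_groups(sample):
    print(group)
```

Each list that is printed holds the lines of one complete instruction, so a stray blank line in the middle of an instruction would show up as two separate lists.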
If you wish to pick and choose which files are indexed, it is easier to put those files you do not want indexed in a separate folder with the instruction in the robots.txt file for the bots to stay out of the folder.
Here are a few samples of what instructions you can place in your robots.txt file:
Disallow Indexing of Specific Folders
To disallow the indexing of the contents of a specific folder, the instruction is:
User-agent: *
Disallow: /images/
* indicates all bots
/images/ is the name of the folder. Don’t forget the / at the beginning of the folder name and at the end.
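If you want to see how a polite bot would read a folder rule, Python’s standard library includes a robots.txt parser you can try it on. The sketch below feeds it a rule that keeps all bots out of /images/ and asks about a few addresses. The helper name is_allowed, the bot name AnyBot and the file names are made up for illustration, and real search engine bots may interpret rules a little differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The folder rule: all bots (*) stay out of /images/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""


def is_allowed(robots_txt, user_agent, url):
    """Ask the standard library parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "AnyBot" and the file names below are hypothetical.
print(is_allowed(ROBOTS_TXT, "AnyBot", "http://www.domainname.com/images/photo.jpg"))  # False
print(is_allowed(ROBOTS_TXT, "AnyBot", "http://www.domainname.com/index.html"))        # True
```

The parser matches on the start of the path, which is why the slashes at the beginning and end of the folder name matter.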
Disallow Specific Bots
Maybe there is a specific bot you do not want to index your information.
User-agent: Bot name
Disallow: /
There are lists of User-agents and bad bots if you want to look up a specific User-agent/bot name.
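Here is a small Python sketch of what shutting out one bot looks like, using the standard library’s robots.txt parser. BadBot and OtherBot are made-up names used only for illustration, and the parser only shows how a polite bot would read the rule. It cannot stop a bot that ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

# "BadBot" is a hypothetical bot name; "Disallow: /" covers the whole site.
rules = ["User-agent: BadBot", "Disallow: /"]

parser = RobotFileParser()
parser.parse(rules)

page = "http://www.domainname.com/index.html"
badbot_allowed = parser.can_fetch("BadBot", page)
otherbot_allowed = parser.can_fetch("OtherBot", page)  # not named in the file

print(badbot_allowed, otherbot_allowed)  # False True
```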
Stop Images Indexed in Image Search
Some people want their images indexed in Google, Bing and Yahoo! image searches for the possible traffic, but if you don’t, you can let the image bots know this. For example, for Google’s image bot:
User-agent: Googlebot-Image
Disallow: /
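A rule aimed at the image bot should leave the regular search bots alone. The Python sketch below checks that with the standard library’s robots.txt parser by listing which of several bots are blocked from an image. The helper name blocked_bots and the image address are made up for illustration, and the result reflects this parser’s reading of the rule, not necessarily what each search engine does.

```python
from urllib.robotparser import RobotFileParser

# Keep Google's image bot out of the whole site; no other bot is mentioned.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /
"""


def blocked_bots(robots_txt, bots, url):
    """Return the bots from the list that the rules keep away from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


# The image address is hypothetical.
image = "http://www.domainname.com/images/photo.jpg"
print(blocked_bots(ROBOTS_TXT, ["Googlebot", "Googlebot-Image", "Bingbot"], image))
# prints ['Googlebot-Image']
```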
How to Create a robots.txt File
You will need a plain text editor, such as Notepad (which comes with Windows) or Notepad++. Word and other word processing software are not plain text editors.
You will also need a folder on your computer to store this file until you are finished editing it. You have a backup of your site – right? If not, create one! Use FTP software to back up your site. For WordPress we also have specific instructions: Backup WordPress.
- Open your plain text editor.
- Use File/Save As from its top menu bar to navigate to the folder which contains a local copy of your website or blog.
Make sure you are in the root of the folder, not inside a folder within the website folder.
- Name the file robots.txt in the File Name box.
- Left click Save to save the file.
The empty file’s screen becomes active again.
- Type in the instructions you wish to have in your robots.txt file (see above for samples).
- Save the file when you are done.
The file can be closed also.
- Using FTP software (or your web hosting File Manager function) upload the robots.txt file to the root of your website/blog.
The root of your website is the folder where your website files are. Sorry can’t be more specific as each web hosting setup is different. If you are not sure which folder, look at your web hosting’s documentation.
Check that you uploaded the robots.txt file to the right spot by opening your browser and typing http://www.yourdomainname.com/robots.txt in the address bar. If you can see the contents of the file you just created, you uploaded it to the right spot.
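The same check can be done from a script instead of the browser. This is only a sketch in Python. The function name fetch_robots and its opener parameter are made up for illustration, and the opener is there so the function can be tried out without a network connection.

```python
import urllib.request


def fetch_robots(site_url, opener=urllib.request.urlopen):
    """Return the text of the site's robots.txt, or None if it cannot be fetched."""
    url = site_url.rstrip("/") + "/robots.txt"
    try:
        with opener(url) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers a missing file (HTTP errors) as well as connection problems.
        return None


# Example use (needs a network connection, so it is left commented out):
# contents = fetch_robots("http://www.yourdomainname.com")
# print("Found it!" if contents is not None else "Not in the root of the site.")
```

If the function returns the text you typed into the file, it was uploaded to the right spot.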
Testing the robots.txt File
Once you have created a robots.txt file, test it to make sure there are no errors in it. Here are a few ways to test the file:
Google Webmaster Tools
The Robots.txt Checker testing tool is available to the public. Enter the web address of your robots.txt file in the box provided, then click the Check robots.txt button below. The resulting page explains each set of instructions you have entered. At the top of the page it tells you whether you have errors or not. The results also point out which line is incorrect.
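For a rough idea of the kind of checks such a tool makes, here is a very small Python sketch that reports the line number of anything that looks wrong. It is not the code behind any of the tools mentioned here. The function name check_robots and the list of accepted field names are assumptions made for illustration, and a real checker does far more.

```python
# Field names this sketch accepts; the list is an assumption, not an official one.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def check_robots(text):
    """Return a list of (line_number, message) for lines that look wrong."""
    problems = []
    in_group = False  # True once a User-agent line has started an instruction
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            in_group = False  # a blank line ends the current instruction
            continue
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if not line:
            continue
        field, separator, _value = line.partition(":")
        field = field.strip().lower()
        if not separator:
            problems.append((number, "missing ':' separator"))
        elif field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            in_group = True
        elif field in ("disallow", "allow") and not in_group:
            problems.append((number, f"'{field}' appears before any User-agent line"))
    return problems


print(check_robots("User-agent: *\nDisalow: /images/\n"))
# prints [(2, "unknown field 'disalow'")]
```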
Search Engine robots.txt Information
Below are links to two of the search engines’ robots.txt information:
- Block or remove pages using a robots.txt file – Google
- To crawl or not to crawl, that is BingBot’s question – Bing
Use the robots.txt File Carefully
Be sure to understand the instructions you are placing in the robots.txt file of your site. A simple mistake could be disastrous for your site.