A search engine is an automated system that explores the pages on your site using web robots (also known as web wanderers, crawlers, or spiders), programs that traverse the web automatically and build an index. Robots can also be used inappropriately; for example, spammers use them to scan pages for email addresses. There are several reasons you may not want search engines to access and index some or all of your website's pages. One typical need is to keep duplicate content, such as print versions or pages available both in HTML and PDF or word-processor formats, out of a search engine's index.

Well-behaved search engine robots obey a set of rules that you can specify, either per page or for the whole site. Note that these rules are advisory rather than enforced.

Using the robots meta tag

To prevent search engines from indexing a particular page, put the following <meta> tag inside that page's <head> section; that is, before the <body> tag:

<meta name="robots" content="noindex, nofollow, noarchive">

noindex prevents indexing of anything on the page, nofollow prevents the search engine from following the links on the page, and noarchive prevents it from keeping an archived (cached) copy of the page. You can also substitute noimageindex for noindex if you want the page's text to be indexed by search engines but not its images.
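For example, a page whose text may be indexed but whose images should stay out of image search might look like this (a minimal sketch; the title and body content are placeholders):

    <!DOCTYPE html>
    <html>
    <head>
      <title>Example page</title>
      <!-- Text may be indexed, but images on this page may not -->
      <meta name="robots" content="noimageindex">
    </head>
    <body>
      ...
    </body>
    </html>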

See also an external article on About the Robots <META> tag for more details.

Using robots.txt

To control access to your entire site from a central location, create a text-only file called robots.txt using a plain text editor and upload it to the \htdocs directory. In this file, you can specify general rules for all robots, followed by specific rules for particular robots and directories.

  • To allow all robots to crawl all files:
    User-agent: *
    Disallow:
  • To allow a single robot (here, Yandex) and exclude all others:
    User-agent: Yandex
    Disallow:

    User-agent: *
    Disallow: /
  • To keep all robots out:
    User-agent: *
    Disallow: /
  • To tell all robots not to enter four directories of a website:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    Disallow: /tmp/
    Disallow: /private/
  • To tell a specific robot not to enter a specific directory:
    User-agent: BadBot
    Disallow: /private/
  • To tell all robots not to access a specific file called example.html:
    User-agent: *
    Disallow: /domain1/example.html
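A compliant robot parses robots.txt and checks each URL against its rules before fetching. As a sketch of how such rules behave, Python's standard urllib.robotparser can evaluate the four-directory example (the rules string and example.com URLs here are illustrative):

```python
from urllib.robotparser import RobotFileParser

# robots.txt rules mirroring the four-directory example above
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A well-behaved crawler checks every URL before fetching it
print(parser.can_fetch("*", "http://example.com/index.html"))     # True
print(parser.can_fetch("*", "http://example.com/private/a.txt"))  # False
```

This also makes it easy to verify a robots.txt file locally before uploading it.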

See also an external article on About /robots.txt for more details.

Other methods

Several other methods are described in an external article, 6 methods to control what and how your content appears in search engines, which also covers the two methods described above in more detail.