Learn SEO: The Robots.txt File

The robots.txt file tells search engine crawlers whether they are allowed to request specific pages and files from your website. We use a robots.txt mainly to avoid overloading a site with requests, and it’s especially useful for crawl budget optimization.

Wikipedia’s robots.txt, with multiple groups, rules, and comments

Warning: A robots.txt does not keep a web page out of Google’s index. It only prohibits engines from crawling that page. If you want to remove a specific page or a collection of pages from the index, you need to use a different method, such as the noindex directive. 

A couple of things to keep in mind when it comes to robots.txt:

  • Not all search engines support robots.txt, and specific rules might only work for some of them. 
  • Different engines might interpret syntax differently.
  • A disallow rule for a page within the robots.txt doesn’t mean that it won’t be indexed. Either password-protect your pages or noindex them if you want them to stay hidden. 

File format and syntax guidelines

  • The file must always be a UTF-8 encoded plain text file. You can use almost any text editor to create a robots.txt, but avoid word processors since they save files in their own format and can add unexpected characters that cause problems. 
  • The name of the file must always be robots.txt
  • Only one robots.txt is allowed per site.
  • The robots.txt must be located at the root of the website host. For example, if the domain is fourth-p.com, the robots.txt is located at the following address: fourth-p.com/robots.txt (try it out)
  • A subdomain can have its own robots.txt at its root. 
  • Comments are allowed after a # (hash) character.
  • Only valid lines are considered. The others are ignored without warning or error.
  • Each line consists of a <field>:<value>
  • The <field> is case-insensitive, but whether the <value> is case-sensitive depends on what the <field> is.
  • Google enforces a size limit of 500 kibibytes. To reduce the size of your robots.txt, consolidate rules.
  • <field> elements with simple errors and typos are not supported, e.g. “useragent” instead of “User-agent”. 
  • Limited use of “wildcards” is supported by most major search engines (see the example after this list).
    • * designates zero or more instances of any valid character
    • $ designates the end of the URL
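
Here is a minimal sketch of the two wildcards in action; the paths are hypothetical:

User-agent: *
# $ marks the end of the URL: block any URL that ends in .pdf
Disallow: /*.pdf$
# * matches any run of characters: block checkout pages in every language folder, e.g. /en/checkout/
Disallow: /*/checkout/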

Robots.txt Groups

A robots.txt consists of one or more groups, each with one or multiple rules, one rule per line. A group ends when a new User-agent field is encountered or when the file ends. The last group can also have no rules, which means that its user-agent is implicitly allowed to crawl the entire site. 

The following information constructs a group. 

  • Who the group applies to (User-agent)
  • Which directories or files that agent can access (Allow)
  • Which directories or files that agent cannot access (Disallow)

Here are a couple of examples of robots.txt groups:

User-agent: Googlebot
Disallow: /*.gif$

User-agent: Googlebot-news
Disallow: /not-for-news/

User-agent: *
Allow: /
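
In these examples, Googlebot is not allowed to crawl any URL ending in .gif, Googlebot-news is kept out of everything under /not-for-news/, and every other crawler may crawl the whole site.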

Group fields 

  • User-agent
  • Disallow
  • Allow

One or more User-agent fields are required per group. A User-agent can contain the name of a specific crawler or a * wildcard that matches every crawler, for example User-agent: Googlebot or User-agent: *. 

At least one Disallow or one Allow is required per group.

Disallow directives specify a <path> which translates to a web page or file that the user-agent must not crawl. Similarly, the Allow directive specifies a <path> that the user-agent can access. 
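
A sketch of a single group that lists several user-agents and mixes both directives; the crawler names are real tokens, but the paths are hypothetical:

User-agent: Googlebot
User-agent: Bingbot
# Keep both crawlers out of the /admin/ directory...
Disallow: /admin/
# ...except for one public page inside it
Allow: /admin/help.html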

The order of precedence when it comes to robots.txt group rules is as follows. 

  • The most specific rule, based on the length of the path, trumps the less specific one. 
  • In the case of conflicting rules, the least restrictive rule is used.

For example:

Allow: /folder
Disallow: /folder

In this case, a crawler will follow the Allow rule, since it is the least restrictive.
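
When the paths differ in length, the longer, more specific path wins instead. A sketch with hypothetical paths:

User-agent: *
Disallow: /folder/
Allow: /folder/public/

Here a URL such as /folder/public/page.html can be crawled because the Allow rule has the more specific path, while everything else under /folder/ stays blocked.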

Non-group fields 

Sitemap (Google supported)

The Sitemap is an optional non-group-member line. A sitemap is an excellent way to point engines to the pages they should crawl. It is usually placed either at the top or at the bottom of the file. An absolute URL must be used, e.g. https://example.com/sitemaps.xml.
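
A sketch of a typical placement, reusing the example URL above; the Disallow path is hypothetical:

User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemaps.xml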

Host (Not Google Supported)

The host field allows websites with multiple mirrors to specify their preferred domain.
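
A sketch of the syntax as engines that support the field have used it; the domain is hypothetical:

# Preferred mirror of the site
Host: www.example.com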

Crawl-delay (Not Google Supported)

The Crawl-delay field throttles how frequently the user-agent visits the site. 
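
A sketch, assuming an engine that reads the value as the number of seconds to wait between requests (interpretation varies by engine):

User-agent: *
# Ask supporting crawlers to wait 10 seconds between requests
Crawl-delay: 10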

Robots.txt FAQ

  • Where can I find my website’s robots.txt?

    The robots.txt file lives at the root level of your site.

  • How do I edit my robots.txt?

    The simplest way to edit your robots.txt is via FTP/SFTP. Simply open your root folder, find your “robots.txt”, download it, edit it, and re-upload it.

    I would suggest you make a copy of it first, in case you forget what your previous settings looked like.
