
An Ideal eCommerce robots.txt File

Published on 2023-01-31 by Rob Gould

What is a Robots.txt File?

A robots.txt file is a, well, text file that helps web crawlers and other automated bots navigate a site. The file contains instructions on which pages or sections of the site should or should not be visited. The file must be placed in the root directory of the website and its name must be robots.txt. Search engine crawlers will look for this file first, before crawling the site, and will (mostly) obey the rules specified in it.

Why do you need robots.txt?

A robots.txt file is useful for several reasons:

  1. Prevent sensitive or confidential information from being indexed.

  2. Prevent the crawling of irrelevant pages, such as the shopping basket.

  3. Avoid indexing duplicate content.

  4. Prevent temporary pages, such as areas of the site under development, from being indexed.

  5. Limit the number of pages on the site that are crawled.

  6. Control the information that is available in search engine results.

Robots.txt is used across almost every site, but it does not guarantee that a page will or will not be indexed. Some user agents bypass it entirely, while others treat it only as advice that is not necessarily adhered to.

Where does robots.txt go on a site?

The "robots.txt" file should be placed in the root directory of a website.

For example, if the website is “site.com”, the robots.txt file will be located at “site.com/robots.txt”. This consistent location allows search engine bots to find and access the file.

What do you add to your robots.txt on your eCommerce sites?

There is no single answer that fits every site; it will come down to the platform and the site's requirements.

But there are some directives that are common across eCommerce websites.

Start with sensitive areas of the site, including but not limited to the checkout pages and user account pages.

Perhaps you have several URLs for the same product and need to restrict crawling to just one of them.

Then there are those pages that do their thing in the background and are never seen by anyone visiting the site. They're functional, but they don't need to be crawled, so we can block those as well. A rough sketch pulling these together is below.
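
As an illustration only, a starting point for a typical store might look something like this; the paths (/checkout/, /account/, /cart/) and the search parameter are placeholders rather than a recommendation for your specific platform:

User-agent: *
# Placeholder paths for sensitive and functional areas
Disallow: /checkout/
Disallow: /account/
Disallow: /cart/
# Placeholder pattern for internal search result URLs
Disallow: /*?q=
Sitemap: https://www.site.com/sitemap.xml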

Still, remember, the robots.txt file is only a request and some crawlers may ignore it.

Keep an eye on Google Search Console to see which pages are being crawled so you can update your robots.txt accordingly.

User-agent in robots.txt

The User-agent in a robots.txt file is used to specify which crawlers the instructions apply to. It will typically look like this for Googlebot:

User-agent: Googlebot

This tells Googlebot that the following instructions apply just to it.

To specify all User-agents, simply use a wildcard.

User-agent: *

How do 'Disallow' commands work in a robots.txt file?

The Disallow command is used to tell search engine crawlers which pages or sections of a website should not be crawled.

This is used alongside a User-agent to specify which crawler the disallow instruction applies to.

For example:

User-agent: Googlebot
Disallow: /thisfolder

User-agent: *
Disallow: /thatfolder

This tells Googlebot not to crawl /thisfolder and all other crawlers not to crawl /thatfolder.

This command is not a guarantee that these pages will not be indexed; it only asks that they are not crawled.

Remember, this will not make any area of your website more secure. In fact, if people visit your robots.txt file, which, remember, is in a standard location and so easy to find, they will see a list of all of these folders. So before adding anything in, make sure it's something you're happy for humans to find, even if you don't want a robot to see it.

Robots.txt File Limitations

This is purely a simple text file and, while effective, it does have its limitations.

The first two I've mentioned a couple of times in this article already.

Firstly, it’s only a suggestion to the bot, not a direct order so things may still be crawled.

Secondly, it doesn’t provide any layer of security to the site itself.

The final point to note is that it only 'works' when the bot visits the robots.txt file itself. So if you make changes, they won't come into effect until the bot's next visit.

What is the Sitemaps protocol? Why is it included in robots.txt?

A Sitemap (usually called sitemap.xml) allows search engine bots to see a full list of pages on a website.

The XML file not only contains a list of URLs but also data about each page, such as when it was last updated and its relative priority compared to the rest of the website.
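
To illustrate, a minimal sitemap with a single entry might look something like this (the URL, date and priority are placeholder values):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- Placeholder product URL -->
    <loc>https://www.site.com/example-product</loc>
    <!-- When the page was last updated -->
    <lastmod>2023-01-15</lastmod>
    <!-- Relative priority compared to the rest of the site -->
    <priority>0.8</priority>
  </url>
</urlset>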

The sitemap is usually referenced in the robots.txt file with its own directive.

For example, in the robots.txt file:

Sitemap: https://www.site.com/sitemap.xml

This allows a bot to easily find a sitemap and therefore have access to a list of all the pages on a website. This is especially useful for large eCommerce sites as it should include a full list of products that may not all be internally linked on the site.

Technical robots.txt syntax

OK, we’ve covered the basics so now it’s time to get technical.

I’ve mentioned already that the file must be placed in the root directory of the website and that it must be called robots.txt.

Also, the file must be encoded in UTF-8. Each instruction needs to be on a new line and each line should be no more than 512 bytes. The file itself shouldn't be more than 500KB.

You'll also want to look into using:

  • Wildcards (*) to block large chunks of your site in one go

  • Dollar Sign ($) to indicate the end of a URL

  • Question Marks (?) to block pages with parameters

For example, /*?$ matches any URL that begins with your domain name, is followed by anything else and ends in a question mark.
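
To make these patterns a little more concrete, here are a few examples of the kind of rules you might write; the paths and the parameter name are placeholders, so adapt them to your own URL structure:

# Block any URL that contains a query string
Disallow: /*?

# Block any URL that ends in .pdf
Disallow: /*.pdf$

# Block sorted or filtered listing pages that use a "sort" parameter (placeholder name)
Disallow: /*?sort=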

Also, remember that robots.txt is case-sensitive and that "Disallow: /" will block your entire site from being crawled, so take care!

Robots.txt vs Meta Robots vs X-Robots

So we’ve covered what a robots.txt file does, and we’re up to date on that.

The meta robots tag is an HTML tag added to the head of a webpage to tell bots how the page should be indexed.

It will look like this: <meta name="robots" content="instruction">.

The instruction will be "index" or "noindex", and "follow" or "nofollow", which tell bots whether the page can be indexed and whether the links on that page should be followed.
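
For example, a page you want kept out of search results but whose links you're still happy for bots to follow would carry something like:

<meta name="robots" content="noindex, follow">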

"x-robots" is a header that provides similar instructions as the meta robots tag, but it's more flexible. For example, “X-Robots-Tag: noindex” provides a noindex instruction.

But again, as always, bots can either ignore these completely or just take them under advisement. None of this guarantees that a page won’t be indexed if you don’t want it to be.

Summary

The ideal eCommerce robots.txt will vary from site to site, and even a file with several 'Disallow' directives may simply be ignored by some bots anyway!

But a mix of robots.txt and header directives along with other on-page signals and monitoring of GSC should provide enough information to take the correct actions.

The ideal eCommerce robots.txt file is out there for you, you just have to do the work and find it.