Generate a Robots.txt File

When a crawler visits a website, such as http://www.yourshophere.com/, it first checks for http://www.yourshophere.com/robots.txt. If the document exists, the crawler analyzes its contents to determine which pages on the site it is allowed to index. You can customize the robots.txt file to apply only to specific robots, and to disallow access to specific directories or files.
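
For example, a robots.txt file can name a specific crawler in its own record and block individual directories or files for that crawler only. The following is a minimal sketch; the directory and file names are hypothetical and only the structure is significant:

    User-agent: Googlebot            # applies only to Google's crawler
    Disallow: /checkout/             # hypothetical directory to block
    Disallow: /private/report.html   # hypothetical file to block

    User-agent: *                    # all other robots
    Disallow:                        # no restrictions for other robots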

What Is the Robots.txt File?

Here is a sample robots.txt file that prevents all robots from visiting the entire site:
    User-agent: *    # applies to all robots
    Disallow: /      # disallow indexing of all pages
The robot looks for a /robots.txt URI on your site, where a site is defined as an HTTP server running on a particular host and port number. There can only be a single /robots.txt on a site.
Note: The robots.txt file is served with a UTF-8 content type.

Creating Robots.txt for Single or Multiple Sites

If you want to create a robots.txt for one or more sites individually, you can use Business Manager to create the file. This robots.txt file is served to any requesting crawlers from the application server. It's stored as a site preference and can be replicated from one instance to another.

If you want to create a single robots.txt file that can be used for multiple sites, you can use Google's Webmaster Tools to create the file; however, you must have a Google account to do so. If you choose not to use Google, you can use other third-party tools instead. After you create the file, you must upload it to your cartridge. You must also invalidate the Static Content Cache for a new or different robots.txt file to be generated or served.

Note: The person who creates the robots.txt file must have permission to turn off Storefront Password Protection, to use the Business Manager Administration module, and to upload code to a cartridge, depending on how you want to generate your file. If no one in your organization has permission to turn off Storefront Password Protection, contact Commerce Cloud Support.
Tips:
  • You can write up to 50,000 characters to this file in Business Manager.
  • URIs are case-sensitive, and the "/robots.txt" string must be all lowercase.
  • Blank lines are not permitted within a single record in the "robots.txt" file.
  • There must be exactly one User-agent field per record. The robot should be liberal in interpreting this field.
  • A case-insensitive substring match of the name without version information is recommended.
  • If the value is "*", the record describes the default access policy for any robot that has not matched any of the other records.
  • Multiple such records are not allowed in the "/robots.txt" file.
  • The "Disallow" field specifies a partial URI that isn't to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example,
    Disallow: /help
    disallows both /help.html and /help/index.html, whereas
    Disallow: /help/
    would disallow /help/index.html but allow /help.html. An empty value for Disallow, indicates that all URIs can be retrieved.
  • At least one Disallow field must be present in the robots.txt file.
  • A field like Allow: / isn't valid and will be ignored.
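
Putting these rules together, a well-formed robots.txt file might look like the following sketch. The robot name and paths are illustrative only:

    User-agent: BadBot         # exactly one User-agent field in this record
    Disallow: /                # this robot may not retrieve any URI

    User-agent: *              # default policy for all other robots
    Disallow: /help/           # blocks /help/index.html but not /help.html
    Disallow: /tmp             # blocks /tmp.html, /tmp/index.html, and so on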

Understanding Storefront Password Protection and the Robots.txt File

Before creating a robots.txt file, it is important to understand how the Storefront Password Protection settings for your site affect what can be crawled. If Storefront Password Protection is enabled, a robots.txt file is automatically generated and denies access to all static resources for a site. If Storefront Password Protection is disabled, the robots.txt file determines whether content is crawled. Because Storefront Password Protection automatically generates a robots.txt file, it must be disabled before you can specify another type of robots.txt file.

To Create a Robots.txt File Using Business Manager

  1. Select site > SEO > Robots.
  2. Select the instance type for which to create a robots.txt file.
    Note: If you want to create a robots.txt file for a Production instance, you can do so on a Staging instance and replicate the site preferences, where the robots.txt file definition is stored, from the Staging instance to the Production instance.
  3. Select one of the following options:
    • Use the robots.txt file from a deployed cartridge: Use Google Webmaster Tools or another third-party tool to generate your robots.txt file, then add the file to a cartridge on your site path. There can only be one robots.txt file per site. This option is most useful if you want to use the same robots.txt file for multiple sites.

      This is not recommended, because you usually want different settings for different instance types. For example, you don't want your sandbox or staging sites to be crawled, but you do want your production sites to be crawled, and sharing one file can cause issues when replicating code to production. Typically, you select this option only before a site goes live, to test the robots.txt file.

    • Define an instance type-specific robots.txt: Use this option to have Salesforce B2C Commerce generate a robots.txt file for you or specify a custom robots.txt file for each of your instances.
      • All spiders are allowed to access any static resources (recommended for Production): Use this if you want your storefront to be crawled and available to external search engines, such as Google. This generates a site-specific robots.txt file that indicates spiders can crawl the static resources for the site.
      • All spiders are disallowed to access any static resources (recommended for Staging): Use this if you do not want your storefront to be crawled and available to external search engines, such as Google. This generates a site-specific robots.txt file that indicates to spiders that they shouldn't crawl the static resources for the site.
      • Custom robots.txt definition: Use this option if you want to control which parts of your storefront are crawled and available to external search engines. A sample custom definition appears after this procedure.
  4. Click Apply.
  5. Select Administration > Sites > Manage Sites > site > Cache.
  6. In the Static Content and Page Caches section, click Invalidate.
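
If you choose Custom robots.txt definition, you supply the file contents yourself in Business Manager. The following is a minimal sketch of what such a definition might contain; the paths are hypothetical and should be replaced with the parts of your storefront you don't want crawled:

    User-agent: *              # applies to all crawlers
    Disallow: /checkout/       # hypothetical path to keep out of search indexes
    Disallow: /account/        # hypothetical path to keep out of search indexes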

To Create a Robots.txt File Using Google's Webmaster Tools

  1. Log in to Google Webmaster Tools using your Google account.
  2. Click Tools.
  3. Click Generate robots.txt.
  4. Specify rules for site access.
  5. In the Files or Directories box, type /.
  6. Add extra files or directories on separate lines.
  7. Click Add to generate the code for your robots.txt file.
  8. Save your robots.txt file by downloading the file or copying the contents to a text file and saving it as robots.txt.
  9. Select Administration > Sites > Manage Sites > site > Cache tab.
  10. In the Static Content and Page Caches section, click Invalidate.

For information on where to upload your robots.txt file, see Uploading Your Robots.txt File.

Understanding Caching and the Robots.txt File

If caching isn't enabled on your Staging site, any changes to the robots.txt file are detected immediately. However, if caching is enabled for your Staging instance, you must invalidate the Static Content Cache for a new or different robots.txt file to be generated or served. This requires permissions in the Administration module. The procedures in this topic include steps for invalidating the cache, though you might not need them if you don't have caching enabled.

Best Practices to Control Crawler Activity on the Storefront

If you already have entries in your robots.txt file, add the following directives at the bottom.


               # Search refinement URL parameters
               Disallow: /*pmin*
               Disallow: /*pmax*
               Disallow: /*prefn1*
               Disallow: /*prefn2*
               Disallow: /*prefn3*
               Disallow: /*prefn4*
               Disallow: /*prefv1*
               Disallow: /*prefv2*
               Disallow: /*prefv3*
               Disallow: /*prefv4*
               Disallow: /*srule*
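
Major crawlers such as Googlebot treat the * in these patterns as a wildcard, so each directive blocks any URL that contains the named refinement or sorting parameter. The directives only take effect under a User-agent record; appended to an existing file, the result might look like the following sketch (the existing entry is hypothetical):

               User-agent: *
               Disallow: /checkout/       # hypothetical existing entry
               # Search refinement URL parameters (appended at the bottom)
               Disallow: /*pmin*
               Disallow: /*srule*
               # followed by the remaining pmax, prefn, and prefv directives listed above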

Set the Googlebot crawl rate to Low through Google Webmaster Tools, because Google ignores the crawl-delay directive in robots.txt, as outlined in https://support.google.com/webmasters/answer/48620?hl=en.