Sitemap Topology

To include your site in its search results, Google crawls it with Googlebot. The sitemap you provide tells Google which pages on your site are new or changed. When Google requests the sitemap, the SiteMap pipeline in your application serves it using the SendGoogleSiteMap pipelet.

The sitemap is composed of one index file (sitemap_index.xml) and one or more sitemap files (sitemap_0.xml, sitemap_1.xml, and so on).

The index file is the file you register with the search engine; its only purpose is to list the locations of the individual sitemap files. The number of sitemap files is determined by the configurable maximum number of URLs per sitemap file (for example, 50,000) or by a maximum size of 10 MB per sitemap file, whichever limit is reached first.
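For illustration, a minimal sitemap index following the sitemaps.org protocol might look like the following sketch. The host name www.example.com and the <lastmod> dates are placeholders, not values produced by your application:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- One <sitemap> entry per generated sitemap file. -->
      <sitemap>
        <loc>https://www.example.com/sitemap_0.xml</loc>
        <lastmod>2024-01-15</lastmod>
      </sitemap>
      <sitemap>
        <loc>https://www.example.com/sitemap_1.xml</loc>
        <lastmod>2024-01-15</lastmod>
      </sitemap>
    </sitemapindex>

A new sitemap file is started whenever the current one reaches either limit, so the index grows with the number of URLs on your site.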

Note: You can use the SendGoogleSiteMap pipelet to serve sitemap requests from any search engine, not only Google.

Googlebot then indexes your site, which tells Google exactly what information each page contains. Googlebot uses the robots.txt file to determine which parts of the site to index, and it doesn't index parts of your site disallowed by the robots.txt file, even if those URLs are included in your sitemap.
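As a sketch, a robots.txt that keeps crawlers out of a hypothetical /checkout/ path while pointing them at the sitemap index might look like this (the path and host name are illustrative, not required values):

    # Applies to all crawlers, including Googlebot.
    User-agent: *
    # URLs under /checkout/ aren't indexed, even if a sitemap lists them.
    Disallow: /checkout/
    # Advertises the sitemap index registered with the search engine.
    Sitemap: https://www.example.com/sitemap_index.xml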

You can generate a robots.txt file using Google Webmaster Tools and reuse it across several sites. You can also generate the robots.txt file in Business Manager and serve it to crawlers on a per-site basis.

Related Links

Generating Sitemaps

Generating a Robots.txt File