What are sitemaps?
Sitemaps are a way to tell Google about pages on your site that we might not otherwise discover. In its simplest terms, an XML Sitemap (usually called a Sitemap, with a capital S) is a list of the pages on your website. Creating and submitting a Sitemap helps ensure that Google knows about all the pages on your site, including URLs that may not be discoverable through Google’s normal crawling process.
You can also use Sitemaps to provide Google with metadata about specific types of content on your site, including video, images, mobile, and News. For example, a video Sitemap entry can specify a video’s running time, category, and family-friendly status; an image Sitemap entry can provide information about an image’s subject matter, type, and license. A Sitemap can also carry additional information about your pages, such as when each one was last updated and how often you expect it to change. We recommend that you use a separate Sitemap to submit News information.
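Concretely, a minimal Sitemap that lists one page, along with the optional last-modified date and expected change frequency described above, looks like this (the example.com URL is a placeholder):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <!-- Optional hints: when the page last changed, and how often it changes -->
    <lastmod>2012-06-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
```

Additional `<url>` entries are added for each page you want to submit.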
Sitemaps are particularly helpful if:
- Your site has dynamic content.
- Your site has pages that aren’t easily discovered by Googlebot during the crawl process—for example, pages featuring rich AJAX or images.
- Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn’t well linked, it may be hard for us to discover it.)
- Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.
Google doesn’t guarantee that we’ll crawl or index all of your URLs. However, we use the data in your Sitemap to learn about your site’s structure, which allows us to improve our crawl schedule and do a better job crawling your site in the future. In most cases, webmasters will benefit from Sitemap submission, and in no case will you be penalized for it.
Google adheres to Sitemap Protocol 0.9 as defined by sitemaps.org. Sitemaps created for Google using Sitemap Protocol 0.9 are therefore compatible with other search engines that adopt the standards of sitemaps.org.
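As a sketch of what producing a Sitemap Protocol 0.9 file involves, the following uses Python’s standard library to emit the XML; the URL and change frequency are hypothetical placeholders, and real sites would typically rely on an existing sitemap generator or CMS plugin instead:

```python
# Sketch: build a Sitemap Protocol 0.9 document with Python's standard
# library. The page data below is a made-up example.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """pages: list of (url, lastmod, changefreq) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, changefreq in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod        # optional hint
        ET.SubElement(url, "changefreq").text = changefreq  # optional hint
    return ET.tostring(urlset, encoding="unicode")

xml_doc = build_sitemap([
    ("http://www.example.com/", "2012-06-01", "weekly"),
])
print(xml_doc)
```

Because the output follows the sitemaps.org standard, the same file can be submitted to any search engine that supports the protocol.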
Finding information by crawling
We use software known as “web crawlers” to discover publicly available webpages. The best-known crawler is called “Googlebot.” Crawlers look at webpages and follow links on those pages, much like you would if you were browsing content on the web. They go from link to link and bring data about those webpages back to Google’s servers.
The crawl process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As our crawlers visit these websites, they look for links to other pages to visit. The software pays special attention to new sites, changes to existing sites, and dead links.
Computer programs determine which sites to crawl, how often, and how many pages to fetch from each site. Google doesn’t accept payment to crawl a site more frequently for our web search results. We care more about having the best possible results because in the long run that’s what’s best for users and, therefore, our business.
Choice for website owners
Most websites don’t need to set up restrictions for crawling, indexing or serving, so their pages are eligible to appear in search results without having to do any extra work. That said, site owners have many choices about how Google crawls and indexes their sites through Webmaster Tools and a file called “robots.txt”. With the robots.txt file, site owners can choose not to be crawled by Googlebot, or they can provide more specific instructions about how to process pages on their sites.
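For instance, a robots.txt file placed at the root of a site could keep Googlebot out of one directory while leaving the rest of the site open to all crawlers (the directory name here is a made-up example):

```
# Keep Googlebot out of one directory
User-agent: Googlebot
Disallow: /private/

# All other crawlers may fetch everything
User-agent: *
Disallow:
```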
Site owners have granular choices and can choose how content is indexed on a page-by-page basis. For example, they can opt to have their pages appear without a snippet (the summary of the page shown below the title in search results) or a cached version (an alternate version stored on Google’s servers in case the live page is unavailable). Webmasters can also choose to integrate search into their own pages with Custom Search.
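These page-by-page choices are typically expressed with a robots meta tag in a page’s HTML; for example, the nosnippet and noarchive directives ask search engines not to show a snippet or a cached link for that particular page:

```html
<!-- In the page's <head>: suppress the snippet and the cached copy
     for this page in search results -->
<meta name="robots" content="nosnippet, noarchive">
```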
Organizing information by indexing
The web is like an ever-growing public library with billions of books and no central filing system. Google essentially gathers the pages during the crawl process and then creates an index, so we know exactly how to look things up. Much like the index in the back of a book, the Google index includes information about words and their locations. When you search, at the most basic level, our algorithms look up your search terms in the index to find the appropriate pages.
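The book-index analogy can be illustrated with a toy inverted index, in which each word maps to the pages (and word positions) where it appears. This is a deliberate simplification for illustration only, not Google’s actual data structure:

```python
# Toy inverted index: maps each word to the pages and word positions
# where it occurs. The page URLs and text below are made-up examples.
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping page_url -> page text."""
    index = defaultdict(list)
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((url, position))
    return index

def lookup(index, term):
    """Return the sorted URLs of pages containing the term."""
    return sorted({url for url, _ in index.get(term.lower(), [])})

pages = {
    "http://example.com/a": "dogs are friendly animals",
    "http://example.com/b": "cats and dogs",
}
index = build_index(pages)
```

A query such as `lookup(index, "dogs")` then finds every page containing that word, much as you would scan a book’s index for a term and turn to the listed pages.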
The search process gets much more complex from there. When you search for “dogs” you don’t want a page with the word “dogs” on it hundreds of times. You probably want pictures, videos or a list of breeds. Google’s indexing systems note many different aspects of pages, such as when they were published, whether they contain pictures and videos, and much more. With the Knowledge Graph, we’re continuing to go beyond keyword matching to better understand the people, places and things you care about.