Consolidate duplicate URLs

30/11/2019

Summary

If you have a single page accessible by multiple URLs, or different pages with similar content (for example, a page with both a mobile and a desktop version), Google sees these as duplicate versions of the same page. Google will choose one URL as the canonical version and crawl that, and all other URLs will be considered duplicate URLs and crawled less often.

If you don't explicitly tell Google which URL is canonical, Google will make the choice for you, or might consider them both of equal weight, which might lead to unwanted behavior, as explained below in Why should I choose a canonical URL?

More details

This is probably more information about canonicalization than you need to know, so feel free to skip it. However, we'll provide it in case you like this sort of thing.

When Googlebot indexes a site, it tries to determine the topics covered in each page. If Googlebot finds multiple pages on the same site that seem to be about the same thing, it chooses the page that it thinks is the most complete and useful, and marks it as canonical. The canonical page will be crawled most regularly; the duplicates are crawled less frequently in order to save crawling budget on your site. So if you don't tell Googlebot which page is canonical, and Googlebot picks a different page from the one you consider canonical, you might be spending energy updating a page that Googlebot won't index very often or display in search results.

Google uses the canonical pages on your site as the gold standard of your site's content when evaluating content and quality. The Google Search result usually points to the canonical page, unless one of the duplicates is explicitly better suited to a user's query: for example, the search result will probably point to the mobile page if the user is on a mobile device, even if the desktop page is marked as canonical.

Google chooses the canonical page based on a number of factors (or signals), such as whether the page is served via http or https; the user's declared preferred domain; page quality; presence of the URL in a sitemap; and any "rel=canonical" labeling. You cannot force Google's choice of canonical page, but you can influence the choice by using one or more of the techniques shown here.

Why would I have similar/duplicate pages?

There are valid reasons why your site might have different URLs that point to the same page, or have duplicate or very similar pages at different URLs. Here are the most common:

To support multiple device types:

https://example.com/news/koala-rampage
https://m.example.com/news/koala-rampage
https://amp.example.com/news/koala-rampage

To enable dynamic URLs for things like search parameters or session IDs:

https://www.example.com/products?category=dresses&color=green
https://example.com/dresses/cocktail?gclid=ABCD
https://www.example.com/dresses/green/greendress.html

If your blog system automatically saves multiple URLs when you position the same post under multiple sections:

https://blog.example.com/dresses/green-dresses-are-awesome/
https://blog.example.com/green-things/green-dresses-are-awesome/

If your server is configured to serve the same content for www/non-www http/https variants:

http://example.com/green-dresses
https://example.com/green-dresses
http://www.example.com/green-dresses

If content you provide on a blog for syndication to other sites is replicated in part or in full on those domains:

https://news.example.com/green-dresses-for-every-day-155672.html (syndicated post)
https://blog.example.com/dresses/green-dresses-are-awesome/3245/ (original post)

Why should I choose a canonical URL?

There are a number of reasons why you would want to explicitly choose a canonical page in a set of duplicate/similar pages:

To specify which URL you want people to see in search results. You might prefer people reach your green dresses product page via https://www.example.com/dresses/green/greendress.html rather than https://example.com/dresses/cocktail?gclid=ABCD.
To consolidate link signals for similar or duplicate pages. It helps search engines to be able to consolidate the information they have for the individual URLs (such as links to them) into a single, preferred URL. This means that links from other sites to http://example.com/dresses/cocktail?gclid=ABCD get consolidated with links to https://www.example.com/dresses/green/greendress.html.
To simplify tracking metrics for a single product/topic. With a variety of URLs, it's more challenging to get consolidated metrics for a specific piece of content.
To manage syndicated content. If you syndicate your content for publication on other domains, you want to consolidate page ranking to your preferred URL.
To avoid spending crawling time on duplicate pages. You want Googlebot to get the most out of your site, so it's better for it to spend time crawling new (or updated) pages on your site, rather than crawling the desktop and mobile versions of the same pages.

Which URL does Google consider canonical (or duplicate)?

Use the URL Inspection tool to learn which page Google considers canonical. Note that even if you explicitly designate a canonical page, Google might choose a different canonical for various reasons, such as performance or content.

Specify a canonical page

There are a few different ways to specify the canonical page among a duplicate set, depending on your usage:

Method	Description
General guidelines	Follow these guidelines for all canonicalization methods.
Specify the preferred domain	Use Search Console to specify URLs on one domain as canonical over their counterparts on another domain. For example, example.com rather than www.example.com. Use this only when you have two similar sites that differ only by subdomain. Don't use this for http/https counterpart sites. Pros: Very easy to implement, manage, and change. Use if you have identical sites on different domains. Cons: Works only at the domain granularity, and the pages must have identical paths and names to be considered duplicates. Enables only a single page-to-page mapping for pages with identical paths and names.
rel=canonical tag	Add a <link> tag in the code for all duplicate pages, pointing to the canonical page. Pros: Can map an infinite number of duplicate pages. Cons: Can add to the size of the page. Can be complex to maintain the mapping on larger sites, or sites where the URLs change often. Only works for HTML pages, not for files such as PDF. In such cases, you can use the rel=canonical HTTP header.
rel=canonical HTTP header	Send a rel=canonical header in your page response. Pros: Doesn't increase page size. Can map an infinite number of duplicate pages. Cons: Can be complex to maintain the mapping on larger sites, or sites where the URLs change often.
Sitemap	Specify your canonical pages in a sitemap. Pros: Easy to do and maintain, especially on large sites. Cons: Googlebot still must determine the associated duplicate for any canonicals that you declare in the sitemap. Less powerful signal to Googlebot than the rel=canonical mapping technique.
301 redirect	Use 301 redirects to tell Googlebot that a redirected URL is a better version than a given URL. Use this only when deprecating a duplicate page.
AMP variant	If one of your variants is an AMP page, you will need to follow the AMP guidelines to indicate the canonical page and AMP variant.

While we encourage you to use any of these methods, none of them are required. If you don't indicate a canonical URL, we'll identify what we think is the best version or URL.

General guidelines

For all canonicalization methods, follow these general guidelines.

Don't use the robots.txt file for canonicalization purposes.
Don't use the URL removal tool for canonicalization: it removes all versions of a URL from search.
Don't specify different URLs as canonical for the same page using the same or different canonicalization techniques (for example, don't specify one URL in a sitemap but a different URL for that same page using rel="canonical").
Don't use noindex as a means to prevent selection of a canonical page. This directive is intended to exclude the page from the index, not to manage the choice of a canonical page.
Do specify a canonical page when using hreflang tags. Specify a canonical page in same language, or the best possible substitute language if a canonical doesn't exist for the same language.

Prefer HTTPS over HTTP for canonical URLs

Google prefers HTTPS pages over equivalent HTTP pages as canonical, except when there are issues or conflicting signals such as the following:

The HTTPS page has an invalid SSL certificate.
The HTTPS page contains insecure dependencies (other than images).
The HTTPS page redirects users to or through an HTTP page.
The HTTPS page has a rel="canonical" link to the HTTP page.

Although our systems prefer HTTPS pages over HTTP pages by default, you can ensure this behavior by taking any of the following actions:

Add redirects from the HTTP page to the HTTPS page.
Add a rel="canonical" link from the HTTP page to the HTTPS page.
Implement HSTS.
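
As a rough illustration of the first and third actions, the following Python sketch decides what a server might send back for a request. The function name `https_policy` and the `max-age` value are assumptions made for this example, not settings recommended by the text above.

```python
def https_policy(scheme: str, host: str, path: str):
    """Return (status, extra_headers) for a request, under the assumptions above."""
    if scheme == "http":
        # Permanently redirect the HTTP page to its HTTPS counterpart.
        return "301 Moved Permanently", [("Location", f"https://{host}{path}")]
    # On HTTPS responses, send an HSTS header so browsers keep using HTTPS.
    # The one-year max-age is an illustrative value.
    return "200 OK", [
        ("Strict-Transport-Security", "max-age=31536000; includeSubDomains")
    ]


print(https_policy("http", "example.com", "/green-dresses"))
print(https_policy("https", "example.com", "/green-dresses"))
```

A real deployment would usually configure this in the web server or hosting platform rather than in application code.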

To prevent Google from incorrectly making the HTTP page canonical, you should avoid the following practices:

Using bad SSL certificates or HTTPS-to-HTTP redirects. These cause us to prefer HTTP very strongly, and implementing HSTS cannot override this preference.
Including the HTTP page in your sitemap or hreflang entries rather than the HTTPS version.
Implementing your SSL/TLS certificate for the wrong host-variant: for example, example.com serving the certificate for www.example.com. The certificate must match your complete site URL, or be a wildcard certificate that can be used for multiple subdomains on a domain.

Tell Google to ignore dynamic parameters

Use Parameter Handling to tell Googlebot about any parameters that should be ignored when crawling. Ignoring certain parameters can reduce duplicate content in Google's index and make your site more crawlable. For example, if you specify that the parameter sessionid should be ignored, Googlebot will consider the following two URLs as duplicates:

https://www.example.com/dresses/green.php?sessionid=273749
https://www.example.com/dresses/green.php
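
To make the idea concrete, here is a minimal Python sketch of how ignoring a parameter collapses the two URLs above into one. It only illustrates the concept; it is not how Googlebot is implemented, and the names `IGNORED_PARAMS` and `strip_ignored_params` are made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters assumed to have no effect on page content (illustrative choice).
IGNORED_PARAMS = {"sessionid"}


def strip_ignored_params(url: str) -> str:
    """Return the URL with every ignored query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


a = strip_ignored_params("https://www.example.com/dresses/green.php?sessionid=273749")
b = strip_ignored_params("https://www.example.com/dresses/green.php")
print(a == b)  # the two URLs reduce to the same string
```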

Specific methods

Choose one of the following methods to specify a canonical URL for duplicate URLs or duplicate/similar pages.

Be sure to follow the general guidelines above for all methods.

Set a preferred domain

Use Search Console to tell Google which version of your site's URL you prefer as canonical for your domain:

https://www.example.com
https://example.com

If you set your preferred domain as https://example.com, Google treats similar URLs or pages on www.example.com as duplicates of pages on example.com.

Read Set your preferred domain for details.

Use rel="canonical" link tag

You can use a <link> tag in the page header to indicate when a page is a duplicate of another page.

Suppose you want https://example.com/dresses/green-dresses to be the canonical URL, even though a variety of URLs can access this content. Indicate this URL as canonical with these steps:

Mark all duplicate pages with a rel="canonical" link element. Add a <link> element with the attribute rel="canonical" to the <head> section of duplicate pages, pointing to the canonical page, like this one: <link rel="canonical" href="https://example.com/dresses/green-dresses" />
If the canonical page has a mobile variant, add a rel="alternate" link to it, pointing to the mobile version of the page.
Add any hreflang annotations or redirects appropriate for the page.

Use absolute paths rather than relative paths with the rel="canonical" link element.

Use this structure: https://www.example.com/dresses/green/greendress.html
Not this structure: /dresses/green/greendress.html
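
If your pages are generated by a template or script, a small helper can build the element and reject relative paths. The following Python sketch is one possible approach; the function name `canonical_link_tag` is hypothetical and not part of any Google tool.

```python
from html import escape
from urllib.parse import urlsplit


def canonical_link_tag(canonical_url: str) -> str:
    """Build a rel="canonical" link element, refusing relative URLs."""
    parts = urlsplit(canonical_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"canonical URL must be absolute: {canonical_url!r}")
    # Escape the URL so characters such as & are valid inside the attribute.
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}" />'


print(canonical_link_tag("https://example.com/dresses/green-dresses"))
```

Calling the helper with /dresses/green/greendress.html raises ValueError, which catches the relative-path mistake before the page is published.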

Use rel="canonical" HTTP header

If you can configure your server, you can use rel="canonical" HTTP headers (rather than HTML tags) to indicate the canonical URL for non-HTML documents such as PDF files.

For example, if you expose a PDF file through multiple URLs, you can return a rel="canonical" HTTP header such as the following for the duplicate URLs to tell Googlebot which URL is canonical for the PDF file:

Link: <http://www.example.com/downloads/white-paper.pdf>; rel="canonical"

Google currently supports this method for web search results only.

Use absolute paths rather than relative paths with the rel="canonical" HTTP header. That is:
Use this structure: http://www.example.com/downloads/white-paper.pdf
Not this structure: /downloads/white-paper.pdf
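
How you send the header depends on your server. As one possible sketch, the minimal Python WSGI application below attaches the header to a PDF response. The name `pdf_app` and the placeholder body are assumptions for illustration; most sites would set the header in their web server configuration instead.

```python
# Canonical URL taken from the example above.
CANONICAL_PDF_URL = "http://www.example.com/downloads/white-paper.pdf"


def pdf_app(environ, start_response):
    """Minimal WSGI app: serve a PDF with a rel="canonical" Link header."""
    body = b"%PDF-1.4 placeholder"  # stand-in bytes, not a real PDF document
    headers = [
        ("Content-Type", "application/pdf"),
        ("Content-Length", str(len(body))),
        # Absolute URL inside angle brackets, followed by the rel parameter.
        ("Link", f'<{CANONICAL_PDF_URL}>; rel="canonical"'),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the app can be passed to `make_server` from Python's standard `wsgiref.simple_server` module.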

Use a sitemap

Pick a canonical URL for each of your pages and submit them in a sitemap. All pages listed in a sitemap are suggested as canonicals; Googlebot will decide which pages (if any) are duplicates, based on similarity of content.

We don't guarantee that we'll consider the sitemap URLs to be canonical, but it is a simple way of defining canonicals for a large site, and sitemaps are a useful way to tell Google which pages you consider most important on your site.

Don't include non-canonical pages in a sitemap. If using a sitemap, specify only canonical URLs in the sitemap.
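
For sites that generate their sitemap programmatically, a sketch along the following lines may help. It assumes you already have a list containing only your canonical URLs; the function name `build_sitemap` is hypothetical, and the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(canonical_urls):
    """Return sitemap XML (as a string) listing only the given canonical URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in canonical_urls:
        # Each canonical page gets one <url><loc>...</loc></url> entry.
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    "https://example.com/dresses/green-dresses",
    "https://example.com/news/koala-rampage",
])
print(sitemap_xml)
```

When saving the result to a file, prepend an XML declaration so the file is a complete XML document.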

Use 301 redirects for retired URLs

Use this method when you want to get rid of existing duplicate pages, but need to ensure a smooth transition before you retire the old URLs.

Suppose your page can be reached in multiple ways:

https://example.com/home
https://home.example.com
https://www.example.com

Pick one of those URLs as your canonical URL and use 301 redirects to send traffic from the other URLs to your preferred URL. A server-side 301 redirect is the best way to ensure that users and search engines are directed to the correct page. The 301 status code means that a page has permanently moved to a new location.

If you are on a website hosting service, do a search for their documentation on setting up 301 redirects.
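
If you control the application code yourself, the redirect logic can be sketched as below. This example assumes https://www.example.com was picked as the canonical URL among the three shown above; the names `CANONICAL`, `DUPLICATES`, and `redirect_for` are illustrative only.

```python
from urllib.parse import urlsplit

# Assumed choice of canonical URL for this example.
CANONICAL = "https://www.example.com/"

# (host, path) pairs of the duplicate URLs that should be retired.
DUPLICATES = {
    ("example.com", "/home"),
    ("home.example.com", ""),
    ("home.example.com", "/"),
}


def redirect_for(url: str):
    """Return (301, target) for a retired duplicate URL, or None otherwise."""
    parts = urlsplit(url)
    if (parts.netloc, parts.path) in DUPLICATES:
        return 301, CANONICAL
    return None


print(redirect_for("https://example.com/home"))
print(redirect_for("https://www.example.com"))
```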

* Source: Google Search Console