It sounds like an easy question, doesn’t it? While we hear a lot about duplicate content
since the Panda update(s), I’m amazed at how many people are still
confused by a much more fundamental question – which URL for any given
page is the canonical URL? While the idea of a canonical URL is simple
enough, finding it for a large, data-driven site isn’t always so easy.
This post will guide you through the process with some common cases that
I see every week.
Let’s Play Count the Pages
Before we dive in, let’s cover the biggest misunderstanding that people have about “pages” on their websites. When we think of a page, we often think of a physical file containing code (whether it’s static HTML or script, like a PHP file). To a crawler, a page is any unique URL that it finds. One file could theoretically generate thousands of unique URLs, and every one of those is potentially a “page” in Google’s eyes.
It’s easy to smile and nod and all agree that we understand, but let’s put it to the test. In each of the following scenarios, how many pages does Google see?
(A) “Static” Site
- www.example.com/
- www.example.com/store
- www.example.com/about
- www.example.com/contact
(B) PHP-based Site
- www.example.com/index.php
- www.example.com/store.php
- www.example.com/about.php
- www.example.com/contact.php
(C) Single-template Site
- www.example.com/index.php?page=home
- www.example.com/index.php?page=store
- www.example.com/index.php?page=about
- www.example.com/index.php?page=contact
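To make the quiz concrete, here is a rough Python sketch that counts the three example sites both ways: by distinct URL (how the text above says a crawler counts) and by distinct path (a loose stand-in for the "physical file" view). The names `SITES`, `count_pages`, and `count_files` are made up for illustration, and the `http://` scheme is added only so the URLs parse cleanly.

```python
from urllib.parse import urlsplit

# The three example sites from the quiz (scheme added as an assumption).
SITES = {
    "A": [
        "http://www.example.com/",
        "http://www.example.com/store",
        "http://www.example.com/about",
        "http://www.example.com/contact",
    ],
    "B": [
        "http://www.example.com/index.php",
        "http://www.example.com/store.php",
        "http://www.example.com/about.php",
        "http://www.example.com/contact.php",
    ],
    "C": [
        "http://www.example.com/index.php?page=home",
        "http://www.example.com/index.php?page=store",
        "http://www.example.com/index.php?page=about",
        "http://www.example.com/index.php?page=contact",
    ],
}


def count_pages(urls):
    """Pages as a crawler counts them: one per distinct URL."""
    return len(set(urls))


def count_files(urls):
    """Distinct paths, ignoring the query string (a rough proxy for files)."""
    return len({urlsplit(u).path for u in urls})


for name, urls in SITES.items():
    print(name, count_pages(urls), count_files(urls))
```

Site C is the interesting one: a single `index.php` path still yields four distinct URLs, so by the definition above it counts as four pages.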
Let’s dive right into a few examples. Please note: these are just examples. I’m not recommending any of the URL structures in this post as ideal – I’m just trying to help you determine the correct canonical URL for any given situation.
Case 1: Tracking URLs
I’ll start with an easy one. Many sites still use URL parameters to track visitor sessions or links from affiliates. No matter what the parameter is called or which purpose it’s used for, it creates a duplicate for every individual visitor or affiliate. Here are a few examples:
- www.example.com/store.php?session=1234
- www.example.com/store.php?affiliate=5678
- www.example.com/store.php?product=1234&affiliate=5678
Strip the tracking parameters and these collapse to just two canonical URLs:
- www.example.com/store.php
- www.example.com/store.php?product=1234
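One way to work out the canonical URL for a tracked link is to drop the tracking parameters and keep everything else. The sketch below does that with Python's standard `urllib.parse`. The parameter names in `TRACKING_PARAMS` come from the examples in this case and would differ on a real site, and `strip_tracking` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these are the only tracking parameters the site uses.
TRACKING_PARAMS = {"session", "affiliate"}


def strip_tracking(url):
    """Return the URL with tracking parameters removed, other parameters kept in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking("http://www.example.com/store.php?session=1234"))
print(strip_tracking("http://www.example.com/store.php?product=1234&affiliate=5678"))
```

The `product` parameter survives because it changes the content of the page, while `session` and `affiliate` only identify who arrived.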
Case 2: “Dynamic” URLs
Unfortunately, the word “dynamic” gets thrown around a little too freely – for the purposes of this blog post, I mean any URLs that pass variables to generate unique content. Those variables could look like traditional URL parameters or be embedded as “folders”.
Blog post URLs are a good example of the kind of URLs I’m talking about. Take these:
- www.example.com/blog/1234
- www.example.com/blog.php?id=1234
- www.example.com/blog.php?id=1234&comments=on
- www.example.com/blog/20120626
- www.example.com/blog/this-is-a-blog-post
Case 3: The Home-page
It might seem strange to put the home page third, but the truth is that the first two cases were probably easier. Part of the problem is that home pages naturally spin out a lot of variations:
- www.example.com
- www.example.com/
- www.example.com/default.html
- www.example.com/index.php
- www.example.com/index.php?page=about
In most cases, the technically correct home-page URL is:
- http://www.example.com/
Another common exception is if your site automatically redirects to another version of the home-page – ASP is notorious for this, and often lands visitors and bots at “index.aspx” or a similar page. While that situation isn’t ideal, you don’t want to cross signals. If the redirect is necessary, then the target of that redirect (i.e. the “index.aspx” URL) should be your canonical URL.
Finally, be very careful about URL #5 – in that case, as I discussed in the first section of this post, the “index.php” code template is actually driving other pages with unique content. Canonicalizing that to the root or to “index.php” could collapse your site to one page in the Google index. That particular scenario is rare these days, but some CMS platforms still use it.
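The home-page rules above can be sketched as a small check: treat the bare variants as the home page, but leave anything with a query string alone so that a template URL like “index.php?page=about” is never folded into the root. The names `CANONICAL_HOME`, `HOME_PATHS`, and `canonical_home` are hypothetical, and the path list is an assumption drawn from the examples in this case. On a site that redirects to “index.aspx”, `CANONICAL_HOME` would be that redirect target instead.

```python
from urllib.parse import urlsplit

# Assumption: the root URL is the canonical home page for this site.
CANONICAL_HOME = "http://www.example.com/"

# Assumption: paths that serve the same home-page content on this site.
HOME_PATHS = {"", "/", "/default.html", "/index.php"}


def canonical_home(url):
    """Return the canonical home URL for a home-page variant, or None otherwise."""
    parts = urlsplit(url)
    # A query string may select different content from the same template,
    # so any URL that carries one is left for a separate decision.
    if parts.path in HOME_PATHS and not parts.query:
        return CANONICAL_HOME
    return None


print(canonical_home("http://www.example.com/index.php"))
print(canonical_home("http://www.example.com/index.php?page=about"))
```

Returning `None` rather than guessing is deliberate here: the risk described above is collapsing real pages into one, so the sketch only canonicalizes URLs it is sure about.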
Case 4: Product Pages
In some ways, product pages are a lot like the blog-post pages in Case #2, except more so. You can naturally end up with a lot of variations on an e-commerce site, including:
- www.example.com/store.php?id=1234
- www.example.com/store/1234
- www.example.com/store/this-is-a-product
- www.example.com/store.php?id=1234&currency=us
- www.example.com/store/1234/red
- www.example.com/store/1234/large
URLs #4-#6 are a bit trickier. Something like the currency selector in #4 can be very complicated and depends on how those selections are implemented (user selection vs. IP-based geo-location, for example). For Google’s purposes, you would typically want it to see the dominant price for the site’s audience, and you would canonicalize to the main product URL (#1-#3, depending on the site architecture). Indexing every price variant, unless you have multiple domains, is just going to make your content look thinner.
With #5 and #6, the URL indicates a product variant, let’s say a T-shirt that comes in different colors and sizes. This situation depends a lot on the structure and scope of the content. Technically, your T-shirt in red/large is unique, and yet that page could look “thin” in Google’s eyes. If you have a variant or two for a handful of products, it’s no big deal. If every product has 50 possible combinations, then I think you need to seriously consider canonicalization.
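As a rough illustration of those two decisions, the sketch below always drops the currency selector and collapses a variant URL to its parent product only when asked to. It assumes the parameter is literally named `currency` and that variants appear as a third path segment under `/store/`, as in the examples. `product_canonical` and `collapse_variants` are hypothetical names, and whether to collapse variants remains a judgment call for each site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def product_canonical(url, collapse_variants=True):
    """Return a canonical product URL under the assumptions stated above."""
    parts = urlsplit(url)

    # Drop the currency selector; keep every other parameter in order.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != "currency"]

    # Optionally collapse /store/<id>/<variant> to /store/<id>.
    path = parts.path
    segments = path.strip("/").split("/")
    if collapse_variants and len(segments) == 3 and segments[0] == "store":
        path = "/" + "/".join(segments[:2])

    return urlunsplit(parts._replace(path=path, query=urlencode(kept)))


print(product_canonical("http://www.example.com/store.php?id=1234&currency=us"))
print(product_canonical("http://www.example.com/store/1234/red"))
print(product_canonical("http://www.example.com/store/1234/red", collapse_variants=False))
```

A store with only a few variants might pass `collapse_variants=False` and let those pages stand on their own, while one with dozens of combinations per product would likely leave it on.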
Case 5: Search Pages
Now, the ugliest case of them all – internal search pages. This is a double-edged sword, since Google isn’t a fan of search-within-search (their results landing on your results) in general and these pages tend to spin out of control. Here are some examples:
- www.example.com/search.php?topic=1234
- www.example.com/search/this-is-a-topic
- www.example.com/topic
- www.example.com/search.php?topic=1234&page=2
- www.example.com/search.php?topic=1234&page=2&sort=desc
- www.example.com/search.php?topic=1234&page=2&filter=price
In cases like #5-#6, Google recommends you use rel=prev/next for the pagination but then a canonical tag for the “&page=2” version (to collapse the sorts and filters). Implementing this properly is very complicated and well beyond the scope of this post, but the main point is that you should not canonicalize all of your search pages to page 1. Adam Audette has an excellent post on pagination that demonstrates just how tricky this topic is.
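The canonical half of that advice can be sketched as follows: drop the sort and filter parameters but keep the page number, so that each paginated URL canonicalizes to its own page rather than to page 1. The parameter names in `COLLAPSE_PARAMS` come from the examples above, `search_canonical` is a hypothetical helper, and the rel=prev/next side is not covered here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters only reorder or narrow the same result set.
COLLAPSE_PARAMS = {"sort", "filter"}


def search_canonical(url):
    """Return the search URL without sort/filter parameters, keeping topic and page."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in COLLAPSE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(search_canonical("http://www.example.com/search.php?topic=1234&page=2&sort=desc"))
print(search_canonical("http://www.example.com/search.php?topic=1234&page=2&filter=price"))
```

Both example URLs come out as the plain “&page=2” version, which matches the point above about not pointing every search page at page 1.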