New Solution to an Old SEO Problem: Duplicate Content

Chances are your website has duplicate content. This is a common SEO issue affecting many sites and can have a detrimental impact on search engine rankings.

The good news is that this problem is now easy to fix, thanks to the rel="canonical" tag introduced by the major search engines.

This article explains how to apply this tag to your website so you avoid problems associated with duplicate content.

First though, just what is duplicate content? And should you care?

Google defines duplicate content as “substantive blocks of content within or across domains that either completely match other content or are appreciably similar.” Whilst there are unscrupulous webmasters who steal and scrape content from other sites, Google recognises that most duplicate content is not deceptive or deliberate.

Many sites inadvertently contain duplicate content simply because pages within the site can be found under multiple distinct URLs. Common examples of this are forums, blogs & ecommerce catalogue websites.

With blogs the same content can commonly be found with different URLs under different categories & archives. Ecommerce catalogue websites also frequently present the same content under different categories. And this is further compounded if session IDs, tracking IDs, affiliate IDs or other parameters are used.

For a long time there was a common misconception in the SEO community that Google penalises sites for containing duplicate content. However, last year Google advised that it recognises many sites contain duplicate content and that it doesn’t penalise them for this. (The exception is when duplicate content is used to manipulate rankings and deceive search engine users.)

What Does This Mean For Your Website?

So, unless you’re operating on the dark side and scraping content from other websites, does this mean you no longer need to worry about duplicate content within your website? Unfortunately the answer is NO.

Whilst there aren’t duplicate content penalties per se, duplicate content can still hurt how your site ranks in Google and other search engines.

The reason is that search engines don’t want to include duplicate versions of a page in their search results, because this undermines the credibility of those results. So when a site contains the same content under multiple URLs, search engines will choose one version to list and ignore the rest. The page they choose is called the “canonical” page. A canonical page is simply the preferred version of a set of pages with highly similar content.

Google usually does a pretty good job of picking the right page, but it doesn’t always get it right and may end up listing a different version of the page from the one you’d prefer. Even when it does get it right, the major problem with duplicate content is that you lose the benefit of link “juice” from sites that link to duplicate versions of the page rather than the canonical URL. (Remember, links play a major role in search engine rankings, and because search engines ignore duplicate versions of pages, the link juice pointing to those duplicates is lost. As a result, your search engine rankings can suffer.)

If you have a large site with lots of duplicate pages, a further problem is that your website may not be fully indexed, because search engines spend too much time crawling and sorting duplicate pages.

Traditionally, dealing with duplicate content has involved using the robots.txt file and noindex meta tags to block duplicate pages from being indexed, plus 301 redirects. Doing this can be tricky and, if you get it wrong, can have disastrous results.

Canonical Link Tag

Fortunately the major search engines have made managing duplicate content a whole lot easier for webmasters through the recent introduction of the rel="canonical" link tag. This tag lets you tell search engines which version of each page is the canonical one you want them to index.

To do this, you simply add the tag to the <head> section of each duplicate page you don’t want indexed. The tag has the format below; in the href attribute you specify the URL you want search engines to treat as the preferred, or canonical, version of the page.

<link rel="canonical" href="http://www.mysite.co.nz/realpageurl.html" />

The cool thing about this tag is that not only does it help prevent search engines from indexing the wrong version of a page but, even better, it lets links pointing to duplicate pages be attributed to the canonical version of the page. So even if other sites are pointing to duplicate versions of a page, you’ll still benefit from this link juice.

Here is an example of this being used:

A site selling clothing may have multiple pages listing an identical set of shirts, with the only difference being the order in which items are listed. For example, one page may display shirts sorted by style, while another displays the same shirts sorted by price.

http://www.example.co.nz/product.php?item=shirts&sort=style

http://www.example.co.nz/product.php?item=shirts&sort=price

Apart from sort order, the above pages are identical. To specify http://www.example.co.nz/product.php?item=shirts as the canonical page, add the following to the <head> section of the non-canonical pages above:

<link rel="canonical" href="http://www.example.co.nz/product.php?item=shirts" />
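On a catalogue site with many parameterised URLs, you would normally generate this tag in the page template rather than type it by hand. Here is a minimal Python sketch of that idea. The function names and the IGNORED_PARAMS list are assumptions made up for this illustration (they are not part of any search engine's specification), and you would need to adjust the list to match the parameters your own site uses.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list: parameters that only change presentation or tracking,
# not the content of the page. Adjust for your own site.
IGNORED_PARAMS = {"sort", "sessionid", "affid"}


def canonical_url(url):
    """Return the URL with presentation/tracking parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    # Rebuild the URL without the ignored parameters and without a fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url):
    """Build the <link> tag to place in the <head> of the duplicate page."""
    href = escape(canonical_url(url), quote=True)
    return '<link rel="canonical" href="%s" />' % href
```

Calling canonical_link_tag() with either of the two shirt URLs above produces the same tag, which is the point: every sorted variant names one preferred page.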

The tag can also be used if pages are available in both secure & non-secure versions (i.e. https & http). By using rel="canonical" you can specify one as the canonical version of the page.

For example, say you want the non-secure version, http://www.mysite.co.nz/widgets, to be canonical. To do this, add <link rel="canonical" href="http://www.mysite.co.nz/widgets" /> to the <head> section of the secure version of the page (i.e. https://www.mysite.co.nz/widgets).
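Once the tag is in place, it is worth confirming that the page really does carry it. The sketch below uses Python's standard html.parser module to pull the canonical URL out of a page's HTML. The class and function names are hypothetical, and fetching the page from your server is left out to keep the example short.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, so split before checking.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.hrefs.append(attributes["href"].strip())


def find_canonical(html_text):
    """Return a list of canonical URLs declared in the given HTML."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.hrefs
```

An empty list means the tag is missing, and a list with more than one entry means the page is sending mixed signals, so both cases deserve a second look.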

Not A Complete Fix

This method of managing duplicate content is something all webmasters should be able to implement. If you do use it, bear in mind that the tag is a suggestion rather than a directive, and search engines will ignore it if the canonical page:

  • Has not yet been indexed
  • Is itself a duplicate of another page also specified as canonical
  • Returns a 404 error.

If any of these situations apply, the search engines will revert to choosing a canonical version of the page themselves, just as they would if you hadn’t used the rel="canonical" tag.
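Two of the situations above can be checked from your own crawl data before the search engines find them. The following is a rough sketch only: it assumes you already have a dictionary mapping each URL to its HTTP status and declared canonical URL, and both that data format and the function name are invented for this illustration. Whether a page has been indexed yet can't be determined this way, so that check is not covered.

```python
def audit_canonicals(pages):
    """Flag canonical tags that search engines are likely to ignore.

    pages is hypothetical crawl data in the form:
        {url: {"status": 200, "canonical": "http://..." or None}}
    """
    problems = {}
    for url, info in pages.items():
        target = info.get("canonical")
        if not target or target == url:
            continue  # no tag, or the page names itself as canonical
        target_info = pages.get(target)
        if target_info is None:
            problems[url] = "canonical target was not crawled"
        elif target_info["status"] == 404:
            problems[url] = "canonical target returns 404"
        elif target_info.get("canonical") not in (None, target):
            # The target names a different page as canonical.
            problems[url] = "canonical target points to yet another page"
    return problems
```

Any URL that appears in the result is a page whose canonical tag should be corrected to point directly at a live, final version of the page.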

Spammers Beware

Finally, devious webmasters may be thinking they can use this tag to redirect link juice from less important pages to more important pages that aren’t duplicates, whilst still allowing the less important pages to be found by visitors. We advise you not to try this. The canonical tag is only supposed to be used for pages that actually are duplicates. If you abuse it and apply the tag to pages that aren’t duplicates, you run a real risk of getting your site penalised.

So long as you use this tag correctly, it should make managing duplicate content issues much easier than before. For more information from Google about canonicalization, check out http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=139394

About the Author Mark Sceats

Mark is a Partner and Senior Consultant at SureFire which he founded back in 2002. Prior to establishing SureFire he worked for KPMG Consulting. Today Mark heads up SEO, embracing the challenges that can come with complex website implementations. Outside of work, his interests beyond his family are running, snowsports, diving and fishing (badly).
