At Risk of Being Redundant: Google On Duplicate Content

Leverage Archive

Woah! We've been at this a long time. What was true a year or two ago may not be true today. If you're interested in something a little more current, take a look at our recent blog posts.

A large part of the PPC or SEO analyst's job consists of debunking – explaining away rumors and misreads. For years, we've had to explain a particular twist of semantics that has somehow convinced people that if you have duplicate content on a web site, Google takes out a pen and puts a black mark by your name.

No, Virginia, there is no duplicate content penalty – at least not the way people think.

When faced with multiple pages that look too much alike, Google has to decide what's what – why it's seeing double, and whether there's any malicious intent behind what it's finding. For the most part, the average website doesn't practice malicious copying, nor does it usually scrape content from other sites. What usually happens is that a site ends up with catalogs and parts listings whose pages share 80%-plus duplicate wording, or with 16 different ways to land on the "same page," because that many different search options in the web catalog will bring you to the same exact item. This can cause confusion.

Google’s basic, and hopefully final, word on the subject is simple – when we find a bunch of pages that look really similar, we group them into a “cluster,” then we pick a single URL to represent all pages in that cluster. But then they do something else that the average webmaster probably never even notices.

“We then consolidate properties of the URLs in the cluster, such as link popularity, to the representative URL.”

Notice the absence of any pens, or black marks, or slaps, or shackles or any other form of “punishment.” They just group all “duplicated” pages and consolidate the info under one indexed URL. For the average e-commerce site, this is not a problem.
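Conceptually, that cluster-and-consolidate step looks something like the toy sketch below. This is purely illustrative – it is not Google's actual algorithm, the function names are made up, and a real engine would use near-duplicate fingerprinting (shingling, simhash) rather than an exact hash – but it shows the idea: group look-alike pages, pick one representative URL, and roll the link popularity of the whole cluster onto it.

```python
from collections import defaultdict

def fingerprint(body: str) -> int:
    # Toy fingerprint: hash of the whitespace-normalized text. A real engine
    # would use shingling or simhash to catch *near*-duplicates, not just
    # exact copies.
    return hash(" ".join(body.lower().split()))

def cluster_and_consolidate(pages):
    """pages: list of (url, body, inbound_link_count) tuples."""
    clusters = defaultdict(list)
    for url, body, links in pages:
        clusters[fingerprint(body)].append((url, links))

    index = {}
    for members in clusters.values():
        # Pick one representative URL for the cluster (here: most links in).
        rep_url, _ = max(members, key=lambda m: m[1])
        # Consolidate link popularity from every member onto it.
        index[rep_url] = sum(links for _, links in members)
    return index

pages = [
    ("example.com/item?id=7",         "Red Widget. $9.99", 3),
    ("example.com/catalog/widgets/7", "Red Widget. $9.99", 12),
    ("example.com/item?id=8",         "Blue Widget. $4.99", 1),
]
print(cluster_and_consolidate(pages))
# {'example.com/catalog/widgets/7': 15, 'example.com/item?id=8': 1}
```

Note that nothing is discarded or punished: the two "Red Widget" URLs collapse into one indexed entry, and the links pointing at both are counted together.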

But let’s say you are one of those folks with 854,000 items in a dynamic, database-driven catalog, where every item is shown on a “shell” page that gets populated by the shopper’s query – and, because of the way your catalog is built, the ONLY things that change from page to page are the image file name, the price, the part number, and the name of the item. It sounds like you have roughly 854,000 duplicate pages that will not be indexed individually. If you’re hoping for some massive number of “pages indexed” (for whatever reason), you are quite liable to be disappointed. Until there is enough variance between items – a longer description, or some individualized stats that also populate those pages – you stand very little chance of seeing more pages indexed; you are more likely to see fewer pages indexed as Google compiles its clusters.
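It's easy to see why such pages cluster. The sketch below fakes a shell page (the template text and item names are hypothetical) and measures how similar two "different" items look once the shared boilerplate is counted – this isn't how Google measures it, just a quick way to see the 80%-plus overlap for yourself:

```python
from difflib import SequenceMatcher

# Hypothetical catalog shell: everything here is identical on every page.
TEMPLATE = """Acme Industrial Supply - Your one-stop parts shop.
Free shipping on orders over $50. Call 1-800-555-0100 to order.
Item: {name}  Part #: {part}  Price: {price}
Add to cart | Check out | Returns policy | About us | Contact"""

page_a = TEMPLATE.format(name="Hex Bolt 10mm", part="HB-10", price="$0.40")
page_b = TEMPLATE.format(name="Flange Nut 8mm", part="FN-08", price="$0.25")

# Ratio of matching text between the two rendered pages.
similarity = SequenceMatcher(None, page_a, page_b).ratio()
print(f"{similarity:.0%}")  # well above 80% – the boilerplate dominates
```

Only the item name, part number, and price differ, so the shared shell swamps the unique content – exactly the situation that gets pages folded into one cluster.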

In fact, catalogs that work this way violate Google’s best practices as outlined in their Webmaster Guidelines. Google doesn’t publish all this info for fun – they are trying to help us help ourselves. They say very plainly:

“Don’t create multiple pages, subdomains, or domains with substantially duplicate content.”

If you feel like your site is suffering from this “clustering” of pages that seem to be duplicated, work with your webmaster to rectify the situation and then use Google’s Webmaster Central tools to request a re-evaluation of your domain.

If you just have 16 different search options that all lead back to the same item, don’t worry about it – one of those URLs will be indexed, and that’s all you need. You’ll want to monitor your SERPs so you can see which pages make the cut and learn how Google “sees” you, and keep your site map up to date, but beyond that, most of us have very little to fear from the Duplicate Content Penalty…
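For the curious, collapsing those 16 search paths down to one URL is essentially a canonicalization exercise. The sketch below (parameter names like `id`, `sort`, and `sessionid` are hypothetical – every site's "meaningful" parameter set is different) shows how stripping and sorting query parameters makes many variant URLs resolve to a single canonical one:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Hypothetical: only these query parameters actually change the content.
MEANINGFUL = {"id"}

def canonicalize(url: str) -> str:
    """Drop non-meaningful parameters and sort the rest."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in MEANINGFUL)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

# Three different search paths to the same item:
variants = [
    "https://shop.example.com/item?id=7&sort=price&sessionid=abc",
    "https://shop.example.com/item?sessionid=xyz&id=7",
    "https://shop.example.com/item?id=7&ref=search",
]
print({canonicalize(u) for u in variants})
# {'https://shop.example.com/item?id=7'}
```

All three variants collapse to one URL – which is, in effect, what Google's clustering does for you even when your site doesn't.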