Re: google dynamic pages duplicate content


Subject: Re: google dynamic pages duplicate content
From: Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx>
Date: Thu, 27 Apr 2006 16:05:18 +0100
Newsgroups: alt.internet.search-engines
Organization: schestowitz.com / MCC / Manchester University
References: <8ig1529npm7kg8hanfcpd8siuq0p45htgg@4ax.com> <114625576.VUFqIfeuqm@schestowitz.com> <tnj152tqionv44au0uio9s7er40q5kijki@4ax.com>
Reply-to: newsgroups@xxxxxxxxxxxxxxx
User-agent: KNode/0.7.2

__/ [ hug ] on Thursday 27 April 2006 15:05 \__

> Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx> wrote:
> 
>>__/ [ hug ] on Thursday 27 April 2006 14:10 \__
>>
>>> Given a site where all pages have dynamic content, how does one
>>> prevent google from deciding that the same content with different
>>> session id's is duplicate?
>>
>>Setting aside the issue associated with incoming links (try to prevent
>>in-line session id's), why not use XML site maps? You only mentioned
>>Google. Various guidelines to Webmasters would probably be more helpful.
>>
>>Speaking of XML site maps, Matt Cutts announced today/last night that
>>through XML site maps, Webmasters can now get notification of penalties:
>>
>>        http://www.mattcutts.com/blog/notifying-webmasters-of-penalties/
>>
>>Best wishes,
>>
>>Roy
> 
> Hi, Roy.  I had thought about a site-map, the site-map that I'm
> currently using is automatically generated and it shouldn't be that
> tough to generate an xml version (presumably, I know nada about xml).
> 
> In the google webmaster page it said that a robots.txt should be used
> to keep google from looking at anything with a session id.  I'm not
> quite sure how to address that, it seems conflicting since all my
> pages are dynamic.  I guess I should specify that nothing with a .PHP
> extension should be looked at, does that make sense?

Borek recently reminded me that Google will honour wildcards in robots.txt.
This is not something that was specified in the accepted protocol/standard
as far as I can tell. In fact, wildcards in robots.txt are officially
discouraged, unless you count incomplete directory levels as wildcards (e.g.
"Disallow: /projects" also excludes "/projects/index.htm" and
"/projects/widgets/").
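
To illustrate the prefix behaviour described above, here is a minimal
sketch using Python's standard urllib.robotparser module, which follows
the original prefix-matching convention. The rule and the paths are the
examples from this post; "/other/page.htm" is a made-up path added for
contrast.

```python
from urllib.robotparser import RobotFileParser

# A rule set with a single "incomplete directory level" as in the post.
rules = [
    "User-agent: *",
    "Disallow: /projects",
]

parser = RobotFileParser()
parser.parse(rules)

# Plain prefix matching: anything starting with /projects is excluded.
blocked_file = not parser.can_fetch("*", "/projects/index.htm")
blocked_dir = not parser.can_fetch("*", "/projects/widgets/")

# Hypothetical path outside the prefix, which stays crawlable.
allowed_other = parser.can_fetch("*", "/other/page.htm")

print(blocked_file, blocked_dir, allowed_other)
```

Note that this is prefix matching only. No wildcard characters are
involved.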

Google are said to have made the exception. This means that you can specify a
pattern which matches URLs with session IDs, and _ONLY_ those URLs.
How you can do it will depend on your site/CMS.
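
As a rough sketch of what such a pattern could look like, the snippet
below imitates Google-style matching, where "*" stands for any run of
characters and a trailing "$" anchors the end of the URL. The parameter
name PHPSESSID and the example URLs are assumptions (your site/CMS may
name its session parameter differently), and google_style_match is a
hypothetical helper written for illustration, not Google's own code.

```python
import re

# Hypothetical robots.txt fragment this sketch has in mind:
#   User-agent: Googlebot
#   Disallow: /*PHPSESSID=
RULE = "/*PHPSESSID="  # assumed session parameter name


def google_style_match(pattern: str, path: str) -> bool:
    """Return True if path matches pattern, treating '*' as a wildcard
    and a trailing '$' as an end-of-URL anchor."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, like a robots.txt path rule.
    return re.match(regex, path) is not None


# Made-up URLs: one with a session id, one without.
with_session = google_style_match(RULE, "/catalog.php?item=3&PHPSESSID=abc123")
without_session = google_style_match(RULE, "/catalog.php?item=3")

print(with_session, without_session)
```

With a rule like this, the .php pages themselves stay crawlable and only
the session-id variants are excluded, which avoids having to disallow
everything with a .PHP extension.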

Best wishes,

Roy

-- 
Roy S. Schestowitz      |    "Did anyone see my lost carrier?"
http://Schestowitz.com  |    SuSE Linux     ¦     PGP-Key: 0x74572E8E
  4:00pm  up 5 days  1:11,  9 users,  load average: 1.55, 0.95, 0.70
      http://iuron.com - help build a non-profit search engine
