Re: Google's own robots.txt

Subject: Re: Google's own robots.txt
From: Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx>
Date: Tue, 02 May 2006 20:00:59 +0100
Newsgroups: alt.internet.search-engines
Organization: schestowitz.com / MCC / Manchester University
References: <v6L5g.269424$117.165692@fe02.news.easynews.com> <2673965.T0m08ACkji@schestowitz.com> <A_M5g.218439$zy2.106093@fe08.news.easynews.com>
Reply-to: newsgroups@xxxxxxxxxxxxxxx
User-agent: KNode/0.7.2

__/ [ www.1-script.com ] on Tuesday 02 May 2006 18:51 \__

> Roy Schestowitz wrote:
> 
>>
>> It's in the specifications as far as I know. It always has been (at least
>> ever since I became familiar with robots.txt). Maybe it is used as an
>> indication/emphasis of availability or somewhat of a directive like XML
>> sitemaps.
>>
> 
> Well, this is not a part of the robots exclusion protocol unless Google
> thinks they can make their own standards now (every big company thinks so,
> so not a big surprise). That's why it's called Robots *Exclusion*
> protocol: anything that's not specifically excluded is included! Hence no
> need for "Allow" - if it's not disallowed, then it's assumed allowed.
> (more at http://www.robotstxt.org/wc/exclusion.html )
> 
> I agree, it would be handy to disallow a wildcard section of a site and
> then only allow certain resources, but until today I did not think it was
> possible.
> 
> So, is it a novel Google standard, or did I miss the last 10 years of
> robots.txt standard development?
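
To make the exclusion-only point concrete, here is a minimal Python sketch of the rule as described above: Disallow values are treated as plain path prefixes, and anything that no prefix matches is implicitly allowed. The function name and the sample paths are made up for illustration, and this is not a full robots.txt parser (no User-agent handling, no fetching).

```python
def is_allowed(path, disallow_prefixes):
    """Exclusion-only check: a path is blocked only when it starts with
    one of the Disallow prefixes. An empty Disallow value blocks nothing."""
    return not any(prefix and path.startswith(prefix)
                   for prefix in disallow_prefixes)


if __name__ == "__main__":
    # Hypothetical rules, equivalent to:
    #   Disallow: /cgi-bin/
    #   Disallow: /rss/
    rules = ["/cgi-bin/", "/rss/"]
    for path in ("/rss/feed.xml", "/index.html", "/rssfeed.html"):
        print(path, "allowed" if is_allowed(path, rules) else "blocked")
```

Note that there is nothing for an "Allow" line to do here: a path is either covered by a Disallow prefix or it is not.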

I appreciate your correction. I never realised that this departed from the
widely accepted protocol, which is _exactly_ what yet another monopoly has
been doing for many years, and getting away with (HTML, peer connectivity,
codecs and so forth). Fortunately, much of this has been reverse-engineered
and made Open Source/GPL'd.

While robots.txt violations are the topic du jour, did you know that Google
also supports wildcards in robots.txt? One of the strict rules in any
robots.txt FAQ and HOWTO is that wildcards are *not* allowed (unless you
call "Disallow: /rss/" a hierarchy-based wildcard). I once complained about
Google sitemaps, for similar reasons...

http://schestowitz.com/Weblog/archives/2005/07/31/rss-sitemaps/
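
For anyone curious what the extended behaviour amounts to, below is a rough Python sketch of matching with "Allow" lines and "*" wildcards. It rests on assumptions that are not part of the original exclusion standard: that "*" matches any run of characters, that a trailing "$" anchors the end of the path, and that the longest matching pattern wins, with ties going to Allow. The function names and sample rules are hypothetical, and this is not a description of how any particular crawler is implemented.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt-style pattern into a compiled regex.
    Assumption: '*' matches any characters, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed_extended(path, rules):
    """rules is a list of (directive, pattern) pairs, where directive is
    'allow' or 'disallow'. The longest matching pattern wins, ties go to
    'allow', and a path that matches nothing is allowed."""
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


if __name__ == "__main__":
    # Hypothetical rules: block a whole section, re-allow its index pages,
    # and block every URL ending in .pdf.
    rules = [
        ("disallow", "/archive/*"),
        ("allow", "/archive/*/index.html"),
        ("disallow", "/*.pdf$"),
    ]
    for path in ("/archive/2005/post.html", "/archive/2005/index.html",
                 "/docs/paper.pdf", "/index.html"):
        verdict = "allowed" if is_allowed_extended(path, rules) else "blocked"
        print(path, verdict)
```

A crawler that only implements the original prefix rules would read "/archive/*" as a literal path containing an asterisk, so the same file can mean different things to different robots.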

Best wishes,

Roy

-- 
Roy S. Schestowitz      |    "I blame God for making me an atheist"
http://Schestowitz.com  |  Open Prospects   ¦     PGP-Key: 0x74572E8E
  7:55pm  up 5 days  2:52,  11 users,  load average: 0.16, 0.60, 0.54
      http://iuron.com - knowledge engine, not a search engine
