Re: bots

Subject: Re: bots
From: Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx>
Date: Wed, 10 May 2006 17:05:41 +0100
Newsgroups: alt.internet.search-engines
Organization: schestowitz.com / MCC / Manchester University
References: <4460f79c$0$541$ed2619ec@ptn-nntp-reader01.plus.net> <_W78g.439472$7i1.117514@fe06.news.easynews.com> <5680887.ZiombDuED8@schestowitz.com> <dWn8g.449765$7i1.304461@fe06.news.easynews.com>
Reply-to: newsgroups@xxxxxxxxxxxxxxx
User-agent: KNode/0.7.2

__/ [ www.1-script.com ] on Wednesday 10 May 2006 16:47 \__

> Roy Schestowitz wrote:
> 
> 
>>> Google messed up their index cache during the last big update and
>>> now needs to catch up with Y! and Ask before users notice something
>>> bad happened. That's my Google theory. Due to Google's secrecy the
>>> number of theories out there almost equals the number of people trying
>>> to crack that problem, googlers themselves included.
> 
> 
>> I have not heard this theory before. I haven't noticed any degradation
>> in terms of search results either. Suggesting that Google have fallen
>> behind is something that would make big headlines (same with studies that
>> argue
> 
> Well, lucky you, my friend, lucky you! As for the rest of us (see threads
> on Webmasterworld such as this:
> http://www.webmasterworld.com/forum30/34228.htm or this:
> http://www.webmasterworld.com/forum30/34061.htm ) funny things are
> happening indeed. People (myself included) report massive page drop-outs,
> on a scale of 90-99% of the site being gone. Like I said, there are plenty
> of theories why but there is no question about the fact that something
> (bad) is going on.


That's quite a shocker. I noticed a large-scale change on my site around the
19/20th of April (positive change if that matters), but it reached an end
last week, for no apparent reason.


>> Google lost a top position), so it's probably just wishful thinking.
> 
> Well, for better or worse they have ALREADY lost their top position on
> my sites! Yahoo has almost replaced the traffic that I lost from Google,
> which is the only reason I tolerate the exorbitant Yahoo Slurp! slurping
> rate ;-) 9.70GB on a single site since May 1st, 2006


Ouch! I believe you have your own dedicated server, fortunately. Maybe you
should get another one and spray red "Y!" over it. Sorry, I know it's no
place for sarcasm...


>> In operation, I am sure that they take into consideration all such risks
>> and replicate the data as required. Even "Big Daddy" seems to have been
>> corrected/re-aligned.
> 
> Replicate? Maybe, maybe not. It depends on whether we can trust their own
> words about running out of capacity (earlier thread here). For reliable
> replication you need twice the amount of storage which they don't seem to
> have enough even for the original data. Besides, have you ever had a
> database with the index(es) messed up? It could be pretty frustrating
> indeed. You know that the data is there but you cannot get to it (fast
> enough), which to the outside would look just like you simply lost that
> data.


Replication can be done more efficiently than that. Since much of the content
(that you care about) is textual, one could compress content and set it
aside. Compression algorithms can reduce natural text to about 10-20% of its
original size. I don't know how large their indices are (compared with full
text, i.e. Google Cache), but dumping of that data certainly does not depend
on the way it's stored/structured. If they don't back up their data and send
it to a remote location, they play a very risky game. I'm assuming that the
datacentres serve them as some arrays of redundancy /already/.
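
To put a number on the compression point, here is a minimal sketch using
Python's standard zlib module. The sample text and the function name are my
own, purely for illustration; the ratio you get depends heavily on the corpus
and on the compressor, so treat the 10-20% figure as something to measure on
your own data rather than a guarantee.

```python
import zlib

# Hypothetical sample text, standing in for a crawled page.
SAMPLE = (
    "Replication can be done more efficiently than keeping a second full "
    "copy of everything. Since much of the content is textual, one could "
    "compress the content and set it aside. How far the text shrinks "
    "depends on the corpus and on the compressor, so measure it on your "
    "own pages before planning storage for a backup copy of the index "
    "and of the cached pages."
)


def compression_ratio(text: str, level: int = 9) -> float:
    """Return compressed size divided by original size (lower is better)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot compute a ratio for empty input")
    return len(zlib.compress(raw, level)) / len(raw)


if __name__ == "__main__":
    ratio = compression_ratio(SAMPLE)
    print(f"compressed to {ratio:.0%} of the original size")
```

Short inputs like this one compress worse than a large archive would, since
the compressor has less repetition to work with.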


>>> I hope that pretty much covers most of it. Oh yeah, and there are always
>>> rogue bots out there, of course, trying your site for all kinds of
>>> exploits, so keep your shields up!
> 
> 
>> Shields up? I am not sure about exclusions.
> 
> Well, not by putting everything into the robots.txt file, of course.
> As a matter of fact, to keep ratbots (good term, I'll use it) on a
> leash DO NOT put anything sensitive into the robots.txt. This is the first
> thing they check: what they are not supposed to access, and then they
> surely try to access it to see what kind of informative errors they can
> generate.


I fully agree.
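
To illustrate why: robots.txt is world-readable, so anything listed under
Disallow is effectively a published list of places to probe. A rough sketch
of how little effort that takes (the file contents and paths below are made
up for the example, and the parsing is deliberately simplistic):

```python
from typing import List

# Hypothetical robots.txt; the paths are invented for illustration.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private-reports/   # this line advertises the directory
Disallow:
"""


def disallowed_paths(robots_txt: str) -> List[str]:
    """Collect every non-empty Disallow value, ignoring comments."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    for path in disallowed_paths(EXAMPLE_ROBOTS_TXT):
        print("a rogue bot would try:", path)
```

Anything that really must stay private is better protected on the server
side (authentication, access rules) than by a line in robots.txt.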

Best wishes,

Roy

-- 
Roy S. Schestowitz      |    "Have you compiled your kernel today?"
http://Schestowitz.com  |  Open Prospects   ¦     PGP-Key: 0x74572E8E
  4:55pm  up 12 days 23:52,  8 users,  load average: 0.73, 0.49, 0.52
      http://iuron.com - knowledge engine, not a search engine

Follow-Ups:
- Re: bots
  - From: www.1-script.com

References:
- Re: bots
  - From: Roy Schestowitz
