Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx> wrote:
> __/ On Friday 26 August 2005 09:27, [John Bokma] wrote : \__
>> Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx> wrote:
>>> ___/ On Friday 26 August 2005 08:53, [Mikkel Møldrup-Lakjer] wrote :
>>>> "Roy Schestowitz" <newsgroups@xxxxxxxxxxxxxxx> skrev i en
>>>> meddelelse news:demhct$1438$1@xxxxxxxxxxxxxxxxxxxx
>>>>> It sure gets copied rather quickly. Here is an example I found by
>>>>> doing some
>>>> At least the first one mentions its sources, something which seems
>>>> to have been left out on the other two. I think most pages that use
>>>> the Factbook don't like to mention their source. Now why would that
>>> An agent would come home and they would... *ahem*... 'disappear'. I
>>> bet you were referring to the lack of reliability though... not to
>>> mention the lack of intellect or creativity.
>>> This shows the need for a penalty on duplicates. But how can this be
>>> done automatically if the copier does not even acknowledge (link to)
>>> the source?! It's a hit-or-miss situation.
>> Nah, I am sure there are ways to see if (part of) page A is also on
>> page B.
> Sorry, but I must disagree. Let us say that T is the original page and
> F (false) is the copy.
> If F = T + A, where A is some extra content, then you have problems:
> an exact match will no longer flag F as a copy of T.
Not really: you can define similarity based on sentences, words, etc.
You don't have to look for exact matches; similar is close enough.
I am sure there has already been a lot of research done on this, for
example on detecting students who copy papers written by others.
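
To make this concrete, here is a rough sketch in Python of the
shingling idea (my own toy illustration, not what any search engine
actually runs). The containment score addresses your F = T + A case:
it stays high even when the copier pads the page with extra content.

# Toy fuzzy duplicate detection via word shingling (illustrative only).

def shingles(text, w=4):
    """Set of overlapping w-word windows (shingles) from the text."""
    words = text.lower().split()
    return set(tuple(words[i:i + w])
               for i in range(max(len(words) - w + 1, 0)))

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; 1.0 means identical sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def containment(orig, cand):
    """Fraction of the original's shingles that appear in the candidate.
    Stays at 1.0 for F = T + A: a verbatim copy padded with extras."""
    return len(orig & cand) / len(orig) if orig else 1.0

# Made-up example text, not a real Factbook entry.
T = "the factbook entry describes the history people and economy of x"
F = "my own introduction here " + T + " and my own closing remarks"

s_t, s_f = shingles(T), shingles(F)
print(jaccard(s_t, s_f))      # well below 1.0: F has extra shingles
print(containment(s_t, s_f))  # 1.0: every shingle of T is inside F

As far as I know, real systems hash the shingles into compact
fingerprints so they can compare millions of pages, but the principle
is the same.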
> To a black-hat SEO it would be no problem to automate this and deceive
> the search engines. It is much easier to carry out a robbery than it
> is for the police to spot the crook in a town of millions.
You don't do exact matches in cases like this, just fuzzy matches.
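
If you want to try this at home: Python's standard difflib module
ships a ready-made fuzzy matcher (the strings below are made up, of
course):

from difflib import SequenceMatcher

a = "It sure gets copied rather quickly."
b = "Some intro text. It sure gets copied rather quickly. And more."

# ratio() gives a similarity score between 0 and 1; it stays high here
# even though the two strings are not an exact match.
print(SequenceMatcher(None, a, b).ratio())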
John

Perl SEO tools: http://johnbokma.com/perl/
Experienced (web) developer: http://castleamber.com/
Get an SEO report of your site for just 100 USD: