Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx> wrote:
> __/ On Friday 26 August 2005 11:29, [John Bokma] wrote : \__
>>> Sorry, but I must disagree. Let us say that T is the original page
>>> and F (false) is the copy.
>>> If F = T + A where A is some extra content, then you have problems
>> Not really, you can define similarities based on sentences, words,
>> etc. You don't have to look for exact matches. Similar is close
> ...and very computationally expensive.
Today, maybe. Tomorrow? Who knows. I can imagine that there is something
like the soundex algorithm, but then for sentences, or even paragraphs.
E.g. a code or vector can be calculated for each paragraph. This can be
done when a page is fetched. Fetching and comparing those vectors within
a database is not that much harder than the content check that is done
now.
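Something like that per-paragraph code could be a simhash-style fingerprint: similar paragraphs end up with integers that differ in only a few bits, so comparing them is a cheap Hamming-distance check instead of a text comparison. A rough Python sketch of the idea (my own illustration, not what any search engine actually runs):

```python
import hashlib

def simhash(text, bits=64):
    """Simhash-style fingerprint: each word votes on every bit position;
    similar texts produce fingerprints differing in few bits."""
    v = [0] * bits
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Storing those fingerprints in a database then reduces near-duplicate detection to looking up codes within a small Hamming distance of each other.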
> Search engines are having a
> hard time indexing billions of pages and picking up key words. Now you
> ask them to calculate similarities in a graph with billions of nodes?!
Isn't that already happening? Duplicate content? That step only needs to
be refined. Maybe in two years it will cost no more than the current
duplicate check does, so it's certainly within reach, and maybe even
already developed.
>> I am sure there has already been a lot of research done. For example,
>> students copy papers written by others.
> Yes, I know, but people mocked it for being unreliable. Besides, you
> can easily run filters that will do some permutations and replace
> words with proper equivalents. Brute force would do the job.
Yes, and hence it will get harder and harder.
>>> To a black hat SEO it would be no problem to automate this and
>>> deceive the search engines. It is much easier to carry out a robbery
>>> than it is for the police to spot the crook in a town of millions.
>> You don't do exact matches in cases like this, just fuzzy matches.
> Using that analogy again, that's like doing a house-to-house search
> and questioning all the residents.
But you have already eliminated all houses that certainly have nothing
to do with it. The trick is to minimize both the number of false
positives and negatives (hence improving the certainty). Those
techniques are used for spam, for virus detection, etc. And I am sure
they will become more and more important to stop things like lyric
sites, usenet archives, and free content cloning.
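Those fuzzy matches can be as simple as word shingles plus Jaccard similarity: split each page into overlapping word n-grams and compare the sets. Pages whose sets barely overlap are the "houses" you eliminate immediately; only pairs above some threshold get a closer look. A rough Python sketch (shingle size and names my own):

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Swapping a few words for "proper equivalents" only removes the shingles that touch those words, so a copied page still scores far above an unrelated one.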
John Perl SEO tools: http://johnbokma.com/perl/
Experienced (web) developer: http://castleamber.com/
Get a SEO report of your site for just 100 USD: