Re: SEO technology for Copyright Patrol?

Subject: Re: SEO technology for Copyright Patrol?
From: Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx>
Date: Tue, 28 Feb 2006 07:08:43 +0000
Newsgroups: alt.internet.search-engines
Organization: schestowitz.com / MCC / Manchester University
References: <43FFCC19.5858DD0B@luckymojo.com> <dtouef$oim$1@godfrey.mcc.ac.uk> <4403E1CD.D3E0BDEB@luckymojo.com>
Reply-to: newsgroups@xxxxxxxxxxxxxxx
User-agent: KNode/0.7.2

__/ [ catherine yronwode ] on Tuesday 28 February 2006 05:38 \__

> Roy Schestowitz wrote:
>> 
>> __/ [ catherine yronwode ] on Saturday 25 February 2006 03:16 \__
>> 
>> > Does anyone know of either a software package or a subscription service
>> > that employs search engine technology to roam the net looking for
>> > samples of copyright infringement / plagiarism?
>> 
                                  < snip />
>> >
>> >
>> > The subscription service would also do a whois lookup and find the email
>> > and street addresses of the domain owner and the domain host. It then
>> > (hand-supervised, probably) would auto-generate legal letters of
>> > complaint to the domain contact(s) and the contacts for the isp hosting
>> > the site. This information would go into a weekly log. As a service, it
>> > could be programmed to auto-generate the exact forms requested by the
>> > major isps, such as yahoo. It would also presumably continually revisit
>> > pages that it found had been infringed until the infringement was
>> > terminated.
>> 
>> That's a lot of automated traffic, which can raise many concerns. What,
>> for example, will you do when the offending site copies in part or
>> attributes the source using a link? This needs careful attention and
>> judgment by a human, preferably the victim. Also, imagine the load on
>> abuse@isp . net.
> 
> I am not talking about unasked for links to a site, or garbled scraping,
> merely direct, unauthorized copying of whole articles / major portions
> of articles (expressed as a percentage, e.g. "URL xyz contains 67% text
> identical to your URL jkl.")

You cannot quantify such things easily, just as you cannot easily merge two
similar pieces of text. Try, for example, to merge two 'forks' of a text
that have been worked on by different individuals. To use a familiar
example, have you ever edited a text, only to realise *later* that you had
been working on an out-of-date version? Source code is different, as it can
often be merged automatically (with a CVS-like tool), and so can isolated
paragraphs of text, which benefit from tools like 'diff'. Been there,
(colleagues) done that.
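
For what it is worth, here is a minimal Python sketch of the kind of
paragraph-level comparison that 'diff'-like tools perform. The function name
and the blank-line paragraph convention are assumptions made for
illustration; it uses difflib from the standard library, not CVS or diff
themselves.

```python
import difflib


def changed_paragraphs(old_text, new_text):
    """Return (tag, old_paragraphs, new_paragraphs) for each differing region.

    Paragraphs are assumed to be separated by blank lines.
    """
    old = [p.strip() for p in old_text.split("\n\n") if p.strip()]
    new = [p.strip() for p in new_text.split("\n\n") if p.strip()]
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only inserted, deleted or replaced regions
    ]
```

It only reports which paragraphs differ. Deciding which version should win
is still left to a human.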

The point of my babbling is that you can never measure such things reliably,
let alone know what they mean. Statistics have their flaws. What if in your
text you cited (and linked to) an article and then provided some long quote?
A second site could do the same and unintentionally come to resemble your
content. Respect for copyright and intellectual property suffers
tremendously nowadays. Bear in mind that apart from Google Groups, there are
at least half a dozen Web sites that copy the *entire* content of this
newsgroup, making it public.
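
To make the difficulty concrete, here is a minimal sketch of how a figure
such as "67% identical" might be computed, assuming overlap is counted over
runs of five consecutive words ('shingles'). The function names and the
choice of k=5 are illustrative assumptions, not something any existing
service is known to use.

```python
import re


def shingles(text, k=5):
    """Return the set of all runs of k consecutive lower-cased words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def percent_copied(original, suspect, k=5):
    """Percentage of the original's k-word runs that also occur in suspect."""
    source = shingles(original, k)
    if not source:
        return 0.0  # original is shorter than k words, nothing to compare
    return 100.0 * len(source & shingles(suspect, k)) / len(source)
```

Two pages that independently quote the same long passage share most of
their shingles, so the number alone cannot tell quotation from copying.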

> As for attention by a human, i would expect a design that offered me the
> option to send automated cease and desist letters (customizable) to the
> domain owner / tech rep and isp host copyright rep. Why would abuse@isp
> . net become involved?

I was referring to the people employed by ISPs to deal with abuse reports. If
they began to receive automated mail, there would be no limit to their
workload. This would also cast a shadow on abuse reports that are submitted
manually.

>> > Thinking farther ahead, if a partnership with google were made, google
>> > could agree to de-list sites that did not comply with the legal
>> > complaint procedures (e.g. those bootleg Rumanian sites). (I do not want
>> > to get off onto a tangent about google's own copyright infringement
>> > issues; i know about them and i hope and trust that they will be
>> > resolved. This is just an idea, that's all, so please do not turn it
>> > into an excuse for google-bashing. Thanks.)
>> 
>> Why just Google? *smile* It promotes monoculture.
> 
>  Because they are sharp, good at what they do, and abjure evil.

I have got my own thoughts on that last point. George Bush said he invaded
Iraq to save us all from WMD. Everyone believed him at the time. Google
have done some evil things since their boasting of the mythical mantra in
the previous decade. Some of their actions were financially motivated.

>> > I would pay a yearly fee for such a service.
>> >
>> > Does it exist?
>> 
>> I doubt it.
> 
> Could you design and market it?

*smile* I am not a businessman.

>> > If not, why not? (And can those restrictions be overcome?)
>> 
>> Such a tool would need to hammer a search engine quite heavily. How would
>> the search engine feel about it and what does the search engine have to
>> earn?
> 
> Well, a large (e.g. google) search engine could charge money for
> the service.

This raises further questions. If that was the case:

-Could Google benefit from permitting plagiarism nests to exist?

-Would people truly waste and invest money in fighting evil?

This reminds me of the idea of pay-per-E-mail as a means of preventing spam.

> Or, perhaps you could design a search engine to handle it in a way that
> does not hammer google.
> 
> For instance, my field is occultism / religion / spirituality / folklore.
> I supply your bot with keyword terms -- say 250 of them -- from my site.
> Your bot goes to google and collects the URLs for all sites ranking in
> the top 200 for all those terms.
> 
> Then i submit my domain name to your bot. Your bot takes 1 page at a
> time from my domain and searches all cached URLs it had retrieved
> earlier from google. It then moves to my next page and repeats the
> search.

*smile* You got greedy.

,----[ Quote ]
| Your bot takes 1 page at a time from my 
| domain and searches all cached URLs
`----

Search engines are practical only because sites are indexed off-line, ahead
of any query. You can't just go searching linearly for duplicates. The best
you can do is use the indices to find pointers to potential culprits. I guess
I have missed your point though. If you are talking about surveying and
analysing top pages for a given search phrase, how far should you go? There
are infinitely many search phrases.
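
A minimal sketch of using an index to find pointers to potential culprits,
assuming the pages have already been fetched and are held in a dict that
maps URL to text. All names here are hypothetical, and this is not a claim
about how any particular search engine works internally.

```python
from collections import defaultdict


def build_index(pages, k=5):
    """Map every k-word phrase to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        words = text.lower().split()
        for i in range(len(words) - k + 1):
            index[tuple(words[i:i + k])].add(url)
    return index


def candidates(index, my_text, k=5):
    """Return (url, shared_phrase_count) pairs, most shared phrases first."""
    hits = defaultdict(int)
    words = my_text.lower().split()
    seen = set()
    for i in range(len(words) - k + 1):
        phrase = tuple(words[i:i + k])
        if phrase in seen:
            continue  # count each distinct phrase of mine only once
        seen.add(phrase)
        for url in index.get(phrase, ()):
            hits[url] += 1
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)
```

The index is built once, off-line; each of my pages is then looked up
phrase by phrase instead of being compared against every cached page in
turn. The URLs that come back are only candidates and still need a closer
(human) look.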

> That way it does not hammer google, but builds a database from customer
> keywords.
> 
> Your bot also updates its cache from google's top 200 once a month and i
> can change the keywords i want it to cache as well, on its next update.

That leaves room for misuse. Any control that is given to the user over
indexing, keywords and the like is bound to break. This must be the reason
why search engines ignore meta data and will never reconsider.

> How about that?
> 
> cat yronwode
> http://www.luckymojo.com/blues.html
> Blues Lyrics and Hoodoo

With kind regards,

Roy

-- 
Roy S. Schestowitz      | "Quote when replying in non-real-time dialogues"
http://Schestowitz.com  |    SuSE Linux     |     PGP-Key: 0x74572E8E
  6:45am  up 1 day  2:56,  8 users,  load average: 0.42, 0.71, 0.62
      http://iuron.com - help build a non-profit search engine
