__/ [ catherine yronwode ] on Tuesday 28 February 2006 22:52 \__
Right *pulls sleeves*... here we have a lengthy post with plenty of
information to digest. I read it through quickly, but I had to put off an
answer due to the dullest chores conceivable (booking for 6, a conference in
Washington next month). This kept me away from UseNet and my usual Web
activities <sarcasm type="self-derogatory"> and I can sense the withdrawal
symptoms</sarcasm>. *hand shake*
As a foreword, I think you have an excellent idea, but I doubt its
practicability and the rigour one could invest in it. I will now try to
comment as I go along. Here we go...
> Roy Schestowitz wrote:
>> > I am not talking about unasked for links to a site, or garbled scraping,
>> > merely direct, unauthorized copying of whole articles / major portions
>> > of articles (expressed as a percentage, e.g. "URL xyz contains 67% text
>> > identical to your URL jkl.")
>> You cannot quantify such things easily, just as you cannot merge two
>> pieces of similar text. Try, for example, to forge together two 'forks' of
>> text which have been worked on by different individuals. To use a familiar
>> example, have you ever mistakenly edited some older version of a text that
>> you worked on, only to *later* reveal that you had worked on an out-of-
>> date version? This is not the case with syntactic code, for instance, as
>> it can often be merged (CVS-like tool), much like isolated paragraphs in
>> text, which benefit from tools like 'diff'. Been there, (colleagues) done
>> that.
> Okay, i see i was not clear enough. I am looking for a service that
> provides a semi-automated version of what i have successfully done by hand.
No job should be done by hand (don't interpret this in a sexual context,
please). Ideally, all should be self-sustaining and self-managing, which
brings up some social issues that we have yet to confront.
> VERSION ONE -- GOOGLE-BASED
> 1) I submit my top 250 keywords to your web interface. I also submit one
> ten-word sentence fragment (a "check phrase") for each URL i am
> protecting. You may set me a limit on the number of pages i can protect for
> a given amount of fee. Let's say you allow me 100 pages. My 250 keywords,
> 100 URLs, and the accompanying 100 check phrases are permanently logged
> at your site (but can be changed by an "edit" function). The check
> phrase for each URL is MY responsibility to choose and must be way
> unique. Like, say (real example):
> "contingent of spiritually-inclined folks who will not use common"
> which is from
> 2) Your service bot goes to google (or -- see below for VERSION TWO, in
> which it does not go to google, but rather to a google-generated
> "personal cache") and it searches on the 100 check-phrases. In the real
> life example above, my check-phrase turns up 4 matches. Two are at my
> own domain (one is a weirdly garbled URL that i have no idea what it's
> about, but probably some wacky symbolic link thingie that my husband
> screwed around with) and thus are eliminated -- and the other 2 are not
> at my domain and thus are potential cases of illegal copyright
> infringement, and are logged at your web-based interface so i can view them.
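The filtering in step 2 (keep only check-phrase hits outside my own domain) is easy to sketch. Here is a minimal Python illustration, assuming the result URLs have already been obtained from the search engine; `example.com` stands in for the protected domain, and the hit list is hypothetical:

```python
from urllib.parse import urlparse

def flag_infringements(result_urls, own_domain):
    """Drop check-phrase hits on my own domain; whatever is
    left is a potential infringement to log for review."""
    external = []
    for url in result_urls:
        host = urlparse(url).netloc.lower()
        # treat www.example.com and example.com as the same site
        if host == own_domain or host.endswith("." + own_domain):
            continue
        external.append(url)
    return external

# hypothetical search hits for one check-phrase: two on my own
# domain, two elsewhere, mirroring the real-life example above
hits = [
    "http://www.example.com/page.html",
    "http://www.example.com/weird-symlink-thing",
    "http://ausetkmt.com/copy.html",
    "http://freewill.tzo.com/~callista/copy.html",
]
print(flag_infringements(hits, "example.com"))
```

The two external URLs are what would be logged at the service's interface for review.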
Okay, so far so good. However, bear in mind you intend to run a *service*
here. Even 100 queries become a heavy load if you have 1,000 thirsty
customers that make the service affordable (or at least self-covering).
I recently read that gada.be is refused access by del.icio.us, which shows
that the whole Web 2.0 'spirit' does not work in practice. Convincing Google
to give up a share of their bandwidth and computing power would take
negotiation. Is removal of duplicates helpful to Google? Yes. Still, this
cannot be done behind their back and at their own expense. The acceptance of
complaints and subsequent review are also expensive; these are manual
processes. You attempt to automate something you do by hand, but in turn, it
can raise the amount of manual workload over at the Googleplex.
> 3) The bot does a whois lookup on the two infringing domains -- in this
> real-life example:
> ausetkmt. com
> freewill.tzo. com/~callista
> and it logs the data in your web-based interface so i can view it.
> 4) The bot obtains, from a cache, three copies of a customizable
> "friendly" (stage one) complaint letter and drafts them for each domain:
> one to the owner, one to the owner's tech contact (in my experience
> owners who plagiarize often claim inability to delete files as a reason
> to avoid action; this works around their excuse-making), and one to the
> domain's isp.
...provided that all details are recorded consistently and the appropriate
fields can be pulled after parsing. Ideally, you need access to the ICANN
databases rather than relying on Web interfaces that make fetching such
information rather tricky and prone to breakage (e.g. changes to the interface).
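To make the parsing worry concrete: whois output is free-form text whose layout differs by registry, so any field extraction is best-effort. A sketch, where the field labels are only one common layout and the sample record is invented:

```python
import re

# hypothetical whois record; real registries vary their labels
SAMPLE_WHOIS = """\
Registrant Name: Jane Doe
Registrant Email: owner@example.net
Tech Name: John Roe
Tech Email: tech@example.net
"""

def pull_contacts(whois_text):
    """Best-effort extraction of owner and tech addresses from
    raw whois output; breaks whenever the registry changes its
    layout, which is exactly the fragility noted above."""
    emails = {}
    for role in ("Registrant", "Tech"):
        m = re.search(rf"^{role} Email:\s*(\S+)", whois_text, re.MULTILINE)
        if m:
            emails[role.lower()] = m.group(1)
    return emails

print(pull_contacts(SAMPLE_WHOIS))
```

A production service would need one such parser per registry format, which is why direct database access would be far more robust.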
> 5) The bot generates a web-page based alert, displaying all information
> about the infringing sites and notifying me that the draft "friendly"
> complaints are ready to be sent.
> 6) At your web site, i can perform a personal check of the pages --
> similar to the Wikipedia "diff" function and displayed the same way
> (side by side) -- before i commit to sending the "friendly" complaint
> letters or abort the send.
> 7) If i decide to send the "friendly" complaint letters, this action is
> logged and dated and displayed at your web interface for my future reference.
> 8) There is a 'tickler" function that makes a re-check of any site to
> which i have sent a complaint at one-week intervals. This informs me at
> the web site whether the infringement is still up.
> 9) Decision fork:
> 9A) If the infringing page is gone, it is marked (in red) "Page No
> Longer Online" but it stays in the system for access anytime i wish to
> re-check my "History" with that domain (or my "History" in general).
> 9B) If the infringing page is still there, I am offered the option to
> send a strongly worded "legal" (stage two) complaint and to print two
> hard copies to be sent to the contact addresses for domain owner and
> isp. (Subsidiary idea: keep the snail-mail copyright department
> addresses of major isps -- and any isps ever contacted by the system --
> on file, for they are usually difficult to track down and it would save
> the client time having to look them up.)
> 10, 11, 12) Repeat steps 7, 8, and 9 for the "legal" complaint.
> 13) If the "legal" complaint generates no response, i am given the
> option of sending a fully documented letter (with all relevant date
> stamps and so forth from your service's history records) to google
> informing them of the infringement and requesting them to de-list the
> offending URL (or domain) from their SERPs. (Side-note: if the service
> is well-publicized, google will probably agree to honour their
> complaints. If three such services exist, they can form an Association
> and google will definitely have to deal with them.)
> 14) This ends the service's responsibilities. For any further actions, i
> must hire a lawyer.
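The review screen in step 6 could lean on Python's difflib, the same sort of machinery behind wiki diff views. A minimal sketch with two hypothetical page excerpts split into lines:

```python
import difflib

# hypothetical excerpts: my page versus the suspect copy
mine = ["a contingent of spiritually-inclined folks",
        "who will not use common phrases"]
theirs = ["a contingent of spiritually-inclined folks",
          "who refuse to use common phrases"]

# unified diff of my page against the suspect copy; difflib's
# HtmlDiff class can render the same comparison side by side
diff = list(difflib.unified_diff(mine, theirs,
                                 fromfile="my-page",
                                 tofile="suspect-page",
                                 lineterm=""))
for line in diff:
    print(line)
```

For the side-by-side display the post describes, `difflib.HtmlDiff().make_table(mine, theirs)` produces an HTML table directly embeddable in the service's review page.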
I am surprised that you are willing to go as far as that. Unless a site
copies your /entire/ content, would it ever be worth the time? (rhetorical)
I guess it depends on how much revenue/self pleasure the site generates.
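The friendly / legal / de-listing ladder, with its one-week tickler, amounts to a small state machine. A sketch in Python, where the stage names and the one-week interval come from the steps above and everything else (field names, functions) is assumed:

```python
from datetime import date, timedelta

# complaint stages, in escalation order, per the steps above
STAGES = ["friendly", "legal", "delist-request", "closed"]

def log_complaint(url, stage="friendly", sent=None):
    """Log one complaint with a recheck date one week out."""
    sent = sent or date.today()
    return {"url": url, "stage": stage,
            "sent": sent, "recheck": sent + timedelta(weeks=1)}

def escalate(record):
    """Move a still-infringing page to the next stage."""
    nxt = STAGES[min(STAGES.index(record["stage"]) + 1,
                     len(STAGES) - 1)]
    return log_complaint(record["url"], stage=nxt)

r = log_complaint("http://ausetkmt.com/copy.html")
r = escalate(r)  # infringement still up after a week
print(r["stage"])
```

The "History" view in step 9A is then just the list of all records ever logged, never deleted.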
>> The point of my babbling is that you can never measure such things
>> reliably, let alone know their meaning. Statistics have their flaws. What
>> if in your text you cited (and linked to) an article and then provided
>> some long quote? A second site could do the same and unintentionally
>> assimilate to your content. The issue of copyrights and intellectual
>> property suffers tremendously nowadays. Bear in mind that apart from
>> Google Groups, there are at least half a dozen Web sites that copy the *
>> entire* content of this newsgroup, making it public.
> This is all true, but not relevant. I am talking about webmasters who
> build sites competing with my site's SERPs by deliberate copyright
> infringement of my own copyright-protected web pages. See above.
OK, I now understand better.
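On the earlier "67% identical" idea: a rough overlap percentage is easy to compute, even if its legal meaning is another matter. A character-level sketch with Python's difflib, using invented sample strings:

```python
import difflib

def overlap_percentage(mine, theirs):
    """Rough character-level similarity, 0-100; a triage
    number for the review queue, not a legal standard."""
    ratio = difflib.SequenceMatcher(None, mine, theirs).ratio()
    return round(100 * ratio)

a = "contingent of spiritually-inclined folks who will not use common"
b = "contingent of spiritually-inclined folks who refuse common words"
print(overlap_percentage(a, b))
```

The service could show this number next to each flagged URL, leaving the judgement call to the client, exactly as the manual-review step intends.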
>> > As for attention by a human, i would expect a design that offered me the
>> > option to send automated cease and desist letters (customizable) to the
>> > domain owner / tech rep and isp host copyright rep. Why would
>> > abuse@isp.net become involved?
>> I was referring to the people employed by ISP to deal with abuse reports.
>> If they began to receive automated mail, there would be no barrier on the
>> amount of workload. This would also cast a shadow on abuse reports which
>> are submitted manually.
> The letters would be submitted manually. See above.
> [Google and Evil discussion tabled for another thread -- an interesting
> subject and one worthy of conversation, but off-topic here.]
>> >> > I would pay a yearly fee for such a service.
>> >> >
>> >> > Does it exist?
>> >> I doubt it.
>> > Could you design and market it?
>> *smile* I am not a businessman.
> Could you design it?
When I started iuron.com, I believed that I had an idea that would work. I
still believe that. I spoke to a distinguished professor in the field to get
some pointers. However, I can't believe I can ever afford the time. I lack
the desire too. Sometimes I wonder how many aspects of life (personal and
professional) I will neglect. I'm like a toddler choosing a different shiny
object every now and then. It's worrisome. Even my exercise regime has
dropped to 4 times a week, the lowest level in the 10 years since I got
started, and I have little passion for it. I think that indirectly answers your
question, in the most candid way.
>> >> > If not, why not? (And can those restrictions be overcome?)
>> >> Such a tool would need to hammer a search engine quite heavily. How
>> >> would the search engine feel about it and what does the search engine
>> >> have to earn?
>> > Well, a large (e.g. google) search engine could charge money for
>> > the service.
>> This raises further questions. If that was the case:
>> -Could Google benefit from permitting plagiarism nests to exist?
>> -Would people truly waste and invest money in fighting evil?
> Authors and businesses invest in fighting copyright and trademark
> infringement all the time. I spend many hours per year at the task. A
> semi-automated web-based system would save me 100-plus hours per year
> and a great deal of frustration. I would pay 250 dollars per year to
> subscribe, maybe more. A sliding scale of pricing could allow for
> different levels of examination based on varying the number of client
> keywords / number of client pages handled.
>> This reminds me of the idea of pay-per-E-mail as means of preventing spam.
> I don't see the similarity. I am talking about a web-based service to
> which i could subscribe that would allow me to patrol the web for
> copyright infringements.
True, I see now.
>> > Or, perhaps you could design a search engine to handle it in a way that
>> > does not hammer google.
>> > For instance, my field is occultism / religion / spirituality folklore.
>> > I supply your bot with keyword terms -- say 250 of them -- from my site.
>> > Your bot goes to google and collects the URLs for all sites ranking in
>> > the top 200 for all those terms.
>> > Then i submit my domain name to your bot. Your bot takes 1 page at a
>> > time from my domain and searches all cached URLs it had retrieved
>> > earlier from google. It then moves to my next page and repeats the
>> > search.
>> *smile* You got greedy.
>> ,----[ Quote ]
>> | Your bot takes 1 page at a time from my
>> | domain and searches all cached URLs
>> The practicality of search engines is based on the fact that you index
>> sites off-line. You can't just go linearly searching for duplicates. The
>> least you can do is find pointers to potential culprits by using the
>> indices. I guess I have missed your point though. If you are talking about
>> surveying and analysing top pages for a given search phrase, how far
>> should you go? There are infinitely many search phrases.
> That is true -- and that is why, when you spoke of "hammering google," i
> theorized another, less google-intensive way to do the job. Here is how
> i envision it working with a non-google-hammering web interface, relying
> only peripherally on google to generate the initial batch of cached pages.
> VERSION TWO -- PERSONAL CACHE BASED
> 1) I submit my top 250 keywords to your web interface. I also submit one
> ten-word sentence fragment (a "check phrase") for each URL i am
> protecting. My 250 keywords, 100 URLs, and the accompanying 100 check
> phrases are permanently logged at your site (but can be changed by an
> "edit" function). The check phrase for each URL is MY responsibility
> to choose and must be way unique.
> 2) Your bot goes to google only ONCE for each of those 250 keywords, finds
> the top 200 results for each keyword, and caches them offline. 200 x 250
> = 50,000 pages -- but there will be duplications of common terms, so,
> with duplication eliminated, we might theorize that those 50,000
> potential URLs will actually reduce down to 25,000 pages. Whatever the
> number, that would be my personal index cache at your service.
Cache is a non-real-time element. I think the load would remain higher than
you anticipate.
> 3) If a trial proved that the above numbers were unworkable, we could
> limit my input to 100 keywords x top 100 results at google per keyword.
> This would result in 10,000 pages, which, with duplication eliminated,
> might reduce to 5,000 pages.
The numbers still appear quite steep. Before taking this seriously, I think
negotiation with Google is worthwhile (as well as my submitting my thesis
and getting it out of the way). If you are serious about this, I have
contact with a Google manager, so I could maybe attempt a proposal...
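The 50,000-to-25,000 reduction in step 2 is just deduplication across the per-keyword result lists. A sketch (the URL lists are invented):

```python
def dedupe(url_lists):
    """Collapse per-keyword result lists into one cache index,
    keeping first-seen order; overlap across related keywords
    is what shrinks 50,000 raw hits toward 25,000 unique pages."""
    seen, cache = set(), []
    for urls in url_lists:
        for url in urls:
            if url not in seen:
                seen.add(url)
                cache.append(url)
    return cache

lists = [["a.com/1", "b.com/2"], ["b.com/2", "c.com/3"]]
print(dedupe(lists))
```

How far the total actually shrinks depends on how much the 250 keywords overlap in their top results, which only a trial run would tell.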
> 4) Levels of payment could be arranged for a 100 / 100 search or a 250 /
> 200 search or whatever other arrangements you deemed feasible. Thus
> clients would pay for the amount of breadth and depth of search -- and
> the amount of cache space at your end -- that they required.
Web-based and 'cache' are somewhat conflicting notions, which I thought worth
pointing out: the Web browser cannot read from or write to physical media.
> 5) I could, at a specified interval -- say once a month -- rewrite my
> 250 (or 100) keywords. In any case, the 200 (or 100) top results for
> each keyword would be automatically updated at google once a month (or
> every three months, if that is easier.)
> 6) When i submit my ten-word check-phrases, your bot does not return to
> google, but rather searches my personal index cache.
> 7) I believe that this system would be sufficient (and better than
> hammering google) because my MAJOR goal is to eliminate successful
> competitors for SERPs, and other, less successful plagiarists are of far
> lesser concern. A button at the web site that initiates a once-a-year
> sweep of all google cached pages (as opposed to all of my personal
> indexed cache pages at your service) would be sufficient to eliminate
> the low-level plagiarists.
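Step 6's local search needs no search engine at all: once the pages are cached, a check-phrase scan is a normalised substring test. A sketch over a hypothetical cache mapping URL to page text:

```python
def scan_cache(cache, phrase):
    """Return URLs whose cached text contains the check-phrase.
    Whitespace is normalised so line-wrapping in the cached
    page can't hide a match."""
    needle = " ".join(phrase.split()).lower()
    hits = []
    for url, text in cache.items():
        haystack = " ".join(text.split()).lower()
        if needle in haystack:
            hits.append(url)
    return hits

# invented cache entries for illustration
cache = {
    "http://ausetkmt.com/copy.html":
        "a contingent of\nspiritually-inclined folks who will\n"
        "not use common phrases",
    "http://other.example/original.html":
        "entirely different wording",
}
print(scan_cache(cache,
    "contingent of spiritually-inclined folks who will not use common"))
```

Scanning even 25,000 cached pages this way is cheap compared with issuing fresh queries, which is the whole point of VERSION TWO.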
This leads me to thinking: what about Google's own detection of duplicates?
One could argue that 'interference' from a third party is undesirable.
>> > Your bot also updates its cache from google's top 200 once a month and i
>> > can change the keywords i want it to cache as well, on its next update.
>> That leaves gaps for misuse. Any control that is given to the user over
>> indexing, keywords and the like is bound to break. This must be the reason
>> why search engines ignore meta data and will never have second thoughts.
> I disagree. This is a service that the user pays for and as long as the
> interface is clear, clean, and functional, it is the service's
> responsibility to automate certain tasks and the user's responsibility to
> authorize the implementation of the semi-automated tasks.
What about misuse of the service, such as spammers signing up with the goal
of targeting competing sites?
> I really do think this is a useful commercial service just waiting to
> happen. I look forward to your further comments, as you are one of the
> few people i know in the world who can discuss these matters at all, as
> well as being kind to those who, like me, are merely logical thinkers
> and not actually computer programmers.
> cat yronwode
I am flattered by your words, Cat.
Roy S. Schestowitz | "Disk quota exceeded; sig discontinued"
http://Schestowitz.com | SuSE Linux | PGP-Key: 0x74572E8E
12:45pm up 8:23, 4 users, load average: 0.24, 0.32, 0.44
http://iuron.com - help build a non-profit search engine