__/ [Darren Tipton] on Sunday 23 October 2005 10:49 \__
> On Tue, 18 Oct 2005 13:14:09 +0100, Roy Schestowitz
> <newsgroups@xxxxxxxxxxxxxxx> wrote on the topic of "Knowledge Engines - A
> Formal Proposition":
>>All in all, search engines at present encourage link-related spam and
>>content-related spam. In worse scenarios, their backlinks-based algorithms
>>lead to a rise in sponsored listings, whereas our natural incentive is to
>>prefer what would "work best for us", not what got recommended by
>>automated tools. These tools, which work at a shallow level without
>>understanding, opt to prioritise large corporations with money to spend
>>on good listings and inbound links.
> I read on the BBC website yesterday...
> "If consumers see a perceptible quality difference [with rival search
> engines], they will disappear," admits Mr Arora.
Someone in this group has recently pointed out better relevancy in Yahoo. I
personally disagree, but fingers point in different directions, which raises
the question of how objective such judgements can be.
> We presume he means, if the search results are not relevant to what someone
> is searching for... they will look elsewhere.
This must be the reason why the majority of people opt for Google even though
the default homepage tends to be msn.com.
>>The fundamental approach to tackling the problem is not overly complicated.
>>The goal is certainly feasible, while the resources to make it practical
>>are the primary barrier.
>>Since Iuron is an Open Source project, assemblage and construction of the
>>libraries would be rapid, making use of existing projects that fall
>>under the General Public Licence (GPL). In return, Iuron will provide a
>>potentially distributed environment, wherein any idle computer across the
>>world can assist crawling and report back to a main knowledge repository.
>>Think of it as a public-driven reciprocal effort to process and then
>>centralise human knowledge.
> At this moment in time, the Majestic12 project is doing this (which I
> believe is also open source). Using distributed computing power to crawl
> the web, using a C# based spider. It seemed to devour a few thousand pages
> on one of my sites pretty damn quickly.
I remembered to acknowledge Majestic12 yesterday
< http://schestowitz.com/Weblog/archives/2005/10/22/collaborative-crawl/ >
and I suspect it was you who pointed out the site in the first place. There
should be no "lust for
images", at least not initially. Having mentioned images, I do my research
in the field of computer vision, so there might be provision for image
analysis, classification and labelling too. Machine learning is quite
well-developed in that respect. Again, crawling should never rely on
captions as these can be intentionally deceiving.
I once had this idea of allowing users to describe an image that they seek
and then fetch the most relevant images from the Web.
A computer will have a reliable understanding of image contents:
understanding that surpasses the human eye and mind. Want a photo in which a
Labrador is drinking water from a fountain on a sunny day? Want a descriptive
verbal interpretation of a given distorted image? This will probably be
practical in the distant future. It is a machine learning/pattern
recognition task taken to the extreme.
> How do you intend to split the computing work load on this.. in that. Do
> you intend the spider simply to crawl large numbers of sites for "all"
> data, then let the user interface determine fact applicability? Or, do you
> want the spider to extract facts?
If you allow humans to intervene, the process becomes labour-intensive and
subjective. It's better to make the entire framework autonomous and
self-maintaining. Lies and mistakes, however, are an issue, especially urban
legends that tend to repeat themselves.
As for load management, one can always set hit intervals. The need to crawl
pages time after time is only crucial when you seek up-to-the-minute facts
that can change instantly. Thus, you can have some analysis of
momentary trends... much like tag clouds.
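The hit-interval idea above can be sketched in a few lines. This is only an
illustrative toy (the class name and interval value are my own invention, not
anything from Iuron or Majestic12): the scheduler simply refuses to fetch
from a host again until a minimum interval has elapsed.

```python
import time


class PoliteFetcher:
    """Toy crawl scheduler: enforce a minimum interval between hits per host."""

    def __init__(self, min_interval=30.0):
        # Seconds that must pass between two requests to the same host.
        self.min_interval = min_interval
        self.last_hit = {}  # host -> timestamp of the last request

    def ready(self, host, now=None):
        """Return True if enough time has passed to hit this host again."""
        now = time.time() if now is None else now
        last = self.last_hit.get(host)
        return last is None or (now - last) >= self.min_interval

    def record(self, host, now=None):
        """Note that a request to this host was just made."""
        self.last_hit[host] = time.time() if now is None else now
```

A distributed crawler would keep such state per worker (or in the shared
repository) so that idle machines do not hammer the same site in parallel.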
> I'm interested to know how you intend to discern a fact from a web page
The sources can be selective at the start (e.g. Wikipedia) and the crawlers
then extend with caution. There are words that characterise factual science,
for example, and tell it apart from a 'breakfast and lunch blog'.
I would toss a number 'off the cuff' and say that the proportion of pages
which are publicly-available spam is worrying. Ping traffic is possibly 80%
spam, so I believe the same might hold for content spam. Much of it just
doesn't get crawled. Spammers can produce 100 times the number of pages of
genuine sites if they put their minds to it. The secret lies in filtering,
which in itself is a machine learning task.
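To make the "filtering is a machine learning task" point concrete, here is a
minimal sketch of the classic approach: a multinomial Naive Bayes classifier
over word counts with add-one smoothing. Everything here (class name, labels,
training snippets) is hypothetical illustration, not part of any actual
crawler.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Minimal multinomial Naive Bayes over word counts, add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}  # word tallies
        self.docs = {"spam": 0, "ham": 0}                    # document tallies

    def train(self, text, label):
        """Add one labelled document to the model."""
        self.docs[label] += 1
        self.counts[label].update(text.lower().split())

    def classify(self, text):
        """Return the label with the highest log-probability for this text."""
        words = text.lower().split()
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total_docs = sum(self.docs.values())
        scores = {}
        for label in ("spam", "ham"):
            total_words = sum(self.counts[label].values())
            score = math.log(self.docs[label] / total_docs)  # prior
            for w in words:
                # Laplace-smoothed word likelihood.
                score += math.log(
                    (self.counts[label][w] + 1) / (total_words + len(vocab))
                )
            scores[label] = score
        return max(scores, key=scores.get)
```

In practice a crawler would run something of this kind over each fetched page
and simply decline to index (or re-crawl) anything scored as spam, which is
why so much content spam "just doesn't get crawled".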