
Re: A webcrawler for indexing a specific site

We intend to retrieve pages from a specified website (the URL may vary) and add them to our index. We currently use dotLucene as the index, but have support for other engines.

The desired output from the web crawler is the reference/URL, the text of the page and, where possible, an extracted date.
We have considered some open-source projects, but none matched our requirements (I don't have the list of requirements available at this location).
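As a rough illustration of the output described above (URL, page text, and an extracted date where possible), here is a minimal single-page extractor sketch using only Python's standard library. The names `PageExtractor` and `extract` are illustrative, not from any of the projects mentioned in this thread, and the date pattern is a simplistic assumption (ISO dates only); a real crawler would also need fetching, link-following and politeness logic:

```python
import re
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip = 0  # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

# Only matches ISO-style dates; a real extractor would try several formats.
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

def extract(url, html):
    """Return the URL / text / date triple requested in the post above."""
    parser = PageExtractor()
    parser.feed(html)
    text = " ".join(parser.text_parts)
    match = DATE_RE.search(text)
    return {
        "url": url,
        "text": text,
        "date": match.group(1) if match else None,
        "links": parser.links,  # candidates for the crawl frontier
    }

sample = ('<html><body><h1>News</h1><p>Posted 2006-02-09.</p>'
          '<a href="/next">more</a></body></html>')
doc = extract("http://example.com/", sample)
```

The `links` list would feed a crawl queue, filtered to stay on the one site being indexed; the dict itself is what would be handed to dotLucene (or another engine) for indexing.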


Andreas

Roy Schestowitz wrote:
__/ [Andreas Ringdal] on Thursday 09 February 2006 10:36 \__

Does anyone know of a webcrawler I can use for indexing a specific site
into a local index?

Do you intend to use third-party software or a Web service run by somebody else to generate indices and then deliver them to you, e.g. as a download? Incidentally, "WebCrawler" is the name of a company; the more suitable term is "Web crawler". Poor descriptions tend to get poor answers, which is why it's worth asking before detailed and elaborate answers are given.

To generate indices locally, I know of Entropy Search, phpdig and htdig.
However, the format of the indices may be obscure (e.g. involve binaries)
rather than standardised (e.g. XML). Different search engines retain indices
differently (proprietary methods), I imagine, which makes collaboration hard.
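To show what a standardised interchange format might look like, here is a sketch that serialises index entries to XML with Python's standard library. The element names (`index`, `document`, `url`, `text`, `date`) are assumptions for illustration, not any engine's actual schema:

```python
import xml.etree.ElementTree as ET

def entries_to_xml(entries):
    """Serialise crawled entries (dicts with url/text/optional date) to XML."""
    root = ET.Element("index")
    for entry in entries:
        doc = ET.SubElement(root, "document")
        ET.SubElement(doc, "url").text = entry["url"]
        ET.SubElement(doc, "text").text = entry["text"]
        if entry.get("date"):  # date is optional, as in the original request
            ET.SubElement(doc, "date").text = entry["date"]
    return ET.tostring(root, encoding="unicode")

xml_out = entries_to_xml([
    {"url": "http://example.com/", "text": "hello", "date": "2006-02-09"},
])
```

An exchange format like this would let one tool do the crawling and another do the indexing, sidestepping the proprietary on-disk formats mentioned above.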

