
Re: A webcrawler for indexing a specific site

We intend to retrieve pages from a specified website (the URL may vary) and add them to our index. We currently use dotLucene as the index, but have support for other engines.

The desired output from the web crawler is the reference/URL, the text of the page and, where possible, an extracted date.
We have considered some open-source projects, but none matched our requirements (I don't have the list of requirements available at this location).
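As a rough illustration of the output described above (URL, page text, and an extracted date where possible), here is a minimal single-page extractor sketch using only Python's standard library. The names `PageExtractor` and `extract` are illustrative, not from any of the projects mentioned in this thread, and the date pattern is a simplistic assumption (ISO dates only); a real crawler would also need fetching, link-following and politeness logic:

```python
import re
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip = 0  # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

# Only matches ISO-style dates; a real extractor would try several formats.
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

def extract(url, html):
    """Return the URL / text / date triple requested in the post above."""
    parser = PageExtractor()
    parser.feed(html)
    text = " ".join(parser.text_parts)
    match = DATE_RE.search(text)
    return {
        "url": url,
        "text": text,
        "date": match.group(1) if match else None,
        "links": parser.links,  # candidates for the crawl frontier
    }

sample = ('<html><body><h1>News</h1><p>Posted 2006-02-09.</p>'
          '<a href="/next">more</a></body></html>')
doc = extract("http://example.com/", sample)
```

The `links` list would feed a crawl queue, filtered to stay on the one site being indexed; the dict itself is what would be handed to dotLucene (or another engine) for indexing.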


Andreas

Roy Schestowitz wrote:
__/ [Andreas Ringdal] on Thursday 09 February 2006 10:36 \__

Does anyone know of a webcrawler I can use for indexing a specific site
into a local index?

Do you intend to use third-party software or a Web service run by somebody else to generate indices and then deliver them to you, e.g. as a download? Incidentally, "WebCrawler" is the name of a company; the more suitable term is "Web crawler". Poor descriptions tend to get poor answers, which is why it's worth asking before detailed and elaborate answers are given.

To generate indices locally, I know of Entropy Search, phpdig and htdig.
However, the format of the indices may be obscure (e.g. involve binaries)
rather than standardised (e.g. XML). Different search engines retain indices
differently (proprietary methods), I imagine, which makes collaboration hard.
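To show what a standardised interchange format might look like, here is a sketch that serialises index entries to XML with Python's standard library. The element names (`index`, `document`, `url`, `text`, `date`) are assumptions for illustration, not any engine's actual schema:

```python
import xml.etree.ElementTree as ET

def entries_to_xml(entries):
    """Serialise crawled entries (dicts with url/text/optional date) to XML."""
    root = ET.Element("index")
    for entry in entries:
        doc = ET.SubElement(root, "document")
        ET.SubElement(doc, "url").text = entry["url"]
        ET.SubElement(doc, "text").text = entry["text"]
        if entry.get("date"):  # date is optional, as in the original request
            ET.SubElement(doc, "date").text = entry["date"]
    return ET.tostring(root, encoding="unicode")

xml_out = entries_to_xml([
    {"url": "http://example.com/", "text": "hello", "date": "2006-02-09"},
])
```

An exchange format like this would let one tool do the crawling and another do the indexing, sidestepping the proprietary on-disk formats mentioned above.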

