
Re: wget

  • Subject: Re: wget
  • From: Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx>
  • Date: Sun, 08 Jan 2006 18:45:04 +0000
  • Newsgroups: comp.os.linux.misc
  • Organization: schestowitz.com / MCC / Manchester University
  • References: <1136628339.017608.12380@o13g2000cwo.googlegroups.com> <BJ-dnau-iuDJY13eRVnzvA@telenor.com>
  • Reply-to: newsgroups@xxxxxxxxxxxxxxx
  • User-agent: KNode/0.7.2
__/ [moma] on Sunday 08 January 2006 11:58 \__

> fritz-bayer@xxxxxx wrote:
>> Hi,
>> 
>> I'm trying to mirror the homepages retrieved from a Google search. The
>> problem is that wget does not retrieve the homepages from the search.
>> 
>> Does anybody know how to use wget, so that it will download the
>> homepage for each search result returned from the search?
>> 
> An idea,
> 
> $ lynx --dump http://test.com    (<- place your web engine search here)
> 
>  ....
>  References
>    Visible links
>     1. http://test.com/servlet/com.test.servlet.account.Login
>     2. javascript:popWin('/phoenix/tour_home.htm')
>     3. http://test.com/phoenix/contact_general.htm
>     4. http://test.com/phoenix/so_1_2.htm
>     5. http://test.com/phoenix/so_1_3.htm
>     6. http://test.com/phoenix/so_1_4.htm
>     7. http://test.com/phoenix/so_1_5.htm
>     8. http://test.com/phoenix/so_2_2.htm
>     ....

$ wget http://test.com

With the page retrieved, e.g. a SERP (search engine results page), you can also
extract the list of links, which you then pass to wget for traversal.

$ cat ~/serp | perl -ne '@url=m!(http://[^>"]+)!g;print "$_\n" foreach @url' \
    > ~/googleurls
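
In its simplest form, that list can then be handed straight back to wget for a
flat, one-by-one fetch; a minimal sketch, assuming ~/googleurls as written by
the one-liner above:

$ wget -i ~/googleurls -T5 -t1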


> The listing begins after "References" and "Visible links" words. Study
> $ man lynx
> for other options.        ($ sudo apt-get install lynx)
> 
> Of course you can pipe the output to
>  | grep -e "^\ *[0-9]*\."    and
>  | grep "http://"  and
>  | uniq  | sort  and
>  | sed  etc.
> for further processing
> 
> and finally do wget -r -l1  -k -T5 SOME_HTTP_URL
> like this
> $ wget -r -l2  -k -T5   http://www.futuredesktop.org
> 
> which downloads the entire web-site.
> -k,  (--convert-links) will make links in downloaded HTML point to your
> local files.
> 
> $ man wget
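
As an aside, the lynx/grep route above can be collapsed into a single pipeline;
a rough sketch, assuming GNU grep's -o switch (the output file ~/serp_urls is
just an example name):

$ lynx --dump "http://test.com" | grep -o "http://[^ ]*" | sort -u > ~/serp_urls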


I tend to use wget in the following fashion for downloads that are 'kind'.

wget -r -l1 -H -t1 -nd -N -np -A.htm,.html,.php -erobots=off -i ~/url_list.txt
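
For reference, the switches spelled out (see $ man wget for the full story):

  -r -l1             recurse into the links found, but only one level deep
  -H                 span hosts, i.e. follow links leading to other sites
  -t1                try each URL only once, no retries
  -nd                no directory hierarchy; save everything in the current directory
  -N                 timestamping, so files that have not changed are skipped
  -np                never ascend to the parent directory
  -A.htm,.html,.php  accept only files with these suffixes
  -erobots=off       ignore robots.txt
  -i FILE            read the list of URLs from FILE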

The list should be newline-separated, i.e. one URL per line.
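
For instance (the URLs here are only placeholders):

$ cat ~/url_list.txt
http://www.example.org/index.html
http://www.example.net/docs/intro.htm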

Hope it helps,

Roy

-- 
Roy S. Schestowitz
http://Schestowitz.com  |    SuSE Linux     |     PGP-Key: 0x74572E8E
  6:35pm  up 29 days  1:46,  14 users,  load average: 0.51, 0.70, 0.66
