
Re: Mirroring website without wget?

__/ [ SteveN ] on Wednesday 22 February 2006 10:35 \__

> Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx> wrote in
> news:dthbe5$k6v$4@xxxxxxxxxxxxxxxxx:
>>> Are there any other tools that I can use to mirror a website without
>>> using wget (which doesn't seem to work).
>> 
>> scp, ftp, among others. Are you *mirroring* a site or just *scraping*
>> it? Is it /your/ site?
> 
> No, it's not my site, and it doesn't have ftp access.  I want to copy
> images from it, and although I could browse it manually, and copy the
> images from my browser's cache, it is just so time-consuming.


It doesn't sound as though what you are trying to do is legitimate or
ethical.


>>> The site needs a username and password (which I have) and I copied
>>> the cookies from Firefox to the wget directory after properly logging
>>> in. Firefox, Opera, and for that matter Internet Explorer have no
>>> problems once I have logged in, but it seems wget is getting confused
>>> by some javascript nastiness that sends it off-site.
>> 
>> Maybe grabbers are denied as a matter of principle. Maybe some
>> user-agent sniffing is involved, in which case you must spoof.
> 
> Yes, I am spoofing, as Mozilla - most of the pages work, but when it comes
> to a 'protected' page, it seems to ignore the cookies I already have, and
> then tries to force me to log in again (I think!)  I'll have to examine
> the traffic in Ethereal to see if I can spot anything.


Page authentication is not something I know how to deal with, at least not
without exploring the intricacies of wget. It might not offer that facility
at all.
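That said, wget does document cookie and user-agent options, so it may be
worth a try before writing it off. A minimal sketch, assuming the session
cookies have been exported from Firefox to a Netscape-format cookies.txt
(the file name and URL here are placeholders, not from this thread):

wget --load-cookies=cookies.txt \
     --user-agent="Mozilla/5.0" \
     -r -np http://example.com/protected/

If the site only sets session cookies at login time, doing the login with
wget itself via --save-cookies and --keep-session-cookies may be needed
before the recursive fetch.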


>>> I *think* what I am asking is if there is an extension to Firefox
>>> which allows it to be used as a mirroring tool?  Googling just seems
>>> to give me lots of 'mirrors of firefox' rather than what I am after.
>> 
>> 
>> There are mirroring tools for Web sites that are owned by the
>> 'mirrorer'. I syndicate Firefox plug-ins on a daily basis and I have
>> not come across such an extension.
> 
> Oh, well ...
> 
>> There are Google scrapers in the wild, so you might be able to re-use
>> them. They should be easy to identify on the Net.
> 
> I don't think I need the scrapers yet.
>  
>> Hope it helps,
> 
> Thanks


wget -r -l3 -H -t3 -nd -N -np -A.jpg,.jpeg,.gif,.png,.bmp -erobots=off \
     -i list_of_sites.txt
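Briefly: -r recurses, -l3 limits the depth to three levels, -H lets it span
hosts, -t3 retries each file three times, -nd dumps everything into one
directory, -N only fetches files newer than local copies, -np stops it from
ascending to the parent directory, -A restricts downloads to the listed
image extensions, -erobots=off ignores robots.txt, and -i reads the start
URLs from list_of_sites.txt.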

Hope it helps,

Roy

-- 
Roy S. Schestowitz      |    "I regularly SSH to God's brain and reboot"
http://Schestowitz.com  |    SuSE Linux     |     PGP-Key: 0x74572E8E
 11:50am  up 5 days  0:09,  8 users,  load average: 1.06, 0.75, 0.59
      http://iuron.com - Open Source knowledge engine project
