- 04-21-2009 #1 Just Joined!
- Join Date: Apr 2009
- Posts: 18
Get a 'complete' webpage with wget or curl
Can anyone help?? I have put * in the links so they aren't live.
I am stuck trying to get a complete webpage from the command line, just as the right-click save in Firefox or Epiphany does.
The page is completed with JavaScript, and the data is hidden when viewing the source.
Here is an example (it's not the actual site I want, but it gives the same result):
Code:
h**p://guida.tv.it/guidatv/grid.html
If you save the page in Firefox, you get a 350+ KB file with all the links in it.
However, with
Code:
wget -p - h**p://guida.tv.it/guidatv/grid.html >> /home/me/out.html
I get an 80 KB file without the links or the data that the JavaScript creates.
Using perl fails:
Code:
perl -MWLP::Simple -e 'getprint "h**p://guidatv.sky.it/guidatv/grid.html"'
with
Code:
Can't locate WLP/Simple.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.8.8 /usr/local/share/perl/5.8.8 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl .).
BEGIN failed--compilation aborted
I have tried combinations of the commands below:
(cd cmdline && wget -nd -pHEKk h**p://www.pixelbeat.org/cmdline.html)    # Store local browsable version of a page to the current dir
wget -c h**p://www.example.com/large.file    # Continue downloading a partially downloaded file
wget -r -nd -np -l1 -A '*.jpg' h**p://www.example.com/    # Download a set of files to the current directory
wget f*p://remote/file[1-9].iso/    # FTP supports globbing directly
wget -q -O- h**p://www.pixelbeat.org/timeline.html | grep 'a href' | head    # Process output directly
echo 'wget url' | at 01:00    # Download url at 1AM to current dir
wget --limit-rate=20k url    # Do a low priority download (limit to 20KB/s in this case)
wget -nv --spider --force-html -i bookmarks.html    # Check links in a file
wget --mirror h**p://www.example.com/    # Efficiently update a local copy of a site (handy from cron)
wget -r -l 1 h**p://www.nameofsite.com
wget -O - w*w.google.com | html2text > google.txt
I can get the JavaScript files that create the page, or mirror the site, but I don't want to do that. I just want to replicate saving the complete page as one file and get the data into a text file that I can strip and edit. I want to get data from several sources for a statistical program.
Does anyone know how to do this??
- 04-21-2009 #2
Code:
wget \
    --recursive \
    --no-clobber \
    --page-requisites \
    --html-extension \
    --convert-links \
    --restrict-file-names=windows \
    --domains website.org \
    --no-parent \
    www.website.org
- 04-22-2009 #3
Thanks for the reply.
However, this does much the same as wget with the -p option and only gets an 82 KB file. If I download through "save page" in a browser such as Firefox, I get 350+ KB of data with all the content in it. There must be something that saves inline JavaScript as well.
The file I am receiving is the same as "save HTML only", not the "save complete" file.
As I read the command you posted, it is:
get all files needed,
don't get files already got,
get all parts needed to display the file,
save as html,
follow links (convert on),
only write Windows-compatible file names,
span hosts (website.org),
don't follow links outside the parent folder.
Although this reads as exactly what I'm after, it still doesn't do it. I'll continue searching.
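For comparison, here is a sketch of what each long option in the command from post #2 does, as far as the wget manual describes it. The `opts` variable is only an illustration (it is not from the thread), and the script prints the command rather than running it. Note in particular that `--convert-links` rewrites links in the saved files for local viewing rather than following them, and `--domains` restricts which hosts are followed rather than enabling host spanning.

```shell
# Build the option string one flag at a time so each flag can carry a comment.
# "opts" is a hypothetical variable name used only for this sketch.
opts=""
opts="$opts --recursive"                     # follow links found in pages (recursive retrieval)
opts="$opts --no-clobber"                    # do not download a file that already exists locally
opts="$opts --page-requisites"               # also fetch images, CSS, etc. needed to display the page
opts="$opts --html-extension"                # save text/html documents with an .html suffix
opts="$opts --convert-links"                 # after download, rewrite links to work offline
opts="$opts --restrict-file-names=windows"   # escape characters that Windows file names cannot contain
opts="$opts --domains website.org"           # only follow links to hosts in this domain list
opts="$opts --no-parent"                     # never ascend above the starting directory

# Print the assembled command instead of running it.
echo "wget$opts www.website.org"
```

None of these options makes wget execute scripts, so content that the page builds with JavaScript would still be missing from the saved file.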
- 04-22-2009 #4
As far as I can tell, the difference is that Firefox is downloading all the source files referenced in the page script, and wget is not. Against the site I checked, at least, the basic HTML file was identical. Remove the --domains line, and play with options like --span-hosts.
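To see which external script files a saved page references, something like the following rough sketch may help. The function name `list_script_srcs` and the file name in the usage comment are made up for illustration. It only handles lowercase `src="..."` attributes in double quotes and relies on `grep -o`, which GNU and BSD grep provide.

```shell
# List the src URLs of <script> tags in a saved HTML file.
# Hypothetical helper; handles only double-quoted, lowercase src attributes.
list_script_srcs() {
  grep -o '<script[^>]*src="[^"]*"' "$1" | sed 's/.*src="\([^"]*\)"/\1/'
}

# Usage (assuming the page was saved as grid.html):
#   list_script_srcs grid.html
```

Inline scripts have no src attribute and are skipped, so this shows only the files a downloader would have to fetch separately.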
- 04-22-2009 #5
It seems to be a problem with convert links. It is ignoring the whole section created by inline JavaScript. If I grab these files too, the page still isn't available offline with the data.
Firefox is doing something that wget isn't doing.
- 04-22-2009 #6
I tried to make wget behave the same as Firefox, i.e. with a large command line, as if it were being used as the engine behind a web page GUI:
Code:
wget -p -k \
    --referer="h**p://guidatv.sky.it" \
    --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" \
    --header="Accept:text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \
    --header="Accept-Language: en-us,en;q=0.5" \
    --header="Accept-Encoding: gzip,deflate" \
    --header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" \
    --header="Keep-Alive: 300" \
    -dnv h**p://guidatv.sky.it/guidatv/grid.html
Running this, you can see it skipping JavaScript or CSS in the standard output, and also misinterpreting some JavaScript # URLs. I am at a loss now: I can get everything but the data!!
Adding the -H and -E switches results in 1 KB of extra data, but not the information wanted.
This also applies to people wanting to view these types of websites offline. It seems that I will somehow have to convert the links to non-relative ones after working out how they work.
There must be a simple method, or each site may differ.
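A likely explanation for the size difference is that wget never executes JavaScript, whereas a browser saves the page after its scripts have run. One possible workaround, sketched here under assumptions, is to let a headless browser render the page and print the resulting DOM. The sketch assumes a Chromium-based browser is installed; the binary name (`chromium` here) varies between systems, and `dump_rendered` is a hypothetical helper name.

```shell
# Assumption: a Chromium-based browser is installed. The binary name differs
# between distributions (chromium, chromium-browser, google-chrome, ...).
BROWSER="${BROWSER:-chromium}"

# Print the DOM of a URL as it stands after the page's scripts have run.
# Hypothetical helper name, for illustration only.
dump_rendered() {
  "$BROWSER" --headless --dump-dom "$1"
}

# Usage (not run here):
#   dump_rendered "http://www.example.com/" > rendered.html
#   dump_rendered "http://www.example.com/" | html2text > rendered.txt
```

Whether the output contains the wanted data depends on the site, for example on whether the scripts finish loading before the DOM is dumped.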



