  1. #1
    Just Joined! | Join Date: Apr 2009 | Posts: 18

    Get a 'complete' webpage with wget or curl


    Can anyone help?? I have put * in the links so they aren't live.

    I am stuck trying to get a complete webpage from the command line, just like a right-click 'Save Page As' in Firefox or Epiphany does.

    The page is built by JavaScript, and the data is not visible when viewing the source.

    Here is an example (it's not the actual site I want, but it gives the same result).

    If you save this page in Firefox:
    Code:
    h**p://guida.tv.it/guidatv/grid.html
    you get a 350+ KB file with all the links in it.

    However with
    Code:
    wget -p h**p://guida.tv.it/guidatv/grid.html >> /home/me/out.html
    I get an 80 KB file without the links or the data that the JavaScript creates.
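    (Note that wget writes its log to the console and saves the page itself to a file named after the URL, so the >> redirection above is probably not capturing the page. A plain single-file fetch, which still won't contain anything the JavaScript generates, would be more like:)
    Code:
    # save just the HTML document to a single file; no requisites, no JavaScript execution
    wget -O /home/me/out.html h**p://guida.tv.it/guidatv/grid.html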

    Using Perl fails:
    Code:
    perl -MWLP::Simple -e 'getprint "h**p://guidatv.sky.it/guidatv/grid.html"'
    with:
    Code:
    Can't locate WLP/Simple.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.8.8 /usr/local/share/perl/5.8.8 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl .).
    BEGIN failed--compilation aborted.
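    The module name in that one-liner is mistyped (WLP rather than LWP), which is what the @INC error is complaining about. Assuming libwww-perl is installed, the corrected one-liner would be along these lines, though, like wget, it only fetches the raw HTML and does not run the page's JavaScript:
    Code:
    perl -MLWP::Simple -e 'getprint "h**p://guidatv.sky.it/guidatv/grid.html"'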


    I have tried combinations of the commands below:
    Code:
    (cd cmdline && wget -nd -pHEKk h**p://www.pixelbeat.org/cmdline.html)  # store a local browsable version of a page in the current dir
    wget -c h**p://www.example.com/large.file                              # continue downloading a partially downloaded file
    wget -r -nd -np -l1 -A '*.jpg' h**p://www.example.com/                 # download a set of files to the current directory
    wget f*p://remote/file[1-9].iso/                                       # FTP supports globbing directly
    wget -q -O- h**p://www.pixelbeat.org/timeline.html | grep 'a href' | head  # process output directly
    echo 'wget url' | at 01:00                                             # download url at 1 AM to the current dir
    wget --limit-rate=20k url                                              # low-priority download (limit to 20 KB/s here)
    wget -nv --spider --force-html -i bookmarks.html                       # check links in a file
    wget --mirror h**p://www.example.com/                                  # efficiently update a local copy of a site (handy from cron)
    wget -r -l 1 h**p://www.nameofsite.com
    wget -O - w*w.google.com | html2text > google.txt

    I can get the JavaScript files that create the page, or mirror the whole site, but I don't want to do that. I just want to replicate saving the complete page as one file and get the data into a text file that I can strip and edit. I want to pull data from several sources for a statistical program.

    Does anyone know how to do this??

  2. #2
    Linux Guru reed9 | Join Date: Feb 2009 | Location: Boston, MA | Posts: 4,651
    Code:
    wget \
         --recursive \
         --no-clobber \
         --page-requisites \
         --html-extension \
         --convert-links \
         --restrict-file-names=windows \
         --domains website.org \
         --no-parent \
    www.website.org

  3. #3
    Just Joined! | Join Date: Apr 2009 | Posts: 18
    Quote Originally Posted by reed9:
    Code:
    wget \
         --recursive \
         --no-clobber \
         --page-requisites \
         --html-extension \
         --convert-links \
         --restrict-file-names=windows \
         --domains website.org \
         --no-parent \
    www.website.org
    Thanks for the reply.

    However, this does much the same as the wget -p option and only gets an 82 KB file. If I download the page with 'Save Page' in a browser such as Firefox, I get 350 KB+ of data with all the content in it. There must be something that saves the output of the inline JavaScript as well.

    The file I am receiving is the same as 'save HTML only', not 'save complete page'.

    As I read the command you posted, it is:
    download recursively,
    don't re-download files that are already there,
    get all parts needed to display the page,
    save with .html extensions,
    convert links for local viewing,
    only write Windows-compatible file names,
    restrict downloads to the domain website.org,
    don't ascend above the parent directory.

    Although this reads as exactly what I'm after, it still doesn't do it. I'll keep searching.

  4. #4
    Linux Guru reed9 | Join Date: Feb 2009 | Location: Boston, MA | Posts: 4,651
    As far as I can tell, the difference is that Firefox downloads all the source files referenced by the page's scripts, and wget does not. Against the site I checked, at least, the basic HTML file was identical. Remove the --domains line and play with options like --span-hosts.
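    For example, a combination along those lines (using the example URL from earlier in the thread; none of these options make wget execute JavaScript, so the script-generated grid data will still be missing) might be:
    Code:
    # page requisites, convert links, add .html extensions, and span hosts;
    # spanning can be limited to particular domains with -D if it pulls in too much
    wget -p -k -E -H h**p://guidatv.sky.it/guidatv/grid.html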

  5. #5
    Just Joined! | Join Date: Apr 2009 | Posts: 18
    It seems to be a problem with --convert-links: it ignores the whole section created by the inline JavaScript. Even if I grab those script files too, the page still isn't available offline with the data.

    Firefox is doing something that wget isn't.

  6. #6
    Just Joined! | Join Date: Apr 2009 | Posts: 18
    I tried to make wget behave the same as Firefox, i.e. with a large command line, as if it were being used as the engine behind a browser GUI:

    Code:
    wget -p -k \
         --referer="h**p://guidatv.sky.it" \
         --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" \
         --header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \
         --header="Accept-Language: en-us,en;q=0.5" \
         --header="Accept-Encoding: gzip,deflate" \
         --header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" \
         --header="Keep-Alive: 300" \
         -dnv h**p://guidatv.sky.it/guidatv/grid.html
    Running this, you can see it skipping JavaScript and CSS in the debug output, and also misinterpreting some JavaScript '#' URLs. I am at a loss now; I can get everything but the data!

    Adding the -H and -E switches results in 1 KB of extra data, but not the information wanted.

    This also applies to anyone wanting to view these kinds of sites offline. It seems that somehow I will have to convert the links to absolute ones after working out how they are built.

    There must be a simple method, or perhaps each site differs.
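    Since wget and curl never run a page's JavaScript, the generated data has to be pulled from wherever the script itself fetches it. A rough way to look for that source, assuming the files were grabbed with the -p -k -E -H command above (the directory name follows wget's default host-named layout, and the grep pattern is only a guess), is:
    Code:
    # search the downloaded JavaScript for likely data endpoints (XML/JSON/ajax calls)
    grep -ri 'xml\|json\|ajax' --include='*.js' guidatv.sky.it/

    # whatever URL turns up can then be fetched directly and cleaned up, e.g.
    # wget -q -O- 'h**p://...data-url...' | html2text > data.txt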
