  1. #1
    Just Joined!
    Join Date
    Aug 2011
    Posts
    2

    Help with web data extractor!

    Hi all,

    I have been trying to build or download a web data extractor for Linux (I'm using Solaris right now), and I have been unable to find one that works.

    For simple websites, "lynx -dump website > file.txt" works well and gives me what I want. However, for more complicated sites, such as ones with internal windows that change without the URL changing, I have been unable to extract all the info I want with lynx.
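    For reference, a minimal sketch of the lynx approach wrapped in a small function. The function name dump_text and the example.com URL are placeholders, not anything from a real tool. The -nolist option leaves out the numbered list of links that lynx otherwise appends to the dump. Since lynx does not run JavaScript, content that a page fills in by script will probably still be missing.

```shell
# Sketch: dump the rendered text of one URL into a file.
# -dump   print the formatted page to stdout instead of browsing it
# -nolist omit the numbered list of links at the end of the dump
dump_text() {
    url=$1
    out=$2
    lynx -dump -nolist "$url" > "$out"
}

# Example call (placeholder URL):
# dump_text "http://example.com/" page.txt
```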

    I know that there are ways to do this with PHP, but I know absolutely no PHP, and learning it just to create a simple spider/data extractor feels like an inefficient use of time.

    I would prefer to download a simple data extractor that can be run from the command line if possible. I don't need links, images, or audio. I just need text.

    If anyone knows of a web data extractor or has a clever technique for downloading 'all' text data from a website and storing it in a file I would be so unbelievably happy.

    Thanks in advance.

  2. #2
    Linux Guru
    Join Date
    May 2011
    Posts
    1,842
    Try wget. It has a million options for downloading websites.

    Install it, if it is not installed already, and read the man page:
    Code:
    man wget

  3. #3
    Just Joined!
    Join Date
    Aug 2011
    Posts
    2
    I've tried playing around with wget, but as far as I can tell it can't do what I want. I don't care about the source code, the HTML behind the site, the link structure of the website, or the ability to load it offline. All I want is the ASCII text printed on the website. If anyone knows how to do this with wget, let me know =).

    I basically just want all of the text on a site to be copied and pasted into a file from the terminal (preferably with a one liner or quick program).

  4. #4
    Linux Guru
    Join Date
    May 2011
    Posts
    1,842
    Sorry, I should have been more clear. At least this is what I've tried in the past, and it works for me:

    Run the files that wget downloaded through tidy, then run tidy's output through html2text.py.
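
    Roughly, that pipeline could look like the sketch below. It assumes html2text.py sits in the current directory and reads HTML on stdin when given no arguments; the HTML2TEXT variable and both function names are made up for this example, so adjust them to your setup.

```shell
# Assumed location of html2text.py; override HTML2TEXT if it lives elsewhere.
HTML2TEXT=${HTML2TEXT:-python html2text.py}

# Convert one saved HTML file to plain text on stdout.
html_file_to_text() {
    # tidy repairs malformed markup before conversion.
    # -q suppresses the summary; --show-warnings no drops per-line warnings.
    # $HTML2TEXT is left unquoted on purpose so "python html2text.py" splits
    # into a command and its argument.
    tidy -q --show-warnings no "$1" 2>/dev/null | $HTML2TEXT
}

# Convert every .html file under a directory, writing NAME.html.txt beside it.
convert_tree() {
    find "$1" -type f -name '*.html' | while IFS= read -r f; do
        html_file_to_text "$f" > "$f.txt"
    done
}

# Example call on a directory that wget filled:
# convert_tree pages
```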
