- 08-31-2011 #1 Just Joined!
- Join Date: Aug 2011
- Posts: 2
Help with web data extractor!
Hi all,
I have been trying to build or download a web data extractor for Linux (I'm using Solaris right now), and I have been unable to find one that works.
For simple websites, "lynx -dump website > file.txt" works well and basically gives me what I want. However, for more complicated sites, such as ones that have internal windows that change without the URL changing, I have been unable to extract all the info I want with lynx.
I know that there are ways to do this with PHP, but I know absolutely no PHP and feel like learning it just to create a simple spider/data extractor would be an inefficient use of time...
I would prefer to download a simple data extractor that can be run from the command line if possible. I don't need links, images, or audio. I just need text.
If anyone knows of a web data extractor or has a clever technique for downloading 'all' text data from a website and storing it in a file I would be so unbelievably happy.
Thanks in advance.
- 08-31-2011 #2 Linux Guru
- Join Date: May 2011
- Posts: 1,842
Try wget. It has a million options for downloading websites.
Install it if it is not installed already, and read the manual:
Code:
man wget
- 08-31-2011 #3 Just Joined!
- Join Date: Aug 2011
- Posts: 2
I've tried playing around with wget, but as far as I can tell it can't do what I want. I don't care about the source code, the HTML behind the site, the link structure of the website, or the ability to load it offline. All I want is the ASCII text printed on the website. If anyone knows how to do this with wget, let me know =).
I basically just want all of the text on a site to be copied and pasted into a file from the terminal (preferably with a one liner or quick program).
- 08-31-2011 #4 Linux Guru
- Join Date: May 2011
- Posts: 1,842
Sorry, I should have been more clear. At least this is what I've tried in the past, and it works for me:
Run the files that wget downloads through tidy, then run the result through html2text.py.
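If tidy or html2text.py are not available, the same "fetch the page, throw away the markup, keep the text" idea can be roughly sketched with only the Python standard library. This is not what html2text.py does internally; it swaps in html.parser as a stand-in, and the names TextExtractor, extract_text, and fetch_text are made up for illustration. Like lynx -dump, it only sees the HTML the server sends, so text that a page loads later with JavaScript will probably still be missing.

```python
import sys
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_text(url):
    """Download a page and return its text (no JavaScript is executed)."""
    with urllib.request.urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    return extract_text(html)


# Only runs when a single URL is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    print(fetch_text(sys.argv[1]))
```

Saved under a hypothetical name such as extract_text.py, it could be used much like the lynx one-liner: "python extract_text.py http://example.com > file.txt".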