- 09-25-2012 #1Just Joined!
- Join Date
- Sep 2012
- Posts
- 3
HTML manipulation
Hi, I need to write a Linux script (or use a Linux tool) that will do the following:
1. parse a URL (an HTML page)
2. extract data based on a regexp
3. use this data in a new HTML query and write the result to a new HTML document.
For example:
1. parse a URL (code example):
code code code code id=185646 code code code
2. from the code above, using a regexp, extract an id number, for example: 185646
3. parse some_url/<extracted id> using a regexp (in this case it would be some_url/185646)
4. repeat step 3 for each extracted id and write the result to an HTML doc
Hope that was clear enough :P
Thanks!
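The steps above can be roughly sketched in plain shell. This is only an illustration under assumptions: the sample page, the `extract_ids` and `build_report` helper names, and the use of `some_url` as a base are placeholders, not anything from a real site. Fetching each `some_url/<id>` page is left out because it needs network access; the sketch only shows extracting the ids and writing one entry per id into a new HTML document.

```shell
#!/bin/sh
# Stand-in for the downloaded page; in real use this text would come from
# something like: curl -s "$page_url"   (page_url is a placeholder).
page='<a href="x?id=185646">a</a>
<a href="x?id=185647">b</a>'

# Steps 1-2: pull every number that follows "id=" out of the HTML on stdin.
# Note: this simple pattern also matches names ending in "id=" (e.g. "mid=").
extract_ids() {
    grep -oE 'id=[0-9]+' | cut -d= -f2 | sort -u
}

# Steps 3-4: read ids from stdin and emit one list entry per id
# inside a small HTML document. $1 is the base URL (placeholder).
build_report() {
    base_url=$1
    echo '<html><body><ul>'
    while IFS= read -r id; do
        echo "<li><a href=\"$base_url/$id\">$id</a></li>"
    done
    echo '</ul></body></html>'
}

report=$(printf '%s\n' "$page" | extract_ids | build_report 'some_url')
echo "$report"
```

Redirecting the last `echo` to a file (for example `> result.html`) would give the new HTML document from step 4.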
- 09-25-2012 #2
hi and welcome
My suggestion would be to use Ruby and the Nokogiri gem.
Since HTML is already structured, you can take advantage of that by setting a proper CSS/XPath search path via Nokogiri.
You need not rely on a regex that might or might not match.
Once you have all the elements, you can construct a new HTML object and insert them there.
- 09-26-2012 #3Just Joined!
- Join Date
- Dec 2009
- Location
- California
- Posts
- 89
I'm a bit confused about the task. Are you supposed to parse an HTML page or a URL?
The former is non-trivial unless you make some assumptions about the format of the HTML. The only sure way to do it is to read through the entire page character by character, storing the current state of the parser and acting differently depending on that state. This is called a finite state machine.
To parse a URL, you would probably do it in exactly the same way. The URL will look something like this:
IANA — Example domains
There are two cases that you need to be aware of: the case where id is the first entry after the path, and the case where it is a subsequent entry after the path.
I would use the UNIX command sed to remove the parts of the URL in which you have no interest.
$ echo "$URL" | sed -e 's/.*?id=//'
The above command removes the preamble from the URL in the case where id comes immediately after the path (in sed's default basic regex syntax the ? is a literal question mark). The rest is left to the student.
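For reference, one possible way to cover both cases with a single sed expression is sketched below. It assumes the id is purely numeric; the two URLs and the `extract_id` helper name are made up for illustration (only the value 185646 comes from the thread).

```shell
#!/bin/sh
# Hypothetical URLs for illustration only.
url_first='http://www.example.com/page.php?id=185646&lang=en'
url_later='http://www.example.com/page.php?lang=en&id=185646'

# Print the numeric value of the "id" query parameter, whether it follows
# "?" (first parameter) or "&" (a later parameter).
# With -n and the p flag, nothing is printed when there is no match.
extract_id() {
    printf '%s\n' "$1" | sed -n 's/.*[?&]id=\([0-9][0-9]*\).*/\1/p'
}

id_first=$(extract_id "$url_first")
id_later=$(extract_id "$url_later")
echo "$id_first"
echo "$id_later"
```

Requiring `?` or `&` directly before `id=` keeps the pattern from matching other parameter names that merely end in "id".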
- 09-29-2012 #4Just Joined!
- Join Date
- Sep 2012
- Posts
- 3
Hi, thanks for your replies, guys.
abarclay > I am trying to parse an HTML page (web page).
I am new to programming and scripting, so I am still trying to figure out how to use Nokogiri. After I install it, do I just use it as a bash script?
****Update: I found this great Ruby tutorial, httb://ruby.bastardsbook.com/ so I think I'm on the right path to knowing Ruby and Nokogiri (eventually).
Meanwhile, I am trying a simpler way to accomplish my task, using the curl command:
curl httb://subtitle.co.il/browsesubtitles.php?cs=movies | grep -o '(?<=list.php\?mid=)\d+' > parse.html
This was supposed to list all relevant ids to an HTML file, but the file turned out empty. I am sure my regexp is OK; the problem is that grep won't use regexp, only text. I know that because I tried:
curl httb://subtitle.co.il/browsesubtitles.php?cs=movies | grep -o 'subtitle' > parse.html
which also gave me an empty file, and:
curl httb://subtitle.co.il/browsesubtitles.php?cs=movies | grep -o "subtitle" > parse.html
which gave me a list of all "subtitle" instances on the web page.
Any idea why grep won't take regexp?
Thanks
Last edited by buntuser; 09-29-2012 at 01:19 AM.
- 10-01-2012 #5Just Joined!
- Join Date
- Sep 2012
- Posts
- 3
I'm replying to my own post, just in case someone finds it useful.
I got grep to work with the regexp by using the -P switch.
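A small sketch of why the switch matters, for anyone landing here later: `\d` and the lookbehind `(?<=...)` are Perl-style syntax, which GNU grep only interprets with `-P`; without it the pattern is read as a basic regular expression and finds nothing. The sample HTML line is invented for illustration (the real page is not reproduced in this thread), and `-P` assumes a GNU grep built with PCRE support.

```shell
#!/bin/sh
# Invented sample standing in for the downloaded page.
sample='<a href="list.php?mid=185646">one</a> <a href="list.php?mid=20731">two</a>'

# Default (basic regex) mode: the Perl-style pattern matches nothing here.
# "|| true" keeps the script going, since grep exits non-zero on no match;
# stderr is discarded because newer GNU grep warns about the stray "\d".
ids_basic=$(printf '%s\n' "$sample" | grep -o '(?<=list.php\?mid=)\d+' 2>/dev/null || true)

# Perl-compatible mode: the lookbehind and \d work, so -o prints only the
# digits, one match per line.
ids_perl=$(printf '%s\n' "$sample" | grep -oP '(?<=list\.php\?mid=)\d+')

echo "basic: [$ids_basic]"
echo "perl:"
echo "$ids_perl"
```

Escaping the dot in `list\.php` is a small tightening of the original pattern so that it matches only a literal dot.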


