  1. #1

    HTML manipulation

    hi, i need to write a linux script / use a linux tool that will do the following:

    1. parse a url (html page)
    2. extract data based on a regexp
    3. use this data in a new http request and write the result to a new html document.

    for example:

    1. parse a url whose source contains, for example:
    code code code code id=185646 code code code

    2. from the code above, using a regexp, extract an id number, for example: 185646
    3. parse some_url/<extracted id> using a regexp (in this case it would be: some_url/185646)
    4. repeat step 3 for each extracted id and write the results to an html doc

    hope it was clear enough :P
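    the steps above could be sketched in bash roughly like this. everything here is a placeholder: the sample page string, the list.php?mid= pattern, and parse.html are made up, and in real use the two commented curl lines would replace the hard-coded text (assuming GNU grep for the -P option):

    ```shell
    # stand-in for:  page=$(curl -s "$LIST_URL")
    page='code <a href="list.php?mid=185646">a</a> code <a href="list.php?mid=271828">b</a>'

    # step 2: extract every id with a Perl-style regexp (GNU grep)
    ids=$(printf '%s\n' "$page" | grep -oP '(?<=list\.php\?mid=)\d+')

    # steps 3+4: fetch something per id and collect the results in a new html doc
    {
        echo '<html><body>'
        for id in $ids; do
            # stand-in for:  curl -s "$ITEM_URL/$id"
            echo "<p>result for id $id</p>"
        done
        echo '</body></html>'
    } > parse.html
    ```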


  2. #2
    Irithori
    hi and welcome

    My suggestion would be to use ruby and the nokogiri gem.
    Since html is already structured, you may as well take advantage of that by setting a proper css/xpath search path via nokogiri.
    That way you need not rely on a regex, which might or might not match.

    Once you have all the elements, you can construct a new html object and insert them there.

  3. #3
    abarclay
    I'm a bit confused about the task. Are you supposed to parse an HTML page or are you supposed to parse a URL?

    The former is non-trivial unless you make some assumptions about the format of the html. The only sure way to do it is to read through
    the entire page character by character, storing the current state of the parser and acting differently depending on that state. This is called a finite state machine.
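    a toy version of such a character-by-character state machine, in pure bash (the input line and the "id=" marker are made up for illustration, not taken from the OP's actual page):

    ```shell
    # two states: "scan" (looking for the literal marker "id=") and
    # "num" (collecting the digits that follow it)
    input='code code id=185646 code id=99 code'
    state=scan
    buf=''
    ids=()

    for ((i = 0; i < ${#input}; i++)); do
        c=${input:i:1}
        case $state in
            scan)
                if [ "${input:i:3}" = "id=" ]; then
                    state=num
                    buf=''
                    ((i += 2))   # skip past "id="
                fi
                ;;
            num)
                if [[ $c == [0-9] ]]; then
                    buf+=$c
                else
                    ids+=("$buf")   # a non-digit ends the current id
                    state=scan
                fi
                ;;
        esac
    done
    [ "$state" = num ] && ids+=("$buf")   # handle an id at end of input

    echo "${ids[@]}"   # prints: 185646 99
    ```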

    To parse a URL, you would probably do it in exactly the same way... The url will look something like this:
    http://www.example.com/?id=185646
    There are two cases that you need to be aware of - the case where id is the first entry after the path and the case where it is a subsequent entry after the path.
    I would use the UNIX command sed to remove the parts of the URL for which you have no interest.

    $ echo "$URL" | sed -e 's/.*?id=//'

    The above command removes the preamble from the URL in case where id comes immediately after the path. The rest is left to the student.
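    for completeness, both cases can be handled with one extra character class ([?&] matches either separator) plus a second expression to drop any trailing parameters. the example.com urls here are just placeholders:

    ```shell
    URL1='http://www.example.com/page.php?id=185646'
    URL2='http://www.example.com/page.php?foo=1&id=185646&bar=2'

    # [?&]id= matches the id whether it is the first or a later parameter;
    # the second expression strips everything from the first non-digit on
    echo "$URL1" | sed -e 's/.*[?&]id=//' -e 's/[^0-9].*//'
    echo "$URL2" | sed -e 's/.*[?&]id=//' -e 's/[^0-9].*//'
    # both print: 185646
    ```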

  4. #4
    hi, thanks for your replies guys,
    abarclay > i am trying to parse an html page (web page)

    i am new to programming and scripting, so i am still trying to figure out how to use nokogiri. after i install it, do i just use it from a bash script?
    ****update: i found this great ruby tutorial, httb:// , so i think im on the right path to knowing ruby and nokogiri (eventually)
    meanwhile, i am trying to use a simpler way to accomplish my task, using curl command :
    curl httb:// | grep -o '(?<=list.php\?mid=)\d+' > parse.html
    this was supposed to list all relevant id's to an html file, but it turned out empty. i am sure my regexp is ok, the problem is that grep won't use regexp, only text. i know that because i tried:
    curl httb:// | grep -o 'subtitle' > parse.html which also gave me an empty file, and:
    curl httb:// | grep -o "subtitle" > parse.html which gave me a list of all "subtitle" instances on the web page.

    any idea why grep won't take regexp?
    Last edited by buntuser; 09-29-2012 at 01:19 AM.

  5. #5
    replying to my own post, just in case someone finds it useful.
    i got the grep with regexp to work using the -P switch.
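    i.e. the earlier command works once -P is added. tried here against a sample line instead of the real page, since the url above is a placeholder:

    ```shell
    line='code <a href="list.php?mid=185646">x</a> code'
    # -P enables perl-compatible regexps, so the (?<=...) lookbehind is understood;
    # -o prints only the matched part instead of the whole line
    echo "$line" | grep -oP '(?<=list\.php\?mid=)\d+'
    # prints: 185646
    ```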
