  1. #1
    Just Joined!
    Join Date
    Mar 2007
    Posts
    4

    Awk scripting and usage of regex to locate a hyperlink

    Hello guys

    I need to write an awk script that takes an HTML page and outputs each unique http link on that page, followed by the number of times it occurs in the file.
    e.g.
    -----------------------------------------
    Webpage: index.html

    http://www.google.com/ 3
    www.supersite.com/dir/dir2/index.html 5
    -----------------------------------------

    To do that I'm thinking of using regular expressions.

    I'm using the following regex to find a hyperlink in the HTML file.


    Code:
    /<(a|A).+(href|HREF)=\"(.+?)\">/
    It outputs the whole line that contains the link. Say we have the following html code:
    --------------------------------------------
    <html>
    <p> Here is some text before the link, the <a href = "www.google.com"> link </a> Some text after the link
    </html>
    --------------------------------------------

    The output will be:
    --------------------------------------------
    Here is some text before the link, the <a href = "www.google.com"> link </a> Some text after the link
    --------------------------------------------

    What I need is to strip all the unnecessary output, leaving only the target URL of the link, so that the output would be:

    --------------------------------------------
    www.google.com
    --------------------------------------------

    I've tried using the following; however, if there are several links on a line, only the first link is found:
    { start = index($0, "<a")
    end = index($0,"\">")
    len = end - start
    print substr($0,start,len) }


    Can somebody help me please?
    Thanks

  2. #2
    Linux Guru anomie
    Join Date
    Mar 2005
    Location
    Texas
    Posts
    1,692
    Quick and dirty, using your own data provided:
    Code:
    [helen@troy ~]$ awk --version | head -1
    GNU Awk 3.1.3
    
    [helen@troy ~]$ cat some-html-file 
    <html>
    <p> Here is some text before the link, the <a href = "www.google.com"> link </a> Some text after the link
    </html>
    
    [helen@troy ~]$ awk '/a href/{ sub(/.*a href = "/, ""); sub(/".*/,""); print }' some-html-file 
    www.google.com
    If your data (HTML) doesn't strictly follow that format for <a href> tags, you'll need to tweak it. Just something to get you started.
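
    A hedged variation on the same idea, in case the spacing or case of the href attribute varies: awk's match() sets RSTART and RLENGTH, so the attribute can be cut out directly instead of deleting everything around it. Note that awk regexes are POSIX-style, so ".+?" is not a lazy match there; the sketch uses [^"]* to stop at the closing quote. The names first_link and sample.html are made up for illustration, and it assumes double-quoted attribute values only.

```shell
# Sample input taken from the thread (sample.html is a hypothetical file name).
printf '%s\n' '<html>' \
  '<p> Here is some text before the link, the <a href = "www.google.com"> link </a> Some text after the link' \
  '</html>' > sample.html

# first_link is a hypothetical helper: prints the first href value on each line.
first_link() {
  awk '
    # match() records where the regex matched in RSTART and RLENGTH
    match($0, /[Hh][Rr][Ee][Ff][ \t]*=[ \t]*"[^"]*"/) {
      attr = substr($0, RSTART, RLENGTH)   # e.g. href = "www.google.com"
      sub(/^[^"]*"/, "", attr)             # drop everything up to the opening quote
      sub(/"$/, "", attr)                  # drop the closing quote
      print attr
    }
  ' "$1"
}

first_link sample.html
```

    This still reports only one link per line, and it would also pick up href attributes on tags other than <a>.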

  3. #3
    Just Joined!
    Join Date
    Mar 2007
    Posts
    4

    reply

    This is great, thank you so much!
    There is only one problem, and I'm not sure how to fix it.

    If there are multiple links on one line, the script only outputs the last one.
    For example:

    Code:
    bla bla <a href="target1.htm">link1</a> bla bla <a href="target2.htm">link2</a> bla bla
    The output will be:
    Code:
    target2.htm
    
    and not:
    
    target1.htm
    target2.htm
    Thanks

  4. #4
    Linux Guru anomie
    Join Date
    Mar 2005
    Location
    Texas
    Posts
    1,692
    It's getting dirtier. There are cleaner ways, but if this is going to be an awk-only solution:
    Code:
    [helen@troy ~]$ cat some-hmtl-file 
    bla bla <a href="target1.htm">link1</a> bla bla <a href="target2.htm">link2</a> bla bla
    
    [helen@troy ~]$ awk '{ gsub(/<\/a>/,"\n"); print }' some-hmtl-file | \
    awk '/a href/{ sub(/.*a href ?= ?"/, ""); sub(/".*/,""); print }'
    target1.htm
    target2.htm
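
    The original question also asked for each unique link followed by its count. One possible single-pass sketch, not necessarily the cleanest: loop over each line with match(), chop off what has already been consumed, and tally the URLs in an associative array. The names count_links and links.html are hypothetical, the second input line is invented to show a repeat, and only double-quoted href values are handled.

```shell
# Hypothetical sample: the line from the thread plus one invented repeat.
printf '%s\n' \
  'bla bla <a href="target1.htm">link1</a> bla bla <a href="target2.htm">link2</a> bla bla' \
  '<a href = "target1.htm">again</a>' > links.html

# count_links is a hypothetical helper: prints "url count" for every unique href.
count_links() {
  awk '
    {
      line = $0
      # Find the next href on the line, then continue after it
      while (match(line, /[Hh][Rr][Ee][Ff][ \t]*=[ \t]*"[^"]*"/)) {
        attr = substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART + RLENGTH)   # keep only the unscanned remainder
        sub(/^[^"]*"/, "", attr)                # strip through the opening quote
        sub(/"$/, "", attr)                     # strip the closing quote
        count[attr]++
      }
    }
    END {
      # for-in order is unspecified in awk, so the output is sorted afterwards
      for (url in count) print url, count[url]
    }
  ' "$1" | sort
}

count_links links.html
```

    With the sample above this should print "target1.htm 2" and then "target2.htm 1". To list only http links, a test such as attr ~ /^http:/ before the increment would be one way to filter.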

  5. #5
    Just Joined!
    Join Date
    Mar 2007
    Posts
    4

    thanks

    Thanks for your help, anomie!
