- 04-28-2007 #1 Just Joined!
- Join Date
- Mar 2007
- Posts
- 4
Awk scripting and usage of regex to locate a hyperlink
Hello guys
I need to write an awk script that takes an html page and outputs a list of each unique http link on that page, followed by the number of times it occurs in that file.
e.g.
-----------------------------------------
Webpage: index.html
http://www.google.com/ 3
www.supersite.com/dir/dir2/index.html 5
-----------------------------------------
To do that I'm thinking of using regular expressions.
I'm using the following regex to find a hyperlink in the html file:
Code:
/<(a|A).+(href|HREF)=\"(.+?)\">/
It outputs the whole line that contains the link. Say we have the following html code:
--------------------------------------------
<html>
<p> Here is some text before the link, the <a href = "www.google.com"> link </a> Some text after the link
</html>
--------------------------------------------
The output will be:
--------------------------------------------
Here is some text before the link, the <a href = "www.google.com"> link </a> Some text after the link
--------------------------------------------
What I need is to somehow get rid of all the unnecessary output, leaving the target url of the link and nothing else, so that the output would be:
--------------------------------------------
www.google.com
--------------------------------------------
I've tried using the following, however if there are several links on a line only the first link is found:
{ start = index($0, "<a")
end = index($0,"\">")
len = end - start
print substr($0,start,len) }
Can somebody help me please?
Thanks
- 04-29-2007 #2
Quick and dirty, using your own data provided:
Code:
[helen@troy ~]$ awk --version | head -1
GNU Awk 3.1.3
[helen@troy ~]$ cat some-html-file
<html>
<p> Here is some text before the link, the <a href = "www.google.com"> link </a> Some text after the link
</html>
[helen@troy ~]$ awk '/a href/{ sub(/.*a href = "/, ""); sub(/".*/,""); print }' some-html-file
www.google.com
If your data (html) isn't strictly following that format for <a href> tags, you'll need to tweak as needed. Just something to get you started.
- 04-29-2007 #3 Just Joined!
reply
This is great, thank you so much!
There is only one problem and I'm not sure how to fix it.
If there are multiple links on one line, the script will only output the last one.
For example:
Code:
bla bla <a href="target1.htm">link1</a> bla bla <a href="target2.htm">link2</a> bla bla
The output will be:
Code:
target2.htm
and not:
target1.htm
target2.htm
Thanks
- 04-29-2007 #4
It's getting dirtier. There are cleaner ways, but if this is going to be an awk-only solution:
Code:
[helen@troy ~]$ cat some-hmtl-file
bla bla <a href="target1.htm">link1</a> bla bla <a href="target2.htm">link2</a> bla bla
[helen@troy ~]$ awk '{ gsub(/<\/a>/,"\n"); print }' some-hmtl-file | \
    awk '/a href/{ sub(/.*a href ?= ?"/, ""); sub(/".*/,""); print }'
target1.htm
target2.htm
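For the counting part of the original question, one possible single-pass variant is sketched below. It is only a sketch under some assumptions: the function name count_links is made up for illustration, href values are assumed to be double-quoted and to sit on one line, and every href attribute is counted (not only those inside <a> tags). It loops over each line with match() so that several links on the same line are all picked up, then tallies them in an array.

```shell
# count_links FILE: print each distinct href target with its occurrence count.
# (count_links is a hypothetical helper name, not from the thread.)
count_links() {
    awk '
    {
        line = $0
        # Keep finding the next href="..." until the line has none left.
        while (match(line, /[Hh][Rr][Ee][Ff][ \t]*=[ \t]*"[^"]*"/)) {
            attr = substr(line, RSTART, RLENGTH)
            sub(/^[^"]*"/, "", attr)   # strip everything through the opening quote
            sub(/"$/, "", attr)        # strip the closing quote
            count[attr]++
            # Continue scanning after the attribute just consumed.
            line = substr(line, RSTART + RLENGTH)
        }
    }
    END {
        for (url in count)
            print url, count[url]
    }' "$1" | sort
}
```

Calling it as count_links index.html should print one "url count" pair per line. The trailing sort is only there because awk's for-in order is unspecified.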
- 04-30-2007 #5 Just Joined!
thanks
Thanks for your help, anomie!

