- 03-22-2009 #1Just Joined!
- Join Date
- Mar 2009
- Posts
- 31
Stripping links from web page
I am having problems with a script I am working on and am looking for some help. The script grabs the source code of a Google search page. I would like to find out how to strip the links from the source, along with each link's name.
example:
some stuff <a href="www.somedomain.com/webpage.html"> Click Here </a> some more stuff...
I would like to read both the "www.somedomain.com/webpage.html" and "Click Here" parts out of the source for all links. If you take a look at the source of a Google search, the page is unformatted and has multiple links per line. I was trying to use a combination of sed and awk to do this, but no matter how I go about it I keep getting stuck. Can anyone suggest the best way to do this?
Thanks,
John
- 03-23-2009 #2
Perl should be able to do it for you. The following module is probably a good place to start:
HTML::LinkExtor
There may be others.
- 03-23-2009 #3
This regexp should give you a start
Code:
<a[^>]*>
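Building on that regexp, here is a minimal sketch of one way to pull out both the URL and the link text with grep and sed. It assumes a grep that supports -o (GNU and BSD grep do), that each href value is double-quoted, and that the link text contains no nested tags. The function name extract_links is made up for illustration. Real Google markup is messier than this, so treat it as a starting point rather than a robust parser.

```shell
# extract_links: hypothetical helper, not from the thread.
# Reads HTML on stdin and prints "URL | link text", one link per line.
# Step 1 (grep -o): put every <a ...>text</a> element on its own line.
# Step 2 (sed): keep the href value (\1) and the link text (\2), then trim
#               trailing whitespace. Anchors with no visible text are left unchanged.
extract_links() {
    grep -o '<a [^>]*href="[^"]*"[^>]*>[^<]*</a>' |
    sed -e 's/.*href="\([^"]*\)"[^>]*>[[:space:]]*\([^<[:space:]][^<]*\)<\/a>/\1 | \2/' \
        -e 's/[[:space:]]*$//'
}

# Example input modelled on the first post, with two links on one line.
printf '%s\n' 'some stuff <a href="www.somedomain.com/webpage.html"> Click Here </a> some more stuff <a href="www.site2.com"> Link 2 </a>' | extract_links
```

Because grep -o prints each match on a separate line, several links on one input line stop being a problem for the sed step that follows.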
- 03-27-2009 #4Just Joined!
- Join Date
- Mar 2009
- Posts
- 31
Thanks for the replies. I have one more question. Is there a regular expression that matches only the first occurrence of something? Say, in
stuff <a href="www.site1.com"> Link1 </a> more content <a href="www.site2.com"> Link 2 </a> and more stuff
is there a way to match everything before the first occurrence, and then the occurrence itself? Something like
sed 's/.*<a href=\"//' -- this doesn't work though (I already tried it); it's just a rough example
so that it would output
www.site1.com"> Link1 </a> more content <a href="www.site2.com"> Link 2 </a> and more stuff
Thanks,
John
- 03-27-2009 #5
That probably gives you:
www.site2.com">....
The regular expression '.*' is GREEDY (takes as much as it can from the string). A ? character reverses this. What you are looking for is:
Code:
sed 's/.*?<a href=\"//'
- 03-27-2009 #6Linux Engineer
- Join Date
- Apr 2006
- Location
- Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
- Posts
- 1,117
- 03-27-2009 #7
You're totally right drl, sorry about that. I use perl for most things, so apologies if I've confused anyone. I had a go at getting this to work in sed without any success.
The following will work, but is inefficient as it has to start the perl interpreter every time you add a new line:
Code:
<pipe stuff into> | perl -e 'while (<STDIN>) {s/^.*?href=\"//; print $_}'
The other way is to do the whole thing within perl (or any other language): send query, get response, parse output, reformat, output to browser. This is how I would be doing it given a choice.
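For anyone who still wants to stay in sed: it has no non-greedy quantifier, but a substitution without the g flag only replaces the first match on a line, and that can be used as a workaround. The sketch below first shows the greedy result, then the workaround. The function names greedy and strip_to_first_href are made up for illustration, and the approach assumes the input is processed one line at a time, so the pattern space contains no newline of its own.

```shell
line='stuff <a href="www.site1.com"> Link1 </a> more content <a href="www.site2.com"> Link 2 </a> and more stuff'

# Greedy: .* runs all the way to the LAST <a href=" on the line.
greedy() {
    sed 's/.*<a href="//'
}

# Workaround: s/// without the g flag replaces only the FIRST match, so turn
# the first <a href=" into a newline marker, then delete everything up to it.
strip_to_first_href() {
    sed 's/<a href="/\
/; s/^.*\n//'
}

printf '%s\n' "$line" | greedy
# prints: www.site2.com"> Link 2 </a> and more stuff

printf '%s\n' "$line" | strip_to_first_href
# prints: www.site1.com"> Link1 </a> more content <a href="www.site2.com"> Link 2 </a> and more stuff
```

The backslash followed by a real line break in the replacement is the portable way to insert a newline. Lines without any <a href=" pass through unchanged.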


