  1. #1
    Just Joined!
    Join Date
    Mar 2009
    Posts
    31

    Stripping links from web page

    I am having problems with a script I am working on and was looking for some help. The script grabs the source code of a Google search results page. I would like to find out how to strip the links from that source, along with each link's name.

    example:
    some stuff <a href="www.somedomain.com/webpage.html"> Click Here </a> some more stuff...

    I would like to read both the "www.somedomain.com/webpage.html" and the "Click Here" part out of the source, for every link. If you take a look at the source of a Google search page, it is unformatted and has multiple links per line. I was trying to use a combination of sed and awk to do this, but no matter how I go about it I keep getting stuck. Can anyone suggest the best way to do this?


    Thanks,
    John

  2. #2
    Linux Newbie Ziplock's Avatar
    Join Date
    Jan 2009
    Location
    Adelaide
    Posts
    169
    Perl should be able to do it for you. The following module is probably a good place to start:

    HTML::LinkExtor

    There may be others.

  3. #3
    Trusted Penguin elija's Avatar
    Join Date
    Jul 2004
    Location
    Either at home or at work or down the pub
    Posts
    2,300
    This regexp should give you a start

    Code:
    <a[^>]*>
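    As a rough illustration of that pattern in use, the sketch below feeds it to grep. The sample line is made up for the example, and the -o option (print only the matching part, one match per line) is assumed to be available, as it is in GNU and BSD grep.

```shell
# Sample line standing in for the fetched page source (made-up data).
html='some stuff <a href="www.site1.com"> Link1 </a> more <a href="www.site2.com"> Link 2 </a> end'

# -o prints each match on its own line, so several links on one
# input line come out as one opening <a ...> tag per output line.
tags=$(printf '%s\n' "$html" | grep -o '<a[^>]*>')
printf '%s\n' "$tags"
```

    This only isolates the opening tags; the URL and the link text still have to be cut out in a further step.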

  4. #4
    Just Joined!
    Join Date
    Mar 2009
    Posts
    31
    Thanks for the replies. I had one more question. Is there a regular expression that matches only the first occurrence of something? Say, in

    stuff <a href="www.site1.com"> Link1 </a> more content <a href="www.site2.com"> Link 2 </a> and more stuff

    is there a way to match everything before the first link, plus the start of the link itself? Something like

    sed 's/.*<a href=\"//' -- this doesn't work, though; I already tried it. It's just a rough example

    so that it would output

    www.site1.com"> Link1 </a> more content <a href="www.site2.com"> Link 2 </a> and more stuff

    Thanks,
    John

  5. #5
    Linux Newbie Ziplock's Avatar
    Join Date
    Jan 2009
    Location
    Adelaide
    Posts
    169
    That probably gives you:

    www.site2.com">....

    The regular expression '.*' is GREEDY (takes as much as it can from the string). A ? character reverses this. What you are looking for is:

    Code:
    sed 's/.*?<a href=\"//'

  6. #6
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi, ziplock.
    Quote Originally Posted by Ziplock View Post
    That probably gives you:

    www.site2.com">....

    The regular expression '.*' is GREEDY (takes as much as it can from the string). A ? character reverses this. What you are looking for is:

    Code:
    sed 's/.*?<a href=\"//'
    Possibly a typo, the minimal *? quantifier works in perl, but not in GNU/sed 4.1.2 & 4.1.5 that I use. Does yours work differently? ... cheers, drl
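    One possible workaround in sed itself, sketched under the assumption of GNU sed (which accepts \n in the replacement text): since s/// without the g flag only touches the first match on a line, that first match can be turned into a marker and everything up to the marker removed. The variable names are only for the example.

```shell
line='stuff <a href="www.site1.com"> Link1 </a> more content <a href="www.site2.com"> Link 2 </a> and more stuff'

# Step 1: without the g flag, s/// replaces only the FIRST '<a href="'
#         on the line; mark that spot with a newline, a character that
#         cannot already be present in a single input line.
# Step 2: delete everything up to and including that marker.
# Assumes GNU sed; other seds may not accept \n in the replacement.
rest=$(printf '%s\n' "$line" | sed 's/<a href="/\n/; s/.*\n//')
printf '%s\n' "$rest"
```

    Lines without a link pass through unchanged, because step 2 finds no marker to delete.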

  7. #7
    Linux Newbie Ziplock's Avatar
    Join Date
    Jan 2009
    Location
    Adelaide
    Posts
    169
    Quote Originally Posted by drl View Post
    Hi, ziplock.

    Possibly a typo, the minimal *? quantifier works in perl, but not in GNU/sed 4.1.2 & 4.1.5 that I use. Does yours work differently? ... cheers, drl
    You're totally right drl, sorry about that - I use perl for most things, so sorry if I've confused anyone. I had a go at getting this to work in sed without any success.

    The following will work, but is inefficient in that it starts a separate perl interpreter each time the pipeline runs (the while loop itself handles every input line within that one process):

    Code:
     <pipe stuff into> | perl -e 'while (<STDIN>) {s/^.*?href=\"//; print $_}'
    The other way is to do the whole thing within perl (or any other language): send query, get response, parse output, reformat, output to browser. This is how I would be doing it, given a choice.
