  1. #1
    Just Joined!
    Join Date
    Jan 2004
    Posts
    93

    Regex help -- pattern exclusion


    Hey, folks,
    I'm not sure if this is possible, but I would like to know if there's a way to exclude a pattern using a regular expression in vim/bash/perl/sed. For example, I have an HTML file and I would like to strip out all the <foo bar> </foo bar> tags except for <br>. I know [^foobar] excludes certain characters. But how do I exclude a string or even a pattern, like "br"? Of course I could first change all the <br>s to something else and change them back after I remove all the <foo bar></foo bar> tags. But I just wanna know if it's possible to exclude patterns. Plus, if I'm using vim, and I want to get rid of everything except for URL + <br>s (e.g. http://foo.bar.com<br>), how do I do that? Thanks in advance.

  2. #2
    Linux Guru sarumont
    Join Date
    Apr 2003
    Location
    /dev/urandom
    Posts
    3,682
    Code:
    /^<[^b][^r].*>/
    This will match any line that starts with a <. If you have leading whitespace, you may want to remove the leading ^ (telling it that the < will be at the beginning of the line).
    "Time is an illusion. Lunchtime, doubly so."
    ~Douglas Adams, The Hitchhiker's Guide to the Galaxy

  3. #3
    Just Joined!
    Join Date
    Jan 2004
    Posts
    93
    Quote Originally Posted by sarumont
    Code:
    /^<[^b][^r].*>/
    This will match any line that starts with a <. If you have leading whitespace, you may want to remove the leading ^ (telling it that the < will be at the beginning of the line).
    I don't think the pattern above matches <body> or <tr>, because it excludes any tag whose first character is "b" or whose second character is "r" inside <>. Any idea?

  5. #4
    Just Joined!
    Join Date
    Jan 2005
    Posts
    15

    Perl is the answer

    savage_cabbage,

    I would encourage you to try using Perl instead of sed for pattern matching and regex substitutions.

    Something like this should work:

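    use strict;
    use warnings;

    # read the file directly instead of shelling out to cat
    open(my $fh, '<', '/your/html/file') or die "Cannot open file: $!";
    my @new_lines = ();

    while (my $line = <$fh>) {
        # find each url after http:// but before the <br> tag
        while ($line =~ m{http://(.*?)<br>}gi) {
            # copy the capture: $1 is read-only and cannot be edited in place
            my $url = $1;

            # squash whitespace in the url
            $url =~ s/\s+//g;

            # push the url and a <br> tag to the new array
            push @new_lines, "http://" . $url . "<br>\n";
        }
    }
    close($fh);

    # Do stuff with @new_lines here...

    exit 0;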

  6. #5
    Just Joined!
    Join Date
    Jan 2004
    Posts
    93
    Am I able to do this using one regex? My question is generally how to exclude certain patterns using regex regardless of what program I use. A specific example would be to remove all the html tags in the text file below, except for <br>:
    Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html dir="ltr"><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><meta http-equiv="Content-Style-Type" content="text/css"><title>A Test Page</title></head>
    <body>http://blahblah.com<br>
    http://blahblah.com<br>
    </body></html>
    The result should look like:
    Code:
    A Test Page
    http://blahblah.com<br>
    http://blahblah.com<br>
    Any idea?
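
    For what it's worth, here is a minimal sketch of two ways this could be done from the shell. The first is the placeholder workaround mentioned in the opening post, written with sed; the token @@BR@@ is an assumption and must not occur in the input. The second relies on Perl's negative lookahead, which lets one substitution say "a tag that does not start with br". The function names are made up for illustration, and neither version handles tags that span several lines.

```shell
# Strip every tag except <br> by hiding <br> behind a placeholder first.
# ASSUMPTION: the token @@BR@@ never occurs in the input.
strip_tags_keep_br() {
    sed -e 's/<[bB][rR]>/@@BR@@/g' \
        -e 's/<[^>]*>//g' \
        -e 's/@@BR@@/<br>/g'
}

# Same job in one substitution, using Perl's negative lookahead:
# "<", not followed by "br" as a whole word, then everything up to ">".
strip_tags_keep_br_perl() {
    perl -pe 's/<(?!br\b)[^>]*>//gi'
}

# Shortened sample input modelled on the page above.
printf '%s\n' '<head><title>A Test Page</title></head>' \
              '<body>http://blahblah.com<br>' \
              '</body></html>' | strip_tags_keep_br
```

    Both functions read standard input, so `strip_tags_keep_br < page.html` would work the same way (page.html being a hypothetical file name). A line that held only tags comes out as an empty line.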

  7. #6
    Linux Guru sarumont
    Join Date
    Apr 2003
    Location
    /dev/urandom
    Posts
    3,682
    Quote Originally Posted by savage_cabbage
    Quote Originally Posted by sarumont
    Code:
    /^<[^b][^r].*>/
    This will match any line that starts with a <. If you have leading whitespace, you may want to remove the leading ^ (telling it that the < will be at the beginning of the line).
    I don't think the pattern above matches <body> or <tr>, because it excludes any tag whose first character is "b" or whose second character is "r" inside <>. Any idea?
    No...the regex I posted matches patterns starting with <, and *not* followed by b or r. The leading ^ is what matches the beginning of the line. The [^b] and [^r] mean "not b" and "not r". I forgot to mention, however, that this is a perl regex.

  8. #7
    Just Joined!
    Join Date
    Jan 2004
    Posts
    93
    No...the regex I posted matches patterns starting with <, and *not* followed by b or r. The leading ^ is what matches the beginning of the line. The [^b] and [^r] mean "not b" and "not r". I forgot to mention, however, that this is a perl regex.
    I still don't quite get it. I want the regex to be able to match all the html tags except for
    and
    . So it shouldn't exclude tags like <body>,

    and <tr>. The pattern you gave
    Code:
    /^<[^b][^r].*>/
    reads: match a pattern starting with "<" at the beginning of each line, followed by a character that's not "b", followed by a character that's not "r", followed by zero or more characters, and ending with ">". So when it's trying to match "<body>", it fails because "<" is followed by the character "b". A similar thing happens with <tr> because "t" is followed by an "r". It also doesn't match tags with a single character (e.g.

    ). In addition, instead of
    Code:
    /^<[^b][^r].*>/
    , you have to make the "*" LAZY, like this
    Code:
    /^<[^b][^r].*?>/
    so that it matches the first occurrence of ">" instead of the last one.
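
    The reading above can be checked quickly from the shell. This is only a sketch: the sample tags are made up, and it uses grep -E, so the lazy "?" is left out (POSIX extended regexes have no lazy quantifiers, and laziness does not change which lines match anyway).

```shell
# Sample tags, one per line (illustrative input, not from a real page).
tags='<body>
<tr>
<html>
<br>
<p>'

# Print only the lines that the posted pattern matches.
matched=$(printf '%s\n' "$tags" | grep -E '^<[^b][^r].*>')
printf '%s\n' "$matched"
```

    Of the five sample tags only <html> gets through: <body> and <br> fail on [^b], <tr> fails on [^r], and <p> is too short because [^r] consumes the ">" and nothing is left for the final ">".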

  9. #8
    Linux Enthusiast
    Join Date
    Jan 2005
    Posts
    575
    Perhaps I don't understand your question, savage, but as far as I know a regexp is only meant to match a string. If you want to do some form of substitution in addition to matching, then you need some additional programming construct. So for example in the following
    Code:
    sed -e 's/abc/cba/' file
    the regexp involved is abc, but the whole s/abc/cba/ does not count as a regular expression. So what are you willing to use in addition to regexps to perform your task?
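
    To make the distinction concrete, here is a small sketch with a made-up input line: grep uses the regexp abc only to select, while sed wraps the same regexp in its s command to rewrite.

```shell
# Illustrative input line (not from the thread).
line='xxabcxx'

# Matching only: grep uses the regexp abc to select lines; nothing is changed.
selected=$(printf '%s\n' "$line" | grep 'abc')

# Substituting: sed's s command wraps the same regexp in s/regexp/replacement/.
replaced=$(printf '%s\n' "$line" | sed -e 's/abc/cba/')

printf 'selected: %s\nreplaced: %s\n' "$selected" "$replaced"
```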

  10. #9
    Just Joined!
    Join Date
    Jan 2004
    Posts
    93
    Quote Originally Posted by Santa's little helper
    Perhaps I don't understand your question, savage, but as far as I know a regexp is only meant to match a string. If you want to do some form of substitution in addition to matching, then you need some additional programming construct. So for example in the following
    Code:
    sed -e 's/abc/cba/' file
    the regexp involved is abc, but the whole s/abc/cba/ does not count as a regular expression. So what are you willing to use in addition to regexps to perform your task?
    Sorry for not making it clear enough. Let me rephrase my question. As we know, the regexp
    Code:
    aaa[^b]ccc
    matches any string such as "aaacccc", "aaadccc" or "aaazccc", etc., but not "aaabccc". Now what if I want to exclude a string instead of only one character? Say I want to match any string that starts with "aaa", followed by any two-character string such as "ef", "fg", "gg", etc., except for the string "xy", followed by "ccc". Some examples would be to match "aaaaaccc", "aaaghccc", "aaa23ccc", etc., but not "aaaxyccc". We know that [^x] only excludes one char, "x". My question is how to exclude a string with more than one char, like "xy"?

  11. #10
    Linux Enthusiast
    Join Date
    Jan 2005
    Posts
    575
    OK, I get you now. In your example you would use something like aaa([^x].|x[^y])ccc
    Or if you want to exclude the three-character string xyz you'd write
    ([^x]..|x[^y].|xy[^z])
    This basically says:
    either the 1st character should not be x
    or
    if the 1st char is x then the 2nd should not be y
    or
    if the 1st is x and the 2nd is y then the 3rd should not be z

    You can do a similar thing with strings of any length, but it gets more and more complicated as the string gets longer. And I don't know how fast it would run either.

    Of course if you want to exclude a whole line containing a string then I'm sure all the tools you mentioned have a specific option for that.
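
    As a rough check of the two-character case, the sketch below runs the alternation through grep -E on the strings from the question, plus one extra made-up case (aaaxzccc) that starts with x but is not xy. The ^ and $ anchors are an addition so that each line is tested as a whole string.

```shell
# Candidate strings from the question, plus one extra x-prefixed case.
candidates='aaaaaccc
aaaghccc
aaa23ccc
aaaxyccc
aaaxzccc'

# Keep strings of the form aaa??ccc whose middle two characters are not "xy".
kept=$(printf '%s\n' "$candidates" | grep -E '^aaa([^x].|x[^y])ccc$')
printf '%s\n' "$kept"
```

    Every candidate except aaaxyccc is printed. In engines that support negative lookahead, such as Perl, the same condition can be written more compactly as aaa(?!xy)..ccc, and that form does not grow as the excluded string gets longer.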
