  1. #1
    Just Joined!
    Join Date
    Oct 2004
    Posts
    8

    Does Sed Search/Replace Work For Multiple Matches On The Same Line?



    Hello,

    I would like to delete all the footnotes in all my htm files. To do that, I have to delete each whole font tag pair, i.e. everything between the opening and closing font tags, tags included.

    For testing, I created a testfile whose four lines are identical except for the number of font tag pairs:
    # sed -n 'l' testfile
    1. The quick brown fox jumps over the lazy dog. -- this is correct li\
    ne<br>$
    2. The quick<FONT SIZE="1"><SUB>Footnote_1</SUB></FONT> brown fox jum\
    ps over the lazy dog. -- 1x FONT Tag Pairs<br>$
    3. The quick<FONT SIZE="1"><SUB>Footnote_1</SUB></FONT> brown fox jum\
    ps<FONT COLOR="RED"><SUB>Footnote_2</SUB></FONT> over the lazy dog. -\
    - 2x FONT Tag Pairs<br>$
    4. The quick<FONT SIZE="1"><SUB>Footnote_1</SUB></FONT> brown fox jum\
    ps<FONT COLOR="RED"><SUB>Footnote_2</SUB></FONT> over the lazy<FONT F\
    ACE="COURIER"><SUB>Footnote_3</SUB></FONT> dog. -- 3x FONT Tag Pairs<\
    br>$
    #

    Then I ran sed to remove the font tag pairs as follows.
    # sed -e 's/<FONT.*<\/FONT>//g' testfile
    1. The quick brown fox jumps over the lazy dog. -- this is correct line<br>
    2. The quick brown fox jumps over the lazy dog. -- 1x FONT Tag Pairs<br>
    3. The quick over the lazy dog. -- 2x FONT Tag Pairs<br>
    4. The quick dog. -- 3x FONT Tag Pairs<br>
    #

    The above sed script works only for the line containing a single font tag pair. It looks to me as though sed only finds the longest match on the line. For lines containing more than one font tag pair, my sed script doesn't work: it deletes useful data on the line. What am I missing?

    Please help!!!

    Thank you very much for your assistance.

    Best Regards,
    cibalo

  2. #2
    Just Joined!
    Join Date
    Apr 2009
    Posts
    33
    I'm not sure I understand what you're saying.
    Do you want to delete all '<FONT...>' and '</FONT>' tags and leave the rest as is?
    Provided that the opening '<' and the closing '>' of each tag are on the same line,
    this
    sed 's?</*[fF][oO][nN][tT][^>]*>??g'
    should work not only for '<FONT...>' and '</FONT>'
    but also for '<font...>', '<Font...>', etc.
    That would remove all the font tags.
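For illustration, here is that command run against line 2 of cibalo's test file. This is only a sketch; the variable names `line` and `stripped` are placeholders. Note that it strips the tags themselves and leaves whatever sits between them, so the footnote text inside <SUB>...</SUB> stays on the line.

```shell
# Line 2 of the sample file from the first post.
line='2. The quick<FONT SIZE="1"><SUB>Footnote_1</SUB></FONT> brown fox jumps over the lazy dog. -- 1x FONT Tag Pairs<br>'

# Remove every opening or closing font tag, in any letter case.
# '?' is used as the s-command delimiter so '/' needs no escaping.
stripped=$(printf '%s\n' "$line" | sed 's?</*[fF][oO][nN][tT][^>]*>??g')

printf '%s\n' "$stripped"
# prints: 2. The quick<SUB>Footnote_1</SUB> brown fox jumps over the lazy dog. -- 1x FONT Tag Pairs<br>
```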

  3. #3
    Linux Newbie tetsujin's Avatar
    Join Date
    Oct 2008
    Posts
    117
    Quote Originally Posted by cibalo View Post
    The above sed script works only for the line containing a single font tag pair. It looks to me as though sed only finds the longest match on the line. For lines containing more than one font tag pair, my sed script doesn't work: it deletes useful data on the line. What am I missing?
    This is true. When you use a pattern like .*, the regex algorithm takes by default what's known as a "greedy" approach - that pattern can consume 0 characters or all the remaining characters on the line. How many characters it actually consumes is a function of where it's able to find the remainder of the pattern - the "greedy" approach is to start with the maximum number of characters and work backward until it finds a match... Therefore a pattern like "<font.*</font>" finds the first occurrence of <font and the last occurrence of </font> and matches everything in between.

    I don't know how you get "sed" to do what you want... But if you're willing to work in Perl, www.bayview.com/blog/2003/02/12/non-greedy-regular-expressions - this page explains how to get the Kleene Star in Perl regex to not be greedy: so you could use something like this:

    perl -pe 's#<font.*?</font>##gi;'

    Notes:
    • Perl allows you to use a different character in place of slash when specifying a regex (sed does, too, actually...)- in this case I used # so the slash on the close-font-tag wouldn't be affected.
    • The question-mark is what specifies that the preceding star is non-greedy. Sed does not appear to support this.
    • The "i" flag at the end of the regex does a case-insensitive match. GNU sed supports this too (as the "I" modifier of the s command), but it is a GNU extension rather than part of POSIX sed.


    (EDIT): I'd simply love to hotlink that URL for you, but some genius decided that I can't post links or anything that looks like a URL unless I've made at least fifteen posts first. I could go spam the boards for half an hour and get my post count high enough to be more helpful - for now I guess you'll have to cut-paste the URL.

    (EDIT): Another approach I thought might work, but doesn't:

    sed -e 's#\(.*\)<font.*</font>#\1#gi'

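    The idea is that the first .* is greedy, so it'll consume everything up to the last occurrence of "<font" each time it finds a match, and then delete everything up to the last following occurrence of "</font>"... But when I tested it, it only got the last occurrence; despite the "g" flag it didn't get any others. That makes sense in hindsight: the first match starts at the beginning of the line and runs through the last "</font>", so once it has been replaced there is no font tag left in the rest of the line for "g" to find.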
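One possible way around that in sed, offered as a sketch rather than a recipe for real HTML: drop the "g" flag and loop instead, repeating the substitution until it stops matching. The example assumes uppercase tags, as in cibalo's test file, and no nested font tags; with nested tags it would not behave like a non-greedy match. The variable names are placeholders.

```shell
# Line 4 of the sample file from the first post.
line='4. The quick<FONT SIZE="1"><SUB>Footnote_1</SUB></FONT> brown fox jumps<FONT COLOR="RED"><SUB>Footnote_2</SUB></FONT> over the lazy<FONT FACE="COURIER"><SUB>Footnote_3</SUB></FONT> dog. -- 3x FONT Tag Pairs<br>'

# ':a' defines a label; 'ta' jumps back to it whenever the s command
# made a substitution, so one pair is removed per pass, last pair first.
cleaned=$(printf '%s\n' "$line" |
  sed -e ':a' -e 's#\(.*\)<FONT.*</FONT>#\1#' -e 'ta')

printf '%s\n' "$cleaned"
# prints: 4. The quick brown fox jumps over the lazy dog. -- 3x FONT Tag Pairs<br>
```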

    Also, to get this whole thing to be as reliable as possible in the face of possible nested font tags, what you'd really want to do is repeatedly remove everything from the last occurrence of "<font" to the first following occurrence of "</font>"... That way the start and end of the range you're removing would always match. I tried this pattern in Perl to get that behavior:

    perl -pe 's#(.*)<font.*?</font>#$1#gi;'

    but it didn't work... As with the "sed" pattern, above, it just got one match and that was it... To get the desired behavior I had to do this:

    perl -pe 'while (s#^(.*)<font.*?</font>#$1#i) {}'

    This correctly processed, for instance, "a<font>b<font>c</font>d</font>e<font>f</font>g" to yield "aeg" - my first Perl solution would have yielded "ad</font>eg"...

    As with all the solutions presented so far, mine rely on the open and close font tags being on the same line... In Perl you could solve that by buffering the whole file into a variable and then doing the while loop on it...

  5. #4
    Just Joined!
    Join Date
    Apr 2009
    Posts
    33
    Quote Originally Posted by vonbiber View Post
    I'm not sure I understand what you're saying.
    Do you want to delete all '<FONT...>' and '</FONT>' tags and leave the rest as is?
    Provided that the opening '<' and the closing '>' of each tag are on the same line,
    this
    sed 's?</*[fF][oO][nN][tT][^>]*>??g'
    should work not only for '<FONT...>' and '</FONT>'
    but also for '<font...>', '<Font...>', etc.
    That would remove all the font tags.
    With this it doesn't matter whether the opening <font ...> and the closing </font>
    tags are on the same line; all the <font ...> and </font> tags are removed.
    The only requirement is that no single tag is split across lines, like
    <font
    ....>

    In that case there's still a way to do it, but it takes a sed script
    (that's what I do when I have to clean up html files).
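vonbiber did not post that script, so the following is only a guess at what such a script might look like, not his actual method. It assumes GNU sed, where [^>] also matches a newline inside the pattern space: whenever a line ends inside an unclosed '<...', the next line is appended before the tags are stripped. The input and variable names are invented for the example.

```shell
# Hypothetical input where the opening tag itself is split across two lines.
input='The quick<FONT
SIZE="1"><SUB>Footnote_1</SUB></FONT> brown fox<br>'

# While the pattern space ends inside an unclosed '<...', append the next
# line (N) and check again; then strip the font tags as before.
# '$!' keeps N and the branch from running on the last input line.
result=$(printf '%s\n' "$input" |
  sed -e ':a' -e '/<[^>]*$/{' -e '$!N' -e '$!ba' -e '}' \
      -e 's?</*[fF][oO][nN][tT][^>]*>??g')

printf '%s\n' "$result"
# prints: The quick<SUB>Footnote_1</SUB> brown fox<br>
```

As with the one-liner, only the tags are removed; the text between them is kept.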

  6. #5
    Just Joined!
    Join Date
    Oct 2004
    Posts
    8
    Hello vonbiber, tetsujin!

    Thank you guys for replying to my post.

    Best Regards,
    cibalo
