  1. #1
    Just Joined!
    Join Date
    Jan 2008
    Posts
    1

    how to parse data from a single line

    hi,

    i'm trying to use grep and sed to parse out a "single" very long HTML line, and there are a few tags that i'm interested in from that line. this is all that i'm trying to parse out from the entire line.

    however, since it is just "one" line... i'm running into problems. if i try to delete everything that's not relevant using sed, then it deletes more than i want, because ".*" matches the longest possible string.

    using grep, if i grep for the tag that i want, then again, since it is a "single" line... it prints the entire line. i'm wondering if there is a simpler way of doing it, which i'm missing out. any ideas will be very helpful!

    thank you.

  2. #2
    wje_lf
    Linux Engineer
    Join Date
    Sep 2007
    Location
    Mariposa
    Posts
    1,192
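    No, grep on its own isn't going to get you far. It's not good at outputting parts of a given line. (GNU grep has a -o option that prints only the matched text, but it's not in POSIX, and it can't rearrange what it finds.)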

    sed is ok, but its regular expressions are limited. Allow me to drag you kicking and screaming into the world of Perl. (It won't hurt, I promise you.)

    Consider the following shell script:
    Code:
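    #!/bin/sh
    
    cat > 1.dat <<EOD
    aaa<tag1>bbb<tag2>ccc<tag3>ddd
    EOD
    
    # Examples 1 and 2: sed, greedy ".*" versus the "[^<]*" trick
    sed -e 's/^.*</</' 1.dat
    sed -e 's/^[^<]*</</' 1.dat
    echo
    # Examples 3 and 4: the same two patterns in Perl
    perl -pe 's/^.*</</' 1.dat
    perl -pe 's/^[^<]*</</' 1.dat
    echo
    # Example 5: non-greedy ".*?"
    perl -pe 's/^.*?</</' 1.dat
    echo
    # Examples 6 and 7: greedy versus non-greedy, keeping the match via $1
    perl -pe 's/^.*(<tag[23])/$1/' 1.dat
    perl -pe 's/^.*?(<tag[23])/$1/' 1.dat
    echo
    # Examples 8 and 9: angle brackets unescaped, then escaped
    perl -pe 's/^.*?</</' 1.dat
    perl -pe 's/^.*?\</\</' 1.dat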
    Run it, and you get this:
    Code:
    <tag3>ddd
    <tag1>bbb<tag2>ccc<tag3>ddd
    
    <tag3>ddd
    <tag1>bbb<tag2>ccc<tag3>ddd
    
    <tag1>bbb<tag2>ccc<tag3>ddd
    
    <tag3>ddd
    <tag2>ccc<tag3>ddd
    
    <tag1>bbb<tag2>ccc<tag3>ddd
    <tag1>bbb<tag2>ccc<tag3>ddd
    The problem with sed's regular expression engine, as you've discovered, is that its ".*" is always greedy; it consumes as much of the line as it can, as long as that doesn't interfere with successful matching.

    But, as you can see with the second example above, sed's regular expression engine can be forced to be ungreedy, as long as you're willing to stop at a single character.
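
    As a rough sketch of how far the single-character trick can be pushed in plain sed, the following pulls out the first tag name and the text right after it. It assumes the same sample line as the script above; the variable names first_tag and first_text are made up for illustration.

```shell
#!/bin/sh
# Sample line taken from the script above.
line='aaa<tag1>bbb<tag2>ccc<tag3>ddd'

# [^<]* cannot cross a "<" and [^>]* cannot cross a ">", so each piece
# stops at the first delimiter even though sed is still matching greedily.
first_tag=$(printf '%s\n' "$line" | sed -e 's/^[^<]*<\([^>]*\)>.*$/\1/')

# Same idea, but capture the text between the first tag and the next "<".
first_text=$(printf '%s\n' "$line" | sed -e 's/^[^<]*<[^>]*>\([^<]*\).*$/\1/')

echo "$first_tag"
echo "$first_text"
```

    This only works because each stopping point is a single character; it gets awkward quickly once the delimiter is a longer string.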

    The purpose of the third and fourth examples is to show you that Perl can do exactly the same stuff, with the same output.

    Clearly the [^<]* trick is a bit hokey, particularly if you want to stop at something more complex than a single character.

    Look at the fifth example. Its regular expression, unavailable in the sed that I'm running at least, has a question mark after the ".*". This instructs Perl to match not as many characters as possible and still get a match, but to match as few characters as possible and still get a match. And you'll notice that although the regular expression is almost like that in the first and third examples, its output is the same as in the second and fourth examples.

    The sixth and seventh examples show stopping at the second or third tag, depending. The sixth example, using a greedy ".*", goes up to tag3. The seventh, not being greedy (because of the question mark in the regular expression), stops at tag2.

    Notice that there's this funny $1 in the replacement part of the statement in the sixth and seventh examples. This means "replace what you found with what was between the first set of parentheses". $2 would mean the second set, and so forth.

    But what if you want to actually match against parentheses? You escape a parenthesis by putting a backslash in front of it. The rules on escaping in Perl regular expressions are quite simple:
    1. Letters and numbers don't have special meaning in regular expressions, except when preceded by a backslash (or a dollar sign). So if you precede a letter or a number with a backslash, you're asking Perl to do something special.
    2. Anything other than a letter or a number might mean something special if it is not preceded by a backslash. An example is the parentheses above. If you want to use anything other than a letter or a number without its magic meaning, precede it with a backslash. If in doubt, put the backslash there, because it is always ok to put a backslash in front of something that is not a letter or a number to give it a non-magic meaning. This is true even if it doesn't have a magic meaning anyway.

    For example, you now know that parentheses have a magic meaning, but angle brackets do not. If you want to use angle brackets as ordinary data and you can't remember whether they have a magic meaning, it never hurts to put a backslash in front of them. (That holds for Perl. In GNU sed and grep, \< and \> mean word boundaries, so don't carry the habit over.)

    The eighth and ninth examples illustrate this. The eighth example hasn't put backslashes in front of the angle brackets, and the ninth example has done that. But the output is the same.

    Hope this helps.

    Edit above is in red. Sorry about that.
    --
    Bill

    Old age and treachery will overcome youth and skill.

  3. #3
    Linux User
    Join Date
    Aug 2006
    Posts
    458
    Quote Originally Posted by new_2_html
    hi,

    i'm trying to use grep and sed to parse out a "single" very long HTML line,
    thank you.
    show that HTML line

  4. #4
    scm
    Linux Engineer
    Join Date
    Feb 2005
    Posts
    1,044
    Quote Originally Posted by wje_lf
    But, as you can see with the second example above, sed's regular expression engine can be forced to be ungreedy, as long as you're willing to stop at a single character.
    Being pedantic, you're not actually stopping sed from doing greedy matching; it's still finding the biggest match it can, you've just changed the criteria for that match.

    I agree that perl is the way to go for complex matching, though, especially as perl will always handle arbitrarily long lines, whereas some implementations of sed cannot.

  5. #5
    wje_lf
    Linux Engineer
    Join Date
    Sep 2007
    Location
    Mariposa
    Posts
    1,192
    Quote Originally Posted by scm
    it's still finding the biggest match it can, you've just changed the criteria for that match
    You're right. I was wrong to say "sed's regular expression engine can be forced to be ungreedy". I should have said:

    sed's regular expression engine can be forced to give you the results you want, just as though it were non-greedy.

    Quote Originally Posted by scm
    I agree that perl is the way to go for complex matching, though, especially as perl will always handle arbitrarily long lines, whereas some implementations of sed cannot.
    Another reason is that some implementations of sed don't have all the regular expression syntax that Perl does.

    Another reason is that some implementations of sed don't have the -i switch. (Mine doesn't; I'm running Slackware 9.1, as befits an old curmudgeon.)

    But the best reason is that we get to suck new_2_html into using Perl, which will grow on him, and his use of it will grow.

    BWUHAHAHAHAHAHA!
