  1. #1
    Just Joined!
    Join Date
    Oct 2008
    Posts
    10

    sed - using regex and | need help

    From my understanding, when using regex1|regex2 the matching process tries each alternative in turn, from left to right, and the first one that succeeds is used.
    I'm trying to extract the name from these examples:

    A) name.can.be.different.20.03.2009.boom
    B) name.can.be.different.20.03.09.boom
    C) name_can_be_different_2009_boom

    by using:

    Code:
    sed 's/\(.*\)[._]\([0-9][0-9].[0-9][0-9].[0-9][0-9][0-9][0-9]\|[0-9][0-9].[0-9][0-9].[0-9][0-9]\|[0-9][0-9][0-9][0-9]\)[._]\(.*\)/\1/'
    It gives:

    A) name.can.be.different.20.03 (WRONG)
    B) name.can.be.different (GOOD)
    C) name_can_be_different (GOOD)

    Why does it fail on A)? Given the order of the alternatives in my pattern, it is supposed to match on
    [0-9][0-9].[0-9][0-9].[0-9][0-9][0-9][0-9], but it doesn't.

  2. #2
    Trusted Penguin Cabhan
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    Regular expressions use something called greedy quantifiers. This means that a * or + will match as much as it possibly can, as long as the pattern as a whole still matches.

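    So with A), what you are seeing is that the .* grabs as much as it can while still allowing a match. In this case, that is "name.can.be.different.20.03". The rest of the pattern then matches ".2009.boom" as "[._][0-9][0-9][0-9][0-9][._]\(.*\)", using your third alternative, and all is happy. The order of the alternatives never comes into play here: once .* has taken "20.03", the first two alternatives cannot match what is left, so only the four-digit one is a candidate.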
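
    To see this for yourself, here is a minimal sketch that prints what each group matched for input A). It assumes a sed that accepts -E (extended regular expressions, as GNU and BSD sed do). The function name show_groups and the {2}/{4} interval shorthand are mine; the pattern is otherwise the one from the question:

```shell
# Print what every capture group matched, not only group 1.
show_groups() {
  printf '%s\n' "$1" | sed -E 's/(.*)[._]([0-9]{2}.[0-9]{2}.[0-9]{4}|[0-9]{2}.[0-9]{2}.[0-9]{2}|[0-9]{4})[._](.*)/name=[\1] date=[\2] rest=[\3]/'
}

show_groups 'name.can.be.different.20.03.2009.boom'
# prints: name=[name.can.be.different.20.03] date=[2009] rest=[boom]
```

    The date group only receives "2009", because the greedy .* has already consumed "20.03".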

    How to correct this? Well, Perl and Java support non-greedy quantifiers (such as .*?), which solve this problem. Without access to these (POSIX regular expressions, as used by sed, do not define them), it depends on the constraints of the input. If, for instance, "name.can.be.different" is guaranteed to never contain any digits, we could use the following regular expression:
    Code:
    sed 's/\([^0-9]*\)[._]\([0-9][0-9].[0-9][0-9].[0-9][0-9][0-9][0-9]\|[0-9][0-9].[0-9][0-9].[0-9][0-9]\|[0-9][0-9][0-9][0-9]\)[._]\(.*\)/\1/'
    The regular expression now requires that no digits appear in the first group. As a result, the "20.03" cannot possibly be included in the name.
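
    As a rough sketch of how this could be wrapped up and checked against your three samples, assuming a sed with -E and that the name part never contains a digit. The helper name extract_name, the ^ and $ anchors, and the escaped dots inside the dates are my own tightening, not part of the command above:

```shell
# Hypothetical helper: keep only the name part in front of the date.
# Assumes the name itself never contains a digit.
extract_name() {
  printf '%s\n' "$1" | sed -E 's/^([^0-9]*)[._]([0-9]{2}\.[0-9]{2}\.[0-9]{4}|[0-9]{2}\.[0-9]{2}\.[0-9]{2}|[0-9]{4})[._].*$/\1/'
}

extract_name 'name.can.be.different.20.03.2009.boom'   # name.can.be.different
extract_name 'name.can.be.different.20.03.09.boom'     # name.can.be.different
extract_name 'name_can_be_different_2009_boom'         # name_can_be_different
```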

    Another possibility: if a bare four-digit year ([0-9][0-9][0-9][0-9]) is only allowed when underscores are used as separators, you can change the regular expression to reflect this. It will mean a more complicated regular expression, but it will more accurately reflect the possible inputs (you generally want your regular expressions to be as strict as possible).
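
    One way this stricter variant might look, as a sketch only. It assumes a sed with -E, and it assumes that dotted names always carry DD.MM.YYYY or DD.MM.YY while underscored names always carry a bare YYYY; I am guessing at those rules from your three samples. The helper name extract_name_strict is hypothetical:

```shell
# Hypothetical helper with one substitution per separator style.
# Assumption: dots go with DD.MM.YYYY or DD.MM.YY, underscores with YYYY.
extract_name_strict() {
  printf '%s\n' "$1" | sed -E \
    -e 's/^(.*)\.[0-9]{2}\.[0-9]{2}\.([0-9]{4}|[0-9]{2})\..*$/\1/' \
    -e 't' \
    -e 's/^(.*)_[0-9]{4}_.*$/\1/'
}

extract_name_strict 'name.can.be.different.20.03.2009.boom'   # name.can.be.different
extract_name_strict 'name.can.be.different.20.03.09.boom'     # name.can.be.different
extract_name_strict 'name_can_be_different_2009_boom'         # name_can_be_different
```

    The bare t skips the underscore rule whenever the dotted rule has already made a substitution, so the two rules never both fire on the same line.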

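    As a general rule, it's probably not a good idea to rely on the order within an alternation in any case, as different engines do not all guarantee the same order. Perl-style engines take the first alternative that succeeds, while POSIX specifies that the leftmost-longest match wins regardless of the order in which the alternatives are written.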
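
    A small illustration of why the order is not something to lean on, assuming a POSIX-style sed with -E (the sample string "abcd" is made up for the demonstration):

```shell
# Both alternatives can match at the start of "abcd".
printf '%s\n' 'abcd' | sed -E 's/ab|abcd/X/'
# GNU sed prints: X   (the longer alternative wins, although it is listed second)
```

    An engine that takes the first alternative that succeeds, such as Perl's, would replace only "ab" and leave "Xcd".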

    Hope this helps. Let us know what you decide to do.
    DISTRO=Arch
    Registered Linux User #388732

  3. #3
    Just Joined!
    Join Date
    Oct 2008
    Posts
    10
    Thank you for your help, this makes it clearer. As the name will never contain any numbers, I've decided to use the solution you suggested.
