- 12-25-2008 #1 Just Joined!
- Join Date
- Oct 2008
- Posts
- 10
sed - using regex and | need help
From my understanding when using regex1|regex2 the matching process tries each alternative in turn, from left to right, and the first one that succeeds is used.
When I'm trying to extract the name from these examples:
A) name.can.be.different.20.03.2009.boom
B) name.can.be.different.20.03.09.boom
C) name_can_be_different_2009_boom
by using:
Code:
sed 's/\(.*\)[._]\([0-9][0-9].[0-9][0-9].[0-9][0-9][0-9][0-9]\|[0-9][0-9].[0-9][0-9].[0-9][0-9]\|[0-9][0-9][0-9][0-9]\)[._]\(.*\)/\1/'
It gives:
A) name.can.be.different.20.03 (WRONG)
B) name.can.be.different (GOOD)
C) name_can_be_different (GOOD)
Why does it fail on A)? By the order of my alternatives, it's supposed to match on
[0-9][0-9].[0-9][0-9].[0-9][0-9][0-9][0-9], but it doesn't.
- 12-25-2008 #2
Regular expressions use something called greedy quantifiers. What this means is that a * or + will match as many characters as it possibly can, so long as the pattern as a whole still matches.
So with A), what you are seeing is that the .* grabs as many characters as it can while still allowing a match. In this case, that is "name.can.be.different.20.03". The remaining ".2009.boom" is then matched by "[._][0-9][0-9][0-9][0-9][._]\(.*\)", using the third alternative, and all is happy.
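To see the greedy match directly, here is a small sketch that reuses the pattern from the first post unchanged and only swaps the replacement so that each capture group is printed in brackets. The function name show_groups is made up for illustration, and GNU sed is assumed because the pattern relies on \|.

```shell
# Print all three capture groups instead of just \1, to show
# how much of the line the greedy .* in group 1 swallowed.
show_groups() {
    sed 's/\(.*\)[._]\([0-9][0-9].[0-9][0-9].[0-9][0-9][0-9][0-9]\|[0-9][0-9].[0-9][0-9].[0-9][0-9]\|[0-9][0-9][0-9][0-9]\)[._]\(.*\)/[\1] [\2] [\3]/'
}

echo 'name.can.be.different.20.03.2009.boom' | show_groups
# prints: [name.can.be.different.20.03] [2009] [boom]
```

Group 2 ends up holding only "2009", which is why the "20.03" stays attached to the name.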
How to correct this? Well, Perl and Java support nongreedy quantifiers, which solve this problem. Without access to these, it depends on the constraints of your input. If, for instance, "name.can.be.different" is guaranteed to never contain any digits, we could use the following regular expression:
Code:
sed 's/\([^0-9]*\)[._]\([0-9][0-9].[0-9][0-9].[0-9][0-9][0-9][0-9]\|[0-9][0-9].[0-9][0-9].[0-9][0-9]\|[0-9][0-9][0-9][0-9]\)[._]\(.*\)/\1/'
The regular expression now requires that no digits appear in the first group. As a result, the "20.03" cannot possibly be included.
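As a quick check, the [^0-9]* version can be run against the three sample names from the first post. This is only a sketch: the wrapper name extract_name is invented here, GNU sed is assumed, and it relies on the stated guarantee that the name part never contains digits.

```shell
# Same command as above, wrapped in a function so it can be reused.
# Group 1 may not contain digits, so it has to stop before the date.
extract_name() {
    sed 's/\([^0-9]*\)[._]\([0-9][0-9].[0-9][0-9].[0-9][0-9][0-9][0-9]\|[0-9][0-9].[0-9][0-9].[0-9][0-9]\|[0-9][0-9][0-9][0-9]\)[._]\(.*\)/\1/'
}

printf '%s\n' \
    'name.can.be.different.20.03.2009.boom' \
    'name.can.be.different.20.03.09.boom' \
    'name_can_be_different_2009_boom' | extract_name
# prints:
# name.can.be.different
# name.can.be.different
# name_can_be_different
```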
Another possibility: if [0-9][0-9][0-9][0-9] on its own is only allowed when underscores are the separators, you can change the regular expression to reflect this. It will mean a more complicated regular expression, but it will more accurately reflect the possible inputs (you generally want your regular expressions to be as strict as possible).
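One way that stricter expression might look is sketched below. It assumes two things the thread does not confirm: dotted dates (dd.mm.yyyy or dd.mm.yy) are always separated by literal dots, and a bare four-digit year only ever appears between underscores. The function name strict_name is hypothetical, and GNU sed is assumed for \|.

```shell
# Group 2 accepts either a dotted date between literal dots
# (\. instead of the any-character .) or a 4-digit year between underscores.
# A lone ".2009." no longer qualifies, so the greedy .* has to give back "20.03".
strict_name() {
    sed 's/^\(.*\)\(\.[0-9]\{2\}\.[0-9]\{2\}\.\([0-9]\{4\}\|[0-9]\{2\}\)\.\|_[0-9]\{4\}_\).*$/\1/'
}

printf '%s\n' \
    'name.can.be.different.20.03.2009.boom' \
    'name.can.be.different.20.03.09.boom' \
    'name_can_be_different_2009_boom' | strict_name
# prints:
# name.can.be.different
# name.can.be.different
# name_can_be_different
```

Because the date itself is pinned down, this variant does not need the no-digits guarantee: a made-up input such as track2.20.03.2009.boom would come out as track2.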
As a general rule, it's probably not a good idea to rely on the order within an alternation in any case, as different engines may not guarantee matching in a certain order.
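A tiny experiment can show how little the order matters in sed specifically. POSIX-style engines, which GNU sed is assumed to be here, pick the longest match starting at the leftmost position, whichever alternative is written first. The function names and the test string are made up for the example.

```shell
# Same two alternatives, written in both orders.
short_first() { sed 's/ab\|abcd/X/'; }
long_first()  { sed 's/abcd\|ab/X/'; }

echo 'abcd' | short_first   # prints: X
echo 'abcd' | long_first    # prints: X
```

A backtracking engine such as Perl's would stop at "ab" for the first ordering and leave "Xcd", so the "first alternative that succeeds" rule from the original question describes those engines, not sed.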
Hope this helps. Let us know what you decide to do.
DISTRO=Arch
Registered Linux User #388732
- 12-25-2008 #3 Just Joined!
- Join Date
- Oct 2008
- Posts
- 10
Thank you for your help, this makes it much clearer. As the name will never contain any numbers, I've decided to use the solution you suggested.

