This article describes a simple use of regular expressions. Its aim is to encourage users to try the most powerful search-and-replace paradigm available and, hopefully, to start using it.
It cannot, however, replace the good tutorials available on the sites that are also mentioned in this article. The article reproduces the actual steps I took to complete my task, to show the specifics and the problems I ran into.

There is nothing harder than starting to learn something from the beginning. It is harder still when the thing to be mastered requires logic and deeper understanding rather than just following a procedure. The difference between knowing and understanding is enormous. Real understanding will never let you down, although it may take longer to acquire. It also lasts longer and is easier to extend when the situation you find yourself in requires it. Some time ago I came across one of these nuts that needed cracking. It involved a simple search and replace that I could turn around in my mind any way I wanted, but could not carry out in a real situation. I could have asked a guru for a quick solution, but then I would have had to ask again every time I encountered something new. It was time to sit down and start learning.

The first problem I faced that needed a slightly more advanced approach to search and replace came when I tried to start translating a freshly released version of a Linux application but could not find a *.pot file. A *.pot file is nothing more than a text file of so-called translation messages, in which every string the program uses in its original language is paired with a string that holds its translation. It looks like this:

msgid "some text to translate, usually written in English"
msgstr "a translated text into a different language"

The lines declared with "msgid" are the original messages and the lines starting with "msgstr" are the translated ones. The story goes like this: I searched the internet for an empty *.pot file but could not find one, and I had no idea how to empty an already translated file. The translation software I was using had no such option. Since I could not translate the whole file in one sitting, I needed some way to distinguish the messages still translated into the other language from those I had already translated into my own. Unfortunately the program I used did not distinguish between languages either, only between untranslated, fuzzy and translated messages. I remembered once reading about some strange expressions that could do the job faster than I could find an empty pot file or an application that can empty one. I could have done it by hand, but emptying more than a thousand messages is more than a few minutes' work. I opened an editor and started with my usual approach, typing different search expressions with wildcards and looking for an appropriate replacement string. The replacement was not a big problem, since I knew that I needed every "msgstr" emptied. A typical entry in an empty *.pot file looks like this:

msgid "some text to translate, usually written in English"
msgstr ""

The real problem was the search string. It needed to match any character, and any number of characters, between the double quotes. After trying a few strings without success I realized that it was perhaps time to check some documentation or a tutorial. I knew that the search expression I was looking for should be a so-called regular expression, but at the time I knew nothing about them. A search for regular expressions returned quite a few hits, and not unexpectedly the first one seemed to be just what I was looking for. I found a string that was supposed to start at some position, check and mark every occurrence of the listed characters, and end the line. I searched for:

msgstr "\b[A-Z0-9._%-]"\b

As explained on the web page, the string was supposed to first match the literal characters msgstr ", then start the logical part with the special \b, which matches a word boundary, that is, a position between a word character and a non-word character, or the start or end of the string next to a word character. It then moves on to the notorious [A-Z0-9._%-], which accepts any character from A-Z and 0-9 as well as the dot, underscore, percent sign and minus, and finally ends with another ", which is also the end of the translated string, followed by a closing \b. I ran the string in the text editor and to my surprise it did not match anything. It must have been wrong somehow. Copying and pasting a search string from a web page did not produce good results. A bit intrigued by the problem, I really needed to get into the logic of regular expressions. I pulled up a metacharacter map and started reading. It was easy.
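One way to see why the pasted pattern found nothing is to try it on a few sample lines. The sketch below uses Python's re module purely as a stand-in for the editor's search box (the article itself does not name the editor), and the sample lines are made up for illustration.

```python
import re

# The pasted pattern, with the backslashes of the \b word-boundary anchors.
pattern = re.compile(r'msgstr "\b[A-Z0-9._%-]"\b')

# Hypothetical sample lines, not taken from a real translation file.
samples = [
    'msgstr "a translated text into a different language"',
    'msgstr "A"',
    'msgstr "A"x',
]

# [A-Z0-9._%-] has no quantifier, so it stands for exactly ONE character,
# and lowercase letters and spaces are not in the class at all.
# The trailing \b also demands a word character right after the closing quote.
results = [bool(pattern.search(line)) for line in samples]

for line, found in zip(samples, results):
    print(found, line)
```

Only the last, rather artificial line matches, which suggests the pattern was written for a different kind of text than a translation file.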

I immediately saw that a few terms are in constant use and that I needed to tell them apart. First come literal characters, which are, as the name says, literal. To make this straightforward and easier to remember, think of them as the characters you usually want to find in your text, such as letters and numbers. Then there are metacharacters, or special characters, which have a special meaning in regular expressions and are usually not found in the text you write, e.g. ^ (caret), * (asterisk) or $ (dollar). If you want to include one of these special characters in a literal search string, you need to escape it with a \ (backslash), e.g. \^ for a caret or \* for an asterisk. Last come search expressions, the logical strings that help us find a pattern and are built from literal characters and metacharacters.

An oversimplified definition of a regular expression would be that it is a "pattern that describes a certain amount of text". The word "describes" is the key word here. It means that if I can describe a certain amount of text in my own words, then I can also write that description as a logical pattern that will find, select, replace, count or do whatever else I need with the matches. The literal characters in example (1) find every occurrence of the word "msgstr" in this article, but fail to select the plural forms of the same message in full. These are followed by [n] and look like msgstr[n], where n is the index of the plural form, counted from 0. Strictly speaking the pattern does find them, since "msgstr" matches everything except the trailing [n], but it does not select the [n], because the literal string says nothing about going further. To find only the whole word we can set word boundaries (2). To search for both msgid and msgstr we need alternation (3). To find only the plural translations we can search for a string that allows only a digit between the brackets (4). Note that the metacharacters [ and ] are escaped as \[ and \] to make them literal characters! It may look strange, but it is easy to use. In addition, if I need to find only the first and the second plural I can use a so-called character class, written in square brackets (5), and to cover only the plurals I have, from 0 to 3, I can use a range (6). To skip the third plural translation I simply write the class without the number 2 (7). Yes, you noticed correctly: plurals are counted from zero [0] and not from one, like most things in computer science. Remember that a caret typed right after the opening square bracket negates the character class.

1. msgstr
2. \bmsgstr\b
3. msg(id|str)
4. msgstr\[\d\]
5. msgstr\[[01]\]
6. msgstr\[[0-3]\]
7. msgstr\[[013]\]
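The seven patterns can be compared side by side with a short script. This is only a sketch: it assumes Python's re module in place of an editor, writes the patterns with their backslashes, and runs them against made-up sample lines.

```python
import re

# The numbered patterns from the list above, as Python raw strings.
patterns = {
    1: r'msgstr',
    2: r'\bmsgstr\b',
    3: r'msg(id|str)',
    4: r'msgstr\[\d\]',
    5: r'msgstr\[[01]\]',
    6: r'msgstr\[[0-3]\]',
    7: r'msgstr\[[013]\]',
}

# Hypothetical lines in the style of a translation file.
lines = [
    'msgid "one file"',
    'msgstr "translated"',
    'msgstr[0] "first plural"',
    'msgstr[1] "second plural"',
    'msgstr[2] "third plural"',
    'msgstr[3] "fourth plural"',
]

# For each pattern, collect the exact text it selects on every line it matches.
selected = {
    number: [m.group(0) for m in (re.search(p, line) for line in lines) if m]
    for number, p in patterns.items()
}

for number, hits in selected.items():
    print(number, hits)
```

Pattern (1) selects only the six letters msgstr even on the plural lines, while patterns (4) to (7) select the bracketed index as well, and (7) skips the line with index 2.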

The next thing that caught my attention was the anchors. These do not match any characters but define a position. The caret matches the start of the string and the dollar sign the end of the string. The \b, roughly speaking, defines a word boundary, or more precisely a position between a word character and a non-word character. It also matches at the start and at the end of the string if the first or last character of the string is a word character. That was enough explaining and I was ready to start again. This time I had a simple table to help me work out the search string.

Metacharacter            Description

[ ] (square brackets)    matches any one character listed inside the brackets; a range can be given with a minus, e.g. [0-9]
( ) (round brackets)     group patterns together
{ } (curly brackets)     specify an exact amount of repetition of the preceding element
[^ ] (negated brackets)  matches any one character NOT listed inside the brackets
. (dot)                  matches any single character
$ (dollar)               matches the end of a line; nothing comes after it
^ (caret)                matches the beginning of a line; nothing comes before it
? (question mark)        matches zero or one occurrence of the preceding element
* (asterisk)             matches zero or more occurrences of the preceding element
+ (plus)                 matches one or more occurrences of the preceding element
\ (backslash)            the escape character, used to match metacharacters literally
| (pipe)                 the Boolean operator for OR

Sequence                 Description

\b                       matches a word boundary, i.e. the position between a word character and a non-word character such as a space; \B is the negation of \b
\d                       matches a digit and is equivalent to [0-9]; \D is the negation of \d, equivalent to [^0-9]
\n                       matches a newline character, also called the end-of-line character
\s                       matches white space, e.g. space, tab, form feed; \S is the negation of \s
\t and \v                match a horizontal tab (\t) or a vertical tab (\v) character
\w                       matches any word character, including the underscore, and is equivalent to [A-Za-z0-9_]; again, \W negates \w
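A few rows of the table can be tried out directly. The sketch below again assumes Python's re module, and every sample string in it is invented for the purpose.

```python
import re

checks = {
    # ? makes the preceding element optional (zero or one occurrence)
    'optional': re.findall(r'colou?r', 'color colour'),
    # * and + repeat the preceding element (zero or more / one or more times)
    'star': re.findall(r'ab*', 'a ab abbb'),
    'plus': re.findall(r'ab+', 'a ab abbb'),
    # {4} repeats the preceding element exactly four times
    'braces': re.findall(r'\d{4}', 'released 2007-03-05'),
    # [^ ] matches any character NOT listed in the brackets
    'negated': re.findall(r'[^0-9 ]+', 'abc 123 def'),
    # \w+ matches runs of letters, digits and underscores
    'word': re.findall(r'\w+', 'msgstr[0] "file_name"'),
    # ^ and $ tie the match to the start and the end of the string
    'anchored': bool(re.search(r'^msgstr ""$', 'msgstr ""')),
}

for name, value in checks.items():
    print(name, value)
```

Note how the quantifiers ?, * and + always act on the element written just before them.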

As the table says, I first needed to anchor the pattern to the start of the line. For that I needed the caret ^ sign. Placed at the front of the pattern, it tells the search engine to look for the target string only at the beginning of a line. I needed to find the lines that start with msgstr, so this was a good choice. It was also obvious that I needed to distinguish between msgid and msgstr lines, so I continued with the literal string msgstr. After this literal string I needed to tell the engine to match any character that follows, and any number of them. How else could I know how many characters there are between the double quotes? The reason for starting the "whatever follows" part here and not after the first double quote is that the next character can be a space, a double quote or even an opening bracket in the case of plural forms. To match almost any character I needed to add a dot (.). So now I had any one character following my msgstr and only needed to repeat it indefinitely. That was a simple one, since I remembered that the wildcard * usually means everything and anything. In regular expressions it means "repeat the preceding element zero or more times", so my "whatever follows" character is repeated until it reaches the end. In my case the end was the end of the line. The search string was finally done, and it looked less complicated than the one I had merely pasted from a web page.

search for: ^msgstr.*

replace with: msgstr ""
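The same search and replace can be reproduced outside an editor. The following is a minimal sketch that assumes Python's re module and a made-up two-entry file; re.MULTILINE is needed so that ^ matches at the start of every line rather than only at the start of the whole text.

```python
import re

# Hypothetical content of an already translated file.
po_text = (
    'msgid "Open file"\n'
    'msgstr "Odpri datoteko"\n'
    '\n'
    'msgid "Close"\n'
    'msgstr "Zapri"\n'
)

# ^msgstr.* : a line starting with msgstr, followed by anything up to the
# end of that line (the dot does not match the newline character).
emptied = re.sub(r'^msgstr.*', 'msgstr ""', po_text, flags=re.MULTILINE)

print(emptied)
```

The msgid lines are left alone because they do not begin with the literal string msgstr.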

The replacement succeeded on every match, which emptied almost all translations except the multi-line ones, which have a slightly different structure. In these multi-line translations the translated text does not sit on the msgstr line but on the following lines, each of which starts with a double quote.

msgid ""
"This is again English text\n"
"written in two lines"
msgstr ""
"and this is translated string \n"
"again written in two lines "

There is a catch, though. If you read this article carefully, you probably noticed that in my case I also corrupted all my plural forms, since the [n] was replaced along with the quoted text. I really did that, but fortunately I had an undo option and a much clearer understanding of how powerful these logical expressions are. Adding just one literal space and a double quote to the search string did the job. I did not play with plural forms any further. It was enough for one successful day.

search for: ^msgstr ".*

replace with: msgstr ""
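The difference between the two search strings shows up clearly on a sample that contains plural and multi-line entries. As before, this is a sketch that assumes Python's re module, and the file content is invented for illustration.

```python
import re

# Hypothetical file with a plural entry, a plain entry and a multi-line entry.
po_text = (
    'msgid "%d file"\n'
    'msgstr[0] "%d datoteka"\n'
    'msgstr[1] "%d datoteki"\n'
    '\n'
    'msgid "Close"\n'
    'msgstr "Zapri"\n'
    '\n'
    'msgid ""\n'
    '"Two line\\n"\n'
    '"message"\n'
    'msgstr ""\n'
    '"Dvovrsticno\\n"\n'
    '"sporocilo"\n'
)

# First attempt: everything after msgstr is replaced, including the [n] index.
first = re.sub(r'^msgstr.*', 'msgstr ""', po_text, flags=re.MULTILINE)

# Second attempt: msgstr must be followed by a space and a double quote,
# so the msgstr[n] lines no longer match and stay as they were.
second = re.sub(r'^msgstr ".*', 'msgstr ""', po_text, flags=re.MULTILINE)

first_lines = first.splitlines()
second_lines = second.splitlines()

print(first_lines.count('msgstr ""'), second_lines.count('msgstr ""'))
```

In both runs the continuation lines of the multi-line translation survive, because they start with a double quote and not with msgstr. Emptying those, and the plural forms, would need a further pattern that the article does not cover.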

To conclude, a regular expression, sometimes abbreviated as regex or regexp, is just a logical text string that describes a search pattern. Regular expressions can help you make magic. They work much like the wildcards * and ?, only they are more advanced. After this first encounter and a successfully accomplished mission, I grew confident with these logical expressions and have used them many times since. Interestingly, I now mostly use them when building web pages, which is somewhat similar to translating Linux applications.

You can find additional tutorials and howtos on:

  • Introduction to the RegEx Tutorial
  • Tao of Regular Expressions
  • Regular Expressions - User guide
  • Regular Expressions Info Site
  • Regular Expressions on Wikipedia
     
    Comments about this article
    Mr
    written by: Mark on 2007-03-05 07:47:45
    I need to know how to extract text from a string using a regular expression and regexp or regsub etc.. I have an expression as ".*.txt" which is an expression which will filter out the text files in a string. I open the file with the string in it, no problem, I split the string into lines, no problem, now when I run : set res [regexp .*.txt $line match] for each line, I get a 1 or a 0 saying the text matches the expression. What I really want to do is actually get the filename from the line of data from the string - how would I do this?