Using Regular Expressions On The Linux Command Line

Tuesday, 4. August 2009

Using Regular Expressions (RegEx) on the command line.

Questions about regular expressions come up at the LUG meetings on a regular basis. Here are some examples of regex commands I use all the time. I hope you find them useful.

Parse a file skipping commented lines.

gateway:~# egrep '^[^#]' /etc/manpath.config
MANDATORY_MANPATH			/usr/man
MANDATORY_MANPATH			/usr/share/man
MANDATORY_MANPATH			/usr/local/share/man
[...] more lines

#### Let's use wc (word count) to count the lines returned:

gateway:~# egrep '^[^#]' /etc/manpath.config | wc -l
23

gateway:~# cat /etc/manpath.config | wc -l
114

So let’s look at the command. The egrep command is grep using extended regular expression syntax (the same as grep -E). It searches through a file, and when a line contains a match for the regex, it prints that entire line. Lines without a match are not printed. Our regex is ‘^[^#]’, which reads like this:

  • The first “^” anchors the match to the start of the line.
  • A set of characters enclosed in “[]” matches any single one of those characters.
  • Unless the set starts with “^”, which reverses it: any single character except the ones in the brackets will match.
  • So, “^[^#]” matches every line whose first character is not “#”. Because it needs a first character to match, empty lines are skipped as well.

The wc command is a utility that counts lines, words, and bytes. The “-l” switch limits it to counting lines.
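To see exactly which lines ‘^[^#]’ keeps, here is a small sketch you can run anywhere. The sample file and its contents are made up for illustration (they only stand in for /etc/manpath.config). It also shows grep's own -c switch, which counts matching lines without a pipe to wc.

```shell
# Hypothetical sample file standing in for /etc/manpath.config
sample=$(mktemp)
printf '%s\n' \
    '# a comment' \
    '' \
    'MANDATORY_MANPATH /usr/man' \
    '  # an indented comment' \
    'MANDATORY_MANPATH /usr/share/man' > "$sample"

# '^[^#]' needs one non-# character at the start of the line, so the
# empty line is dropped along with the comment in column one.
kept=$(grep -E '^[^#]' "$sample")

# grep -c counts matching lines directly, without piping to wc -l.
count=$(grep -c -E '^[^#]' "$sample")

printf '%s\n' "$kept"
echo "matching lines: $count"   # matching lines: 3
rm -f "$sample"
```

Notice that the indented comment survives, because its first character is a space rather than “#”. If that matters for your files, the regex would need to allow for leading whitespace.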

Using egrep to find files that contain my home directory.

stu@linus:~$ egrep -l '\/h\/stu' *
HostingContract.ott
Scale6Vlan-Master0.odb

### Note: the forward slashes are escaped here, but grep does not require it

We are again using egrep, but we are changing what it returns when it finds a match by adding the “-l” switch. This switch causes egrep to print only the names of the files that contain a match for the regex. My home directory on the system I ran the command on is ‘/h/stu’. The “/”s are escaped in the command, but that is not required: “/” has no special meaning in a grep regex (unlike in sed, where it is the delimiter), so ‘/h/stu’ works just as well.
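Here is a minimal sketch of the “-l” switch that searches for the path with plain, unescaped slashes. The scratch directory and the two file names are invented for the example, so nothing in your real files is touched.

```shell
# Hypothetical scratch directory and file names, for illustration only
dir=$(mktemp -d)
printf 'backup path: /h/stu/docs\n' > "$dir/with_path.txt"
printf 'nothing relevant here\n'    > "$dir/without_path.txt"

# "/" is an ordinary character to grep, so no backslashes are needed.
# -l prints only the names of the files that contain a match.
found=$(cd "$dir" && grep -E -l '/h/stu' *)

echo "$found"   # with_path.txt
rm -rf "$dir"
```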

Using egrep and sed to find all routes to a class C address.

### The output we are modifying:
209.84.9.248/29 via 209.84.10.2 dev eth0  proto zebra  metric 20
209.84.9.240/29 via 209.84.10.2 dev eth0  proto zebra  metric 20
209.84.9.160/27 dev eth4  proto kernel  scope link  src 209.84.9.161
209.84.9.0/26 dev eth1  proto kernel  scope link  src 209.84.9.1
209.84.9.64/26 dev eth3  proto kernel  scope link  src 209.84.9.65 

### The results

border2:~# ip route | egrep '209\.84\.9' \
        | sed 's/^\(209\.84\.9\.[0-9]\{1,3\}\/[0-9]\{1,2\}\).*$/\1/'
209.84.9.248/29
209.84.9.240/29
209.84.9.160/27
209.84.9.0/26
209.84.9.64/26

### Now let's do the same thing, but with just sed

border2:~# ip route \
        | sed -e 's/^\(209\.84\.9\.[0-9]\{1,3\}\/[0-9]\{1,2\}\).*$/\1/' \
        -e '/^209\.84\.9/!d'
209.84.9.248/29
209.84.9.240/29
209.84.9.160/27
209.84.9.0/26
209.84.9.64/26

Ok, let’s start with the egrep command. As you can see, we are using the escape character again. This time, we are escaping the “.”. We need to do this because “.” has a special meaning: it matches any single character. So, if we want to search for an actual period, we need to escape it. Left unescaped, ‘209.84.9’ would also match a string such as ‘209184.9’.
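A quick way to convince yourself of what the unescaped “.” does is to feed grep one real address and one look-alike. Both input lines below are made up for the test.

```shell
# Two made-up input lines: one real address, one look-alike.
lines=$(printf '%s\n' '209.84.9.0/26 dev eth1' '209x84y9 is not an address')

# Unescaped: each "." matches any one character, so both lines match.
loose=$(printf '%s\n' "$lines" | grep -c -E '209.84.9')

# Escaped: "\." matches only a literal period, so one line matches.
strict=$(printf '%s\n' "$lines" | grep -c -E '209\.84\.9')

echo "unescaped: $loose, escaped: $strict"   # unescaped: 2, escaped: 1
```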

Now, the sed command I’m using is a bit more complicated, but once you understand what I’m doing, it should become clear. Here is an overview of the meta characters and modifiers I’m using.

I am constructing a substitution command, denoted by the “s” at the beginning of the expression: s/<match data>/<replace with>/

“^” means start the match at the beginning of the line.

Everything enclosed in the \( stuff to match \) is saved so you can reuse it later with “\1”.

The meaning of “209\.84\.9\.” is nothing more than the first part of the network address I want to display when I’m done.

This however: “[0-9]\{1,3\}\/[0-9]\{1,2\}” might require a little thought to understand. Let’s start with what we already know: “[0-9]” is a single digit from 0 to 9. That’s simple, but what about “\{1,3\}”? Well, that means that I want the digit to occur at least once, but no more than 3 times, before the next character to match occurs, which is “\/”. That is an actual “/”, escaped because sed uses “/” as the delimiter of the “s” command. After that comes “[0-9]\{1,2\}”, which is one or two digits, built the same way.

That completes all the stuff we want to keep for later, so the next part of the string is “\)”. This is followed by a really greedy expression: “.*”. We talked about the “.” earlier, but the “*” is new. The “*” means 0 or more of the previous item, and since the previous item is a “.”, that means any run of characters, or nothing at all. It is called greedy because it matches as much as it can.
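If you want to see “.*” taking as much as it can, here is a tiny sketch. The path is just a sample string, and the command is not part of the route example. It only shows that “.*” followed by a “/” runs all the way to the last slash, not the first.

```shell
# A sample string with several slashes in it
path='/usr/local/share/man'

# ".*" grabs as much as it can, so ".*\/" matches up to the LAST slash
# and the substitution removes everything except the final component.
last=$(printf '%s\n' "$path" | sed 's/.*\///')

echo "$last"   # man
```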

We finish the match portion up with a “$” which reminds you to send me a check. Not really, just seeing if you are awake, it means the end of the line.

So let’s try writing the match portion of it in simple terms:

start of line – 209.84.9. – one to three digits – / – one to two digits – anything at all to end of line

Now that we got through that, the last part is what we are going to substitute it with. All I want is the network part of the line, which is inside the “\(\)” group, and I can access that using the “\1” backreference, which is what I have done above.
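For comparison, here is the same substitution written two ways against one line of the output shown earlier. The first is the basic-regex form used in this article. The second is a sketch that assumes your sed supports the -E switch (GNU and BSD sed both do) and uses “#” as the delimiter, so neither the grouping, the repeat counts, nor the “/” need a backslash.

```shell
# One sample line from the ip route output above
line='209.84.9.248/29 via 209.84.10.2 dev eth0  proto zebra  metric 20'

# Basic regex, as in the article: \( \), \{ \} and an escaped "/".
bre=$(printf '%s\n' "$line" |
    sed 's/^\(209\.84\.9\.[0-9]\{1,3\}\/[0-9]\{1,2\}\).*$/\1/')

# Extended regex (-E) with "#" as the delimiter: fewer backslashes.
ere=$(printf '%s\n' "$line" |
    sed -E 's#^(209\.84\.9\.[0-9]{1,3}/[0-9]{1,2}).*$#\1#')

echo "$bre"   # 209.84.9.248/29
echo "$ere"   # 209.84.9.248/29
```

Which form you use is a matter of taste. The periods still need escaping in both, because “.” is special in either syntax.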

Last but not least, I used an example above that is done completely in sed. The expression “/^209\.84\.9/!d” was added to the sed command to select the lines we wanted. Normally a “d” deletes every line that matches, but adding the “!” reverses that, so it deletes only the lines that do not match.
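The sed-only approach can be tried without a router. The sketch below uses made-up routing lines (two wanted routes and one unrelated default route) and also shows an alternative that is not in the commands above: the -n switch with the “p” flag, which prints only the lines where the substitution succeeded.

```shell
# Made-up routing lines: two wanted routes and one unrelated default route
routes=$(printf '%s\n' \
    '209.84.9.0/26 dev eth1  proto kernel  scope link  src 209.84.9.1' \
    'default via 10.0.0.1 dev eth0' \
    '209.84.9.64/26 dev eth3  proto kernel  scope link  src 209.84.9.65')

# The match portion of the substitution, kept in a variable for reuse
re='^\(209\.84\.9\.[0-9]\{1,3\}\/[0-9]\{1,2\}\).*$'

# Substitute, then delete every line that does NOT start with the prefix.
with_d=$(printf '%s\n' "$routes" | sed -e "s/$re/\1/" -e '/^209\.84\.9/!d')

# Alternative: -n suppresses normal output, p prints substituted lines only.
with_p=$(printf '%s\n' "$routes" | sed -n "s/$re/\1/p")

printf '%s\n' "$with_d"
printf '%s\n' "$with_p"
```

Both variables should end up holding the same two network addresses, with the default route gone.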

I hope this sparks some interest in regular expressions. For a Unix administrator, regular expressions are a life saver. Here is a book that will help you get started: Mastering Regular Expressions, by Jeffrey E. F. Friedl.

— Stu
