Regular Expression Basics (in a Unix Shell)

Regular expressions are a powerful tool for searching, filtering, and manipulating text. This post covers the basics of using them to extract information from files in a Unix shell, working with grep, sed and awk.

Of course, most high-level programming languages have native support for regular expressions too, including Perl, Python, Ruby, Java, JavaScript and PHP. Even so, trying a pattern on a throw-away file at a shell prompt is often quicker than setting up a test in one of those environments, so this post may be useful to developers as well.

Regular expressions consist of literals and meta-characters. Literals are characters that simply match themselves, e.g. the alphanumeric characters, and some of the punctuation characters. Meta-characters alter the match in various ways. Some examples of meta-characters are:

  • ^A matches an A at the start of a line
  • B$ matches a B at the end of a line
  • A|B matches either A or B
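
As a quick illustration of these three meta-characters, here is a small sketch run against made-up input rather than a real file. The sample words and variable names are assumptions for the example, and grep is given -E so that | works without escaping.

```shell
# Made-up sample input (not the primes.txt file used later).
sample=$(printf 'Apple\nBanana\nCrab\nAB\n')

# ^A: lines that start with A
starts_with_a=$(printf '%s\n' "$sample" | grep '^A')

# B$: lines that end with B (matching is case-sensitive, so "Crab" is skipped)
ends_with_b=$(printf '%s\n' "$sample" | grep 'B$')

# A|B: lines that contain an A or a B anywhere
a_or_b=$(printf '%s\n' "$sample" | grep -E 'A|B')

printf '%s\n' "$a_or_b"
# Apple
# Banana
# AB
```

Here starts_with_a holds Apple and AB, while ends_with_b holds only AB.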

Let’s start by looking at grep. By default it takes a regular expression pattern and outputs all lines in its input that match this pattern. We’ll work on a file primes.txt containing a list of all the UK prime ministers over the years.

# All Ducal prime ministers:
grep Duke primes.txt

# All non-ducal primes:
grep -v Duke primes.txt

Here are a couple of warm-up exercises:

  1. Count all the Earls. (Answer: 10.)
  2. Count all the Earls, but ignore anyone with a tea named after them. You may find it useful to string two greps together with a pipe. (Answer: 9.)
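
The exercises need two things not shown so far: counting matches and chaining greps. Here is a sketch on unrelated, made-up data, so the answers stay unspoiled. It relies on grep's -c option, which prints the number of matching lines instead of the lines themselves.

```shell
# Made-up data, unrelated to primes.txt.
fruit=$(printf 'red apple\ngreen apple\nred cherry\nyellow apple\n')

# Count the lines that mention "apple".
apples=$(printf '%s\n' "$fruit" | grep -c apple)

# Count apples that are not green: the first grep keeps the apples,
# the second drops the green ones (-v) and counts what is left (-c).
non_green_apples=$(printf '%s\n' "$fruit" | grep apple | grep -vc green)

echo "$apples $non_green_apples"   # prints: 3 2
```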

But wait! There’s more than one way to skin a cat. Many Unix tools allow you to work with regular expressions, so the above can be achieved in different ways. For example, here’s how to replicate the behaviour of the two grep commands above with awk:

awk /Duke/ primes.txt
awk '!/Duke/' primes.txt

In awk we use ! to mean NOT, like in many other programming languages, rather than the ‘-v’ command line switch that grep requires. Note also the single quotes; these are required to avoid the shell interpreting the ! for its own purpose.
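
To check that the two tools really agree, the following sketch runs both negated forms over the same made-up lines and compares the results. The sample names are invented for the example.

```shell
# Made-up input lines for the comparison.
lines=$(printf 'John Duke\nJane Earl\nJim Duke\n')

# grep inverts the match with -v ...
grep_out=$(printf '%s\n' "$lines" | grep -v Duke)

# ... while awk negates the pattern itself with !
awk_out=$(printf '%s\n' "$lines" | awk '!/Duke/')

[ "$grep_out" = "$awk_out" ] && echo "same output: $awk_out"
# same output: Jane Earl
```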

Bonus exercise for the reader: type echo !!, echo !$, or echo !:1 into a terminal and see what happens.

Duke above is a regular expression consisting entirely of literals. Thus it will only match that string. Also, the characters are not any of the ones that the shell will confuse with its own meta-characters, thus we do not need to quote it. However, let’s try something more interesting:

# All prime ministers with first names starting with an 'F'
grep '^F' primes.txt

# All primes with last names ending in 'n'
grep 'n$' primes.txt

Uh-oh. There’s a problem with that last example. It returns the following two:

Augustus Fitzroy, Duke of Grafton
Arthur Wellesley, Duke of Wellington

We’re matching their titles, not their last names. How do we fix that? A first attempt might see us trying to match ‘n,’ or ‘n$’:

egrep 'n,|n$' primes.txt

Notice how we use egrep in the above? It’s not a typo. egrep is a version of grep that allows extended regular expressions. (You can get the same effect by invoking grep with the -E switch.) This is required to allow the use of the | meta-character without having to escape it.
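
Recent GNU grep releases print a warning that egrep is obsolescent, so grep -E is the safer spelling in scripts. Another option that avoids alternation altogether is to pass each alternative with its own -e option. The sketch below compares the two on three sample lines; only the Wellington line comes from the file, the other two are invented for the example.

```shell
# One line quoted above plus two made-up lines.
input=$(printf 'Arthur Wellesley, Duke of Wellington\nJane Brown, Baron of Nowhere\nJohn Smith, Earl of Somewhere\n')

# Extended syntax: | separates the alternatives.
with_alternation=$(printf '%s\n' "$input" | grep -E 'n,|n$')

# Basic syntax: one -e option per alternative, no | needed.
with_two_patterns=$(printf '%s\n' "$input" | grep -e 'n,' -e 'n$')

printf '%s\n' "$with_alternation"
# Arthur Wellesley, Duke of Wellington
# Jane Brown, Baron of Nowhere
```

Both commands select the same two lines: the first because it ends in n, the second because of the n before the comma.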

However, our regular expression still does not do what we want. Let’s introduce some more meta-characters:

  • . matches any character.
  • [abc] a character class that matches one of ‘a’, ‘b’ or ‘c’
    • [^abc] matches any character except ‘a’, ‘b’, ‘c’
  • * matches zero or more occurrences of the atom immediately preceding it. An atom can be a literal, a character class, ., or a group (we’ll meet groups later).
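
A small sketch of these meta-characters in action, using a made-up word list. The patterns are anchored with ^ and $ so that each one has to describe the whole line.

```shell
# Made-up word list for the example.
words=$(printf 'cat\ncot\ncut\nct\ncoat\n')

# . stands for exactly one character: keeps cat, cot, cut
dot=$(printf '%s\n' "$words" | grep '^c.t$')

# [ao] stands for one character out of the set: keeps cat, cot
class=$(printf '%s\n' "$words" | grep '^c[ao]t$')

# [^ao] stands for one character NOT in the set: keeps cut
negated=$(printf '%s\n' "$words" | grep '^c[^ao]t$')

# [ao]* repeats the class zero or more times: keeps cat, cot, ct, coat
star=$(printf '%s\n' "$words" | grep '^c[ao]*t$')

printf '%s\n' "$star"
# cat
# cot
# ct
# coat
```

Note that ct only matches the last pattern, because * is the only one of the four that accepts zero characters between c and t.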

Say we still want to get that list of all the Dukes, but without their titles. The sed stream editor allows us to replace the occurrence of a regular expression pattern with a string, in a stream context. The syntax is s/{pattern}/{replacement}/. We can use this to remove anything after a comma (including the comma itself) on each line:

sed 's/,.*//' primes.txt
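
To see the substitution on its own, here is a sketch that feeds sed the two lines quoted earlier instead of the whole file. The variable names are just for the example.

```shell
# The two lines shown earlier serve as input.
input=$(printf 'Augustus Fitzroy, Duke of Grafton\nArthur Wellesley, Duke of Wellington\n')

# ,.* matches the first comma and everything after it on the line;
# the empty replacement deletes that match.
names=$(printf '%s\n' "$input" | sed 's/,.*//')

printf '%s\n' "$names"
# Augustus Fitzroy
# Arthur Wellesley
```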

That gives us the last piece we need. It is left as an exercise for the reader to string together the required commands to get only the Dukes with names ending in n.

You could also try to:

  1. Count all primes with names ending with an ‘l’. (Answer: 5.)
  2. List the distinct first names of all the Earls. (Answer: Charles Edward George John Robert Spencer William)
  3. Upgrade all the Earls to Dukes.

Regular expression capturing

Someone told me once that the word bookkeeping is the only word in English that has three consecutive groups of double letters in it. How would you go about testing that hypothesis? Can we use a regular expression to do this? You bet! (For simplicity, we’ll assume that the English language consists exclusively of words found in /usr/share/dict/words.) First, though, we need some more meta-characters:

  • (abc) matches, and captures, the sequence abc.
  • \1, …, \N refers to a previous capture.
  • {N,M} enforces that the previous atom matches at least N and at most M times.

Some programs allow you to leave out N or M to simply say “at most M” or “at least N”.
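
The interval syntax can be tried out on its own with grep -E. A minimal sketch on made-up input follows. It only shows the {N,M} and {N,} forms, since the {,M} form is not accepted by every tool.

```shell
# Made-up input: runs of one to four a's.
runs=$(printf 'a\naa\naaa\naaaa\n')

# {2,3}: between two and three a's, and nothing else on the line.
two_to_three=$(printf '%s\n' "$runs" | grep -E '^a{2,3}$')

# {3,}: M left out, meaning at least three a's.
three_or_more=$(printf '%s\n' "$runs" | grep -E '^a{3,}$')

printf '%s\n' "$two_to_three"
# aa
# aaa
```

Here three_or_more holds aaa and aaaa.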

We now have the tools to find any words containing three consecutive double letters:

# The first (.) matches o and \1 repeats it,
# the second (.) matches k and \2 repeats it, etc.
egrep '(.)\1(.)\2(.)\3' /usr/share/dict/words

# Alternative way to do the above:
egrep '((.)\2){3}' /usr/share/dict/words
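
Not every system ships /usr/share/dict/words, so here is a sketch that tries both patterns on a three-word, made-up list instead. It uses grep -E in place of egrep. Back-references in extended regular expressions are an extension that GNU and BSD grep support, so other implementations may behave differently.

```shell
# Made-up word list: only the first word has three double letters in a row.
words=$(printf 'bookkeeping\nballoon\nletter\n')

# One capture group and one back-reference per double letter.
explicit=$(printf '%s\n' "$words" | grep -E '(.)\1(.)\2(.)\3')

# The same idea written as "a double letter, three times".
compact=$(printf '%s\n' "$words" | grep -E '((.)\2){3}')

echo "$explicit $compact"   # prints: bookkeeping bookkeeping
```

balloon is rejected because its two double letters (ll, oo) are not followed by a third.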

You can also use group captures in the replacement text, like so:

# Replace "hurrah" with "hurrah hurrah hurrah!"
# (-E selects extended syntax, where plain parentheses capture)
sed -E 's/(hurrah)/\1 \1 \1!/'
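
One detail to watch: sed reads basic regular expressions by default, where a capturing group is written \( and \), while the -E option (supported by both GNU and BSD sed) lets plain parentheses capture. The sketch below shows both spellings, fed from echo because the command above has no input file.

```shell
# Extended syntax (-E): plain parentheses capture.
extended=$(echo 'hurrah' | sed -E 's/(hurrah)/\1 \1 \1!/')

# Basic syntax (sed's default): the parentheses need backslashes.
basic=$(echo 'hurrah' | sed 's/\(hurrah\)/\1 \1 \1!/')

echo "$extended"   # prints: hurrah hurrah hurrah!
```

Both variables end up with the same text.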

Exercises:

  1. Find any words (in /usr/share/dict/words) containing a sequence of XX…YY…XX where X and Y are any letters, and … is any (optionally empty) sequence of letters. (Answer: On Mountain Lion I found 21. It might be different on other versions of OS X.)
  2. Find any prime ministers with a 7-character sequence of characters repeated in their name/title. (Answer: 2.)
  3. Transform any lines containing “Name,Title” into “Title (Name)”.

Hope you enjoyed this post!
