Grouping techniques

Character classes

A character class can be used to find a single character that matches any one of a given set of characters.

Let's say you're looking for occurrences of the word "grey" in some text, and then remember that the American spelling is "gray". We can match both spellings with a character class. Character classes are specified using square brackets, thus: /gr[ea]y/
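As a minimal sketch of the idea (the word list is made up purely for illustration), the following filters a list down to the words containing either spelling:

```perl
use strict;
use warnings;

# Sample words are illustrative only.
my @words = ('grey', 'gray', 'griy', 'greay');

# [ea] matches exactly one character: either "e" or "a".
my @matches = grep { /gr[ea]y/ } @words;

print "$_\n" for @matches;
```

Note that "greay" is not matched: the class stands for a single character, not for "e", "a", or both.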

We can also specify ranges of characters by saying things like [A-Z] or [0-9]. For ASCII text, the sequences \d and \w can easily be expressed as character classes: [0-9] and [a-zA-Z0-9_] respectively.
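To illustrate ranges, here is a small sketch that checks some made-up product codes (one capital letter followed by two digits), once with explicit ranges and once with \d. The codes and variable names are assumptions for the example only:

```perl
use strict;
use warnings;

# Hypothetical product codes, for illustration only.
my @codes = ('A42', 'b42', 'Z07', 'AB1');

# One capital letter, then two digits, written with explicit ranges...
my @valid_ranges = grep { /^[A-Z][0-9][0-9]$/ } @codes;

# ...and the same test using \d in place of [0-9].
my @valid_escape = grep { /^[A-Z]\d\d$/ } @codes;

print "ranges: @valid_ranges\n";
print "escape: @valid_escape\n";
```

On plain ASCII input the two patterns accept the same strings. On Unicode strings \d and \w can match characters outside the ASCII ranges, so the explicit classes are the stricter choice.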

We can negate a character class by putting a caret at the start of it. That's right, the same character that we used to match the start of the line. Larry Wall has written that Perl does anything you want -- unless you want consistency, and it has also been said that consistency is the hobgoblin of small minds. Therefore, we'll learn about these character class inconsistencies, learn to love them, and flatter ourselves that we do not have small minds.
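A short sketch of a negated class, using made-up input strings: it picks out the strings that contain at least one character that is not a digit.

```perl
use strict;
use warnings;

# Sample inputs are illustrative only.
my @inputs = ('12345', '12a45', '', '42 ');

# Inside the brackets, a leading ^ means "any character NOT listed here".
my @has_non_digit = grep { /[^0-9]/ } @inputs;

print "[$_]\n" for @has_non_digit;
```

A negated class still has to match one character, which is why the empty string is not selected.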

Here are some of the special rules that apply inside character classes. I make no guarantee that this is a complete list; additions are always welcome.

Exercises

Your trainer will help you do the following exercises as a group.

  1. How would we find any word starting with a letter in the first half of the alphabet, or with X, Y, or Z?

  2. What regular expression could be used for any word that starts with letters other than those listed in the previous exercise?

  3. There's almost certainly a problem with the regular expression we've just created - can you see what it might be?

Alternation

The problem with character classes is that they only match one character. What if we wanted to match any of a set of longer strings, like a set of words?

The way we do this is to use the pipe symbol | for alternation:

/cat|dog|budgie/                # matches any of our pets

Now we come up against another problem. If we write something like:

/^cat|dog|budgie$/

...to match any of our pets on a line by itself, what we're actually matching is: "the start of the string followed by cat; or dog; or budgie followed by the end of the string". This is not what we originally intended. To fix this, we enclose our alternation in round brackets:

/^(cat|dog|budgie)$/
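The difference between the two patterns can be checked with a quick sketch like this one (the sample lines are invented for the purpose):

```perl
use strict;
use warnings;

# Sample lines are illustrative only.
my @lines = ('cat', 'catalogue', 'hotdog', 'my budgie', 'dog');

# Without brackets: "^cat" OR "dog" OR "budgie$".
my @loose = grep { /^cat|dog|budgie$/ } @lines;

# With brackets: both anchors apply to every alternative.
my @strict = grep { /^(cat|dog|budgie)$/ } @lines;

print "loose:  ", join(', ', @loose),  "\n";
print "strict: ", join(', ', @strict), "\n";
```

The unbracketed pattern accepts all five lines, while the bracketed one accepts only "cat" and "dog".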

# a simple matching program to get some email headers and print them out

while (<>) {
        print if /^(From|Subject|Date):\s/;
}

The above email example can be found in exercises/mailhdr.pl.

The concept of atoms

Round brackets bring us neatly into the concept of atoms. The word "atom" derives from the Greek atomos meaning "indivisible" (little did they know!). What we use it to mean is "something that is a chunk of regular expression in its own right" -- as opposed to "something that can wipe out cities with a single blast".

Atoms can be arbitrarily created by simply wrapping things in round brackets - handy for indicating grouping, for applying quantifiers to the whole group at once, and for indicating which bit(s) of a match should be returned (but we'll deal with that later).
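To see what wrapping something in round brackets does to a quantifier, here is a small sketch with invented sample strings. Without brackets the + applies only to the single character before it; with brackets it applies to the whole group:

```perl
use strict;
use warnings;

# Sample strings are illustrative only.
my @samples = ('ha', 'haaa', 'hahaha');

# + applies to the "a" alone: an "h" followed by one or more "a"s.
my @single = grep { /^ha+$/ } @samples;

# + applies to the atom (ha): one or more repetitions of "ha".
my @group = grep { /^(ha)+$/ } @samples;

print "a+    : @single\n";
print "(ha)+ : @group\n";
```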

In the /^(cat|dog|budgie)$/ example above, there are three atoms:

  1. start of line

  2. cat or dog or budgie

  3. end of line

How many atoms were there in our dollar prices example earlier?

Atomic groupings can have quantifiers attached to them. For instance:

# match a consonant (strictly, any non-vowel character) followed by a vowel,
# twice in a row, eg "tutu"
/([^aeiou][aeiou]){2}/

# match three or more words starting with "a" in a row
# eg "all angry animals"
/(\ba\w+\b\s*){3,}/
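The two patterns above can be tried out with a sketch along these lines. The variable names and the non-matching test strings are assumptions added for the example:

```perl
use strict;
use warnings;

# The two quantified groups from the text, stored as compiled patterns.
my $cv_twice = qr/([^aeiou][aeiou]){2}/;
my $three_a  = qr/(\ba\w+\b\s*){3,}/;

# Test strings; "tree" and "all angry birds" are made-up counterexamples.
for my $text ('tutu', 'tree') {
    printf "%-20s %s\n", $text, ($text =~ $cv_twice ? 'matches' : 'no match');
}
for my $text ('all angry animals', 'all angry birds') {
    printf "%-20s %s\n", $text, ($text =~ $three_a ? 'matches' : 'no match');
}
```

Because the second pattern uses \w+ after the "a", each word needs at least two characters, so the one-letter word "a" on its own would not count towards the three.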