Metacharacters

The special characters we use in regular expressions are called metacharacters, because they are characters that describe other characters.

Some easy metacharacters

Table 8-2. Regular expression metacharacters

Metacharacter(s)  Matches...
^                 Start of string
$                 End of string (or just before a newline at the very end of the string)
.                 Any single character except \n (unless the /s modifier is used, in which case . matches \n too)
\n                Newline (subtly different from $: in multiline mode there may be newlines embedded in the string you're working with)
\t                Tab
\s                Any whitespace character, such as space or tab
\S                Any non-whitespace character
\d                Any digit (0 to 9)
\D                Any non-digit
\w                Any "word" character: alphanumeric plus underscore (_)
\W                Any non-word character
\b                A word boundary: the zero-length point between a word character (as defined above) and a non-word character or either end of the string

These and other metacharacters are all outlined in chapter 2 of the Camel book and in the perlre manpage - type perldoc perlre to read it.
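
The notes on ^, $ and . in the table can be checked with a short script. This is only a sketch: the sample text is made up, and it uses the =~ binding operator and the /m and /s modifiers a little ahead of their proper introduction.

```perl
use strict;
use warnings;

# Made-up sample text with one embedded newline.
my $text = "first line\nsecond line";

# Without modifiers, ^ anchors to the start of the whole string.
my $plain = ($text =~ /^second/)  ? 1 : 0;
# With /m (multiline mode), ^ also matches just after an embedded newline.
my $multi = ($text =~ /^second/m) ? 1 : 0;

# Without /s, the dot refuses to match the newline; with /s it matches it.
my $dot_plain = ($text =~ /line.second/)  ? 1 : 0;
my $dot_s     = ($text =~ /line.second/s) ? 1 : 0;

print "plain=$plain multi=$multi dot_plain=$dot_plain dot_s=$dot_s\n";
# prints: plain=0 multi=1 dot_plain=0 dot_s=1
```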

Any character that isn't a metacharacter just matches itself. If you want to match a character that's normally a metacharacter, you can escape it by preceding it with a backslash (\).
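
As a small illustration of escaping, the sketch below compares an unescaped dot with an escaped one. The test string "3x14" is invented for the example, and the =~ operator used to apply the pattern is the matching operator covered later.

```perl
use strict;
use warnings;

# Invented test string: note the "x" where a dot might be expected.
my $version = "3x14";

# Unescaped, the dot is a metacharacter and matches any character, including "x".
my $loose  = ($version =~ /3.14/)  ? "match" : "no match";
# Escaped with a backslash, the dot matches only a literal ".".
my $strict = ($version =~ /3\.14/) ? "match" : "no match";

print "unescaped: $loose, escaped: $strict\n";
# prints: unescaped: match, escaped: no match
```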

Some quick examples:

# Perl regular expressions are usually found within slashes - the
# matching operator/function which we will see soon. 

/cat/                                   # matches the three characters
                                        # c, a, and t in that order.
/^cat/                                  # matches c, a, t at start of string
/\scat\s/                               # matches c, a, t with a whitespace
                                        # character on either side
/\bcat\b/                               # similar, but matches at word
                                        # boundaries, so no whitespace is
                                        # needed or included in the match

# we can interpolate variables just like in strings:

my $animal = "dog";                     # we set up a scalar variable
/$animal/                               # matches d, o, g
/$animal$/                              # matches d, o, g at end of string

/\$\d\.\d\d/                            # matches a dollar sign, then a
                                        # digit, then a dot, then
                                        # another digit, then another
                                        # digit, e.g. $9.99
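
To see the patterns above in action, here is a minimal sketch that tries each one against a handful of strings. The sample strings and the helper matching_patterns() are invented for this illustration, and =~ is the matching operator that is introduced properly later.

```perl
use strict;
use warnings;

my $animal = "dog";

# Hypothetical helper: return the labels of the example patterns
# that match the given string.
sub matching_patterns {
    my ($s) = @_;
    my @hits;
    push @hits, '/cat/'         if $s =~ /cat/;
    push @hits, '/^cat/'        if $s =~ /^cat/;
    push @hits, '/\scat\s/'     if $s =~ /\scat\s/;
    push @hits, '/\bcat\b/'     if $s =~ /\bcat\b/;
    push @hits, '/$animal$/'    if $s =~ /$animal$/;
    push @hits, '/\$\d\.\d\d/'  if $s =~ /\$\d\.\d\d/;
    return @hits;
}

# Invented sample strings.
for my $s ('cat', 'the cat sat', 'concatenate', 'hot dog', 'only $9.99') {
    my @hits = matching_patterns($s);
    print "'$s' is matched by: @hits\n";
}
```

Note that "concatenate" is matched only by /cat/, since the c, a, t in the middle of the word is neither surrounded by whitespace nor sitting on a word boundary.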

Quantifiers

What if, in our last example, we'd wanted to say "Match a dollar, then any number of digits, then a dot, then two more digits"? What we need are quantifiers.

Table 8-3. Regular expression quantifiers

Quantifier  Meaning
?           Match 0 or 1 times
*           Match 0 or more times
+           Match 1 or more times
{n}         Match exactly n times
{n,}        Match n or more times
{n,m}       Match between n and m times
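
The table can be explored with a short script. This is a sketch under some assumptions: the test strings and the helper matches_for() are made up for the illustration, and each quantifier is interpolated into the pattern as a variable, as in the earlier examples.

```perl
use strict;
use warnings;

# Made-up test strings: an "a", then zero to three "b"s, then a "c".
my @strings = ("ac", "abc", "abbc", "abbbc");

# Hypothetical helper: return the test strings matched when the
# given quantifier is placed after the letter b.
sub matches_for {
    my ($quantifier) = @_;
    return grep { /^ab${quantifier}c$/ } @strings;
}

for my $q ('?', '*', '+', '{2}', '{2,}', '{1,2}') {
    printf "b%-6s matches: %s\n", $q, join(", ", matches_for($q));
}
```

Each quantifier applies only to the item immediately before it, here the single letter b, so the a and c must still appear exactly once.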

Greediness

Regular expressions are, by default, "greedy". This means that any quantified pattern, for instance .*, will try to match as much of the string as it possibly can while still allowing the rest of the expression to match. Greediness is sometimes referred to as "maximal matching".

To change this behaviour, follow the quantifier with a question mark, for example .*?. This is sometimes referred to as "minimal matching".

my $string = "abracadabra";

/a.*a/                # greedy -- matches "abracadabra"
/a.*?a/               # not greedy -- matches "abra"
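
To check the two results above, the following sketch captures what each pattern actually matched. It relies on capturing parentheses and a match in list context, neither of which has been introduced yet, and the quoted-text string is invented for the illustration.

```perl
use strict;
use warnings;

my $string = "abracadabra";

# In list context, a match returns the text captured by the parentheses.
my ($greedy)  = $string =~ /(a.*a)/;
my ($minimal) = $string =~ /(a.*?a)/;

print "greedy:  $greedy\n";    # abracadabra
print "minimal: $minimal\n";   # abra

# An invented example where the difference matters: pulling out quoted text.
my $speech = 'She said "yes" and then "no"';
my ($greedy_quote)  = $speech =~ /(".*")/;
my ($minimal_quote) = $speech =~ /(".*?")/;

print "greedy:  $greedy_quote\n";    # "yes" and then "no"
print "minimal: $minimal_quote\n";   # "yes"
```

The greedy version runs from the first quote mark to the last one in the string, while the minimal version stops at the first closing quote it finds.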

Exercises

  1. You now know enough to work out the price example above. Work it through.

  2. Another example: what regular expression would match the word "colour" with either British or American spellings?

  3. How can we match any four-letter word?