The special characters we use in regular expressions are called metacharacters, because they are characters that describe other characters.
Table 8-2. Regular expression metacharacters
Metacharacter(s) | Matches... |
---|---|
^ | Start of string |
$ | End of string |
. | Any single character except \n (with the /s modifier, . matches \n as well) |
\n | Newline (subtly different to $ - when working in multiline mode, there may be newlines embedded in the multiline string you're working with) |
\t | Tab |
\s | Any whitespace character, such as space or tab |
\S | Any non-whitespace character |
\d | Any digit (0 to 9) |
\D | Any non-digit |
\w | Any "word" character - alphanumeric plus underscore (_) |
\W | Any non-word character |
\b | A word boundary - the zero-length point between a word character (as defined above) and either a non-word character or the start or end of the string |
These and other metacharacters are all outlined in chapter 2 of the Camel book and in the perlre manpage - type perldoc perlre to read it.
Any character that isn't a metacharacter just matches itself. If you want to match a character that's normally a metacharacter, you can escape it by preceding it with a backslash. For example, \. matches a literal dot rather than "any single character".
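To see the difference escaping makes, here is a minimal sketch. It relies on the =~ matching operator and on qr// (which builds a regular expression value), neither of which has been covered yet; the helper name `matches` and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $pattern, 0 otherwise.
sub matches {
    my ($text, $pattern) = @_;
    return $text =~ $pattern ? 1 : 0;
}

# An unescaped dot is a metacharacter: it matches any character except \n.
print matches("3.14", qr/3.14/), "\n";    # prints 1
print matches("3x14", qr/3.14/), "\n";    # prints 1 - the dot matched "x"

# An escaped dot matches only a literal dot.
print matches("3.14", qr/3\.14/), "\n";   # prints 1
print matches("3x14", qr/3\.14/), "\n";   # prints 0

# The same applies to other metacharacters, such as $.
print matches('cost: $5', qr/\$5/), "\n"; # prints 1
```

The single quotes around 'cost: $5' stop Perl from treating $5 as a variable inside the string.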
Some quick examples:
    # Perl regular expressions are usually found within slashes - the
    # matching operator/function which we will see soon.

    /cat/                   # matches the three characters
                            # c, a, and t in that order.

    /^cat/                  # matches c, a, t at start of line

    /\scat\s/               # matches c, a, t with a whitespace
                            # character on either side

    /\bcat\b/               # like the above, but won't
                            # include the spaces in the text
                            # it matches

    # we can interpolate variables just like in strings:

    my $animal = "dog";     # we set up a scalar variable
    /$animal/               # matches d, o, g
    /$animal$/              # matches d, o, g at end of line

    /\$\d\.\d\d/            # matches a dollar sign, then a
                            # digit, then a dot, then
                            # another digit, then another
                            # digit, eg $9.99
What if, in our last example, we'd wanted to say "Match a dollar, then any number of digits, then a dot, then two more digits"? What we need are quantifiers.
Table 8-3. Regular expression quantifiers
Quantifier | Meaning |
---|---|
? | 0 or 1 |
* | 0 or more |
+ | 1 or more |
{n} | match exactly n times |
{n,} | match n or more times |
{n,m} | match between n and m times |
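A quantifier applies to the item immediately before it. The following sketch runs each quantifier from the table against a few made-up strings; the helper name `matching` and the sample data are assumptions for illustration, and it uses =~, qr// and grep, which are not covered in this section.

```perl
use strict;
use warnings;

# Hypothetical helper: return the candidates that match the pattern.
sub matching {
    my ($pattern, @candidates) = @_;
    return grep { $_ =~ $pattern } @candidates;
}

my @samples = ("ac", "abc", "abbc", "abbbc");

# In each pattern the quantifier applies only to the "b" before it.
print join(" ", matching(qr/^ab?c$/,     @samples)), "\n";  # ac abc
print join(" ", matching(qr/^ab*c$/,     @samples)), "\n";  # ac abc abbc abbbc
print join(" ", matching(qr/^ab+c$/,     @samples)), "\n";  # abc abbc abbbc
print join(" ", matching(qr/^ab{2}c$/,   @samples)), "\n";  # abbc
print join(" ", matching(qr/^ab{2,}c$/,  @samples)), "\n";  # abbc abbbc
print join(" ", matching(qr/^ab{1,2}c$/, @samples)), "\n";  # abc abbc
```

The ^ and $ anchors are there so that each pattern has to account for the whole string; without them, /ab?c/ would also succeed on any longer string that merely contains "ac" or "abc".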
Regular expression quantifiers are, by default, "greedy". This means that a quantified pattern, for instance .*, will match as much as it possibly can while still allowing the overall match to succeed. Greediness is sometimes referred to as "maximal matching".
To change this behaviour, follow the quantifier with a question mark, for example .*?. This is sometimes referred to as "minimal matching".
    $string = "abracadabra";
    /a.*a/      # greedy -- matches "abracadabra"
    /a.*?a/     # not greedy -- matches "abra"
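If you would like to check this for yourself, the sketch below prints the text each pattern actually matched. It relies on the special variable $&, which holds the text of the last successful match, and on =~ and qr//, which are not covered in this section; the helper name `matched_text` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text the pattern matched,
# or undef if there was no match.
sub matched_text {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? $& : undef;
}

my $string = "abracadabra";

print matched_text($string, qr/a.*a/),  "\n";   # abracadabra
print matched_text($string, qr/a.*?a/), "\n";   # abra

# The question mark works after any quantifier, not just *.
print matched_text($string, qr/ab?/),   "\n";   # ab
print matched_text($string, qr/ab??/),  "\n";   # a
```

Note that in both cases the match starts at the first "a" in the string; greediness only affects how far the match extends from there.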
You now know enough to work out the price example above. Work it through.
Another example: what regular expression would match the word "colour" with either British or American spellings?
How can we match any four-letter word?