The syntax and semantics of PCRE regular expressions, as used in Monotone, are described in detail below. Regular expressions in general are covered in a number of books, some of which have copious examples. Jeffrey Friedl’s “Mastering Regular Expressions,” published by O’Reilly, covers regular expressions in great detail. This description is intended as reference material.
A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern
The quick brown fox
matches a portion of a subject string that is identical to itself. When caseless matching is specified, letters are matched independently of case.
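A minimal sketch of the literal example, using Python's re module as a stand-in. This is an assumption for illustration: re is not PCRE, but literal and caseless matching behave the same way for this pattern. The subject strings are invented.

```python
import re

pattern = "The quick brown fox"
subject = "Then The quick brown fox jumped"

# Most characters stand for themselves, so the pattern matches the
# identical portion of the subject.
match = re.search(pattern, subject)
matched_text = match.group(0)

# Caseless matching: letters are matched independently of case.
caseless = re.search(pattern, "THE QUICK BROWN FOX", re.IGNORECASE)
case_sensitive = re.search(pattern, "THE QUICK BROWN FOX")
```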
The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of metacharacters, which do not stand for themselves but instead are interpreted in some special way.
There are two different sets of metacharacters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized within square brackets. Outside square brackets, the metacharacters are as follows:
\    general escape character with several uses
^    assert start of string (or line, in multiline mode)
$    assert end of string (or line, in multiline mode)
.    match any character except newline (by default)
[    start character class definition
|    start of alternative branch
(    start subpattern
)    end subpattern
?    extends the meaning of ‘(’; also 0 or 1 quantifier; also quantifier minimizer
*    0 or more quantifier
+    1 or more quantifier; also “possessive quantifier”
{    start min/max quantifier
Part of a pattern that is in square brackets is called a "character class". In a character class the only metacharacters are:
\    general escape character
^    negate the class, but only if the first character
-    indicates character range
[    POSIX character class (only if followed by POSIX syntax)
]    terminates the character class
The following sections describe the use of each of the metacharacters.
The backslash character has several uses. Firstly, if it is followed by a non-alphanumeric character, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.
For example, if you want to match a ‘*’ character, you write ‘\*’ in the pattern. This escaping action applies whether or not the following character would otherwise be interpreted as a metacharacter, so it is always safe to precede a non-alphanumeric with backslash to specify that it stands for itself. In particular, if you want to match a backslash, you write ‘\\’.
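The escaping rule can be tried out with a short sketch. Python's re module is used here only as a stand-in (an assumption: it is not PCRE, but it treats a backslash before a non-alphanumeric character the same way). The subject strings are invented.

```python
import re

# '\*' matches a literal asterisk rather than acting as a quantifier.
star = re.search(r"\*", "2*3")

# '\\' matches a single literal backslash.
backslash = re.search(r"\\", "C:\\temp")

# Escaping a non-alphanumeric that is not a metacharacter is harmless.
at_sign = re.search(r"\@", "user@example")
```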
If a pattern is compiled with the ‘(?x)’ option, whitespace in the pattern (other than in a character class) and characters between a ‘#’ outside a character class and the next newline are ignored. An escaping backslash can be used to include a whitespace or ‘#’ character as part of the pattern.
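As a hedged illustration of the ‘(?x)’ option, the following sketch uses Python's re module, whose verbose mode follows the same rules for whitespace, ‘#’ comments, and character classes. The pattern itself is invented for the example.

```python
import re

pattern = r"""(?x)
    foo     # whitespace and this comment are ignored
    \ bar   # the escaped space is part of the pattern
    [ #]    # inside a character class, space and hash are literal
"""

with_hash = re.fullmatch(pattern, "foo bar#")
with_space = re.fullmatch(pattern, "foo bar ")
no_space = re.fullmatch(pattern, "foobar#")
```

The pattern is equivalent to ‘foo bar[ #]’ written without the option.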
If you want to remove the special meaning from a sequence of characters, you can do so by putting them between ‘\Q’ and ‘\E’. The ‘\Q...\E’ sequence is recognized both inside and outside character classes.
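Python's re module does not recognize ‘\Q’ and ‘\E’, so the sketch below uses re.escape as the nearest analogue. This is an assumption made for illustration and is not the PCRE mechanism itself; the literal text is invented.

```python
import re

literal = "1+1=2 (really?)"

# re.escape plays the role of writing \Q1+1=2 (really?)\E in PCRE.
pattern = re.escape(literal)
matches_literal = re.fullmatch(pattern, literal) is not None

# Without escaping, '+', '(', ')' and '?' act as metacharacters,
# so the text no longer matches itself.
unescaped_matches = re.fullmatch(literal, literal) is not None
```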
A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern, but when a pattern is being prepared by text editing, it is usually easier to use one of the following escape sequences than the binary character it represents:
\a          alarm, that is, the BEL character (hex 07)
\cx         "control-x", where x is any character
\e          escape (hex 1B)
\f          formfeed (hex 0C)
\n          linefeed (hex 0A)
\r          carriage return (hex 0D)
\t          tab (hex 09)
\ddd        character with octal code ddd, or backreference
\xhh        character with hex code hh
\x{hhh...}  character with hex code hhh...
The precise effect of ‘\cx’ is as follows: if x is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus ‘\cz’ becomes hex 1A (the SUB control character, in ASCII), but ‘\c{’ becomes hex 3B (‘;’), and ‘\c;’ becomes hex 7B (‘{’).
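The arithmetic above can be written out as a small sketch. The function name control_escape is hypothetical; it only reproduces the stated rule and does not call into PCRE.

```python
def control_escape(x: str) -> int:
    """Return the character code that \\cx produces, per the stated rule."""
    code = ord(x)
    if ord("a") <= code <= ord("z"):   # lower case letters are upper-cased
        code -= 0x20
    return code ^ 0x40                 # invert bit 6 (hex 40)


code_z = control_escape("z")           # SUB control character
code_brace = control_escape("{")       # ';'
code_semicolon = control_escape(";")   # '{'
```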
After ‘\x’, from zero to two hexadecimal digits are read (letters can be in upper or lower case). Any number of hexadecimal digits may appear between ‘\x{’ and ‘}’, but the value of the character code must be less than 256 in non-UTF-8 mode, and less than 2^31 in UTF-8 mode. That is, the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code point, which is 10FFFF.
If characters other than hexadecimal digits appear between ‘\x{’ and ‘}’, or if there is no terminating ‘}’, this form of escape is not recognized. Instead, the initial ‘\x’ will be interpreted as a basic hexadecimal escape, with no following digits, giving a character whose value is zero.
Characters whose value is less than 256 can be defined by either of the two syntaxes for ‘\x’. There is no difference in the way they are handled. For example, ‘\xdc’ is exactly the same as ‘\x{dc}’.
After ‘\0’ up to two further octal digits are read. If there are fewer than two digits, just those that are present are used. Thus the sequence ‘\0\x\07’ specifies two binary zeros followed by a BEL character (octal 007). Make sure you supply two digits after the initial zero if the pattern character that follows is itself an octal digit.
The handling of a backslash followed by a digit other than 0 is complicated. Outside a character class, PCRE reads it and any following digits as a decimal number. If the number is less than 10, or if there have been at least that many previous capturing left parentheses in the expression, the entire sequence is taken as a back reference. A description of how this works is given later, following the discussion of parenthesized subpatterns.
Inside a character class, or if the decimal number is greater than 9 and there have not been that many capturing subpatterns, PCRE re-reads up to three octal digits following the backslash, and uses them to generate a data character. Any subsequent digits stand for themselves. In non-UTF-8 mode, the value of a character specified in octal must be less than ‘\400’. In UTF-8 mode, values up to ‘\777’ are permitted. For example:
\040   is another way of writing a space
\40    is the same, provided there are fewer than 40 previous capturing subpatterns
\7     is always a back reference
\11    might be a back reference, or another way of writing a tab
\011   is always a tab
\0113  is a tab followed by the character ‘3’
\113   might be a back reference, otherwise the character with octal code 113
\377   might be a back reference, otherwise the byte consisting entirely of 1 bits
\81    is either a back reference, or a binary zero followed by the two characters ‘8’ and ‘1’
Note that octal values of 100 or greater must not be introduced by a leading zero, because no more than three octal digits are ever read.
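The decision rule for a backslash followed by a non-zero digit can be sketched as follows. The function interpret_digit_escape and its return shape are hypothetical, written only from the description above. It does not cover sequences that begin with ‘\0’, and it does not enforce the ‘\400’ limit of non-UTF-8 mode.

```python
def interpret_digit_escape(digits: str, prior_groups: int, in_class: bool = False):
    """Classify a backslash followed by `digits` (first digit is not 0).

    prior_groups is the number of capturing left parentheses seen so far.
    Returns ("backref", n) or ("char", code, trailing_literal_digits).
    """
    assert digits and digits.isdigit() and digits[0] != "0"
    number = int(digits)
    if not in_class and (number < 10 or number <= prior_groups):
        return ("backref", number)
    # Otherwise re-read up to three octal digits as a data character.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    code = int(octal, 8) if octal else 0   # no octal digits: binary zero
    return ("char", code, digits[len(octal):])
```

For instance, interpret_digit_escape("11", 0) yields a tab (code 9), while the same digits with eleven prior groups yield a back reference.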
All the sequences that define a single character value can be used both inside and outside character classes. In addition, inside a character class, the sequence ‘\b’ is interpreted as the BS character (hex 08), and the sequences ‘\R’ and ‘\X’ are interpreted as the characters ‘R’ and ‘X’, respectively. Outside a character class, these sequences have different meanings (see below).
The sequence ‘\g’ followed by an unsigned or a negative number, optionally enclosed in braces, is an absolute or relative back reference. A named back reference can be coded as ‘\g{name}’. Back references are discussed later, following the discussion of parenthesized subpatterns.
Another use of backslash is for specifying generic character types. The following are always recognized:
\d   any decimal digit
\D   any character that is not a decimal digit
\h   any horizontal whitespace character
\H   any character that is not a horizontal whitespace character
\s   any whitespace character
\S   any character that is not a whitespace character
\v   any vertical whitespace character
\V   any character that is not a vertical whitespace character
\w   any “word” character
\W   any “non-word” character
Each pair of escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair.
These character type sequences can appear both inside and outside character classes. They each match one character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, since there is no character to match.
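A short sketch of these properties, using Python's re module with the re.ASCII flag as a stand-in. This is an assumption: re is not PCRE, and only ‘\d’, ‘\D’, and ‘\w’ are exercised here because their ASCII behavior is the same. The subject strings are invented.

```python
import re

# re.ASCII restricts \d and \w to their ASCII meanings.
digits = re.findall(r"\d", "a1b22", re.ASCII)
words = re.findall(r"\w+", "foo_bar baz-9", re.ASCII)

# Each pair partitions the characters: exactly one of \d and \D matches.
partitioned = all(
    (re.fullmatch(r"\d", ch, re.ASCII) is None)
    != (re.fullmatch(r"\D", ch, re.ASCII) is None)
    for ch in "a7 _\n"
)

# At the end of the subject there is no character left to match.
at_end = re.search(r"x\D", "x", re.ASCII)
```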
For compatibility with Perl, ‘\s’ does not match the VT character (code 11). This makes it different from the POSIX “space” class. The ‘\s’ characters are TAB (9), LF (10), FF (12), CR (13), and SPACE (32).
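The five characters can be written as an explicit class, which makes the exclusion of VT visible. The sketch uses Python's re module, whose own ‘\s’ does match VT, so the explicit class is what mirrors the set described here. The name PCRE_SPACE is hypothetical.

```python
import re

# TAB, LF, FF, CR and SPACE, written as an explicit character class.
PCRE_SPACE = re.compile(r"[\t\n\f\r ]")

matches_tab = PCRE_SPACE.fullmatch("\t") is not None
matches_vt = PCRE_SPACE.fullmatch("\x0b") is not None    # VT, code 11

# Python's \s differs: it includes VT.
python_s_matches_vt = re.fullmatch(r"\s", "\x0b") is not None
```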
In UTF-8 mode, characters with values greater than 128 never match ‘\d’, ‘\s’, or ‘\w’, and always match ‘\D’, ‘\S’, and ‘\W’. These sequences retain their original meanings from before UTF-8 support was available, mainly for efficiency reasons.
The sequences ‘\h’, ‘\H’, ‘\v’, and ‘\V’ are Perl 5.10 features. In contrast to the other sequences, these do match certain high-valued codepoints in UTF-8 mode. The horizontal space characters are:
U+0009   Horizontal tab
U+0020   Space
U+00A0   Non-break space
U+1680   Ogham space mark
U+180E   Mongolian vowel separator
U+2000   En quad
U+2001   Em quad
U+2002   En space
U+2003   Em space
U+2004   Three-per-em space
U+2005   Four-per-em space
U+2006   Six-per-em space
U+2007   Figure space
U+2008   Punctuation space
U+2009   Thin space
U+200A   Hair space
U+202F   Narrow no-break space
U+205F   Medium mathematical space
U+3000   Ideographic space
The vertical space characters are:
U+000A   Linefeed
U+000B   Vertical tab
U+000C   Formfeed
U+000D   Carriage return
U+0085   Next line
U+2028   Line separator
U+2029   Paragraph separator
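Where an engine lacks ‘\h’ and ‘\v’ (Python's re module is one such case), the two lists above can be turned into explicit character classes. The following is a sketch under that assumption; the names HORIZONTAL, VERTICAL, and char_class are hypothetical.

```python
import re

HORIZONTAL = [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
              *range(0x2000, 0x200B),        # U+2000 through U+200A
              0x202F, 0x205F, 0x3000]
VERTICAL = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]


def char_class(code_points):
    """Compile a one-character class from a list of code points."""
    body = "".join(re.escape(chr(cp)) for cp in code_points)
    return re.compile("[" + body + "]")


H = char_class(HORIZONTAL)   # stands in for \h
V = char_class(VERTICAL)     # stands in for \v
```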
A “word” character is an underscore or any character less than 256 that is a letter or digit. The definition of letters and digits is that used for the “C” locale.
PCRE supports five different conventions for indicating line breaks in strings: a single CR (carriage return) character, a single LF (linefeed) character, the two-character sequence CRLF, any of the three preceding, or any Unicode newline sequence. The default is to match any Unicode newline sequence. It is possible to override the default newline convention by starting a pattern string with one of the following five sequences:
(*CR)        carriage return
(*LF)        linefeed
(*CRLF)      carriage return, followed by linefeed
(*ANYCRLF)   any of the three above
(*ANY)       all Unicode newline sequences
For example, the pattern
(*CR)a.b
changes the convention to CR. That pattern matches ‘a\nb’ because LF is no longer a newline. Note that these special settings, which are not Perl-compatible, are recognized only at the very start of a pattern, and that they must be in upper case. If more than one of them is present, the last one is used.
The newline convention does not affect what the ‘\R’ escape sequence matches. By default, this is any Unicode newline sequence, for Perl compatibility. However, this can be changed; see the description of ‘\R’ below. A change of ‘\R’ setting can be combined with a change of newline convention.
Outside a character class, by default, the escape sequence ‘\R’ matches any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode ‘\R’ is equivalent to the following:
(?>\r\n|\n|\x0b|\f|\r|\x85)
This is an example of an "atomic group", details of which are given below. This particular group matches either the two-character sequence CR followed by LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next line, U+0085). The two-character sequence is treated as a single unit that cannot be split. In UTF-8 mode, two additional characters whose codepoints are greater than 255 are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
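A rough sketch of the non-UTF-8 equivalent, using Python's re module, which has no ‘\R’. A plain alternation is used instead of an atomic group, which is an assumption made for portability: it behaves the same when used on its own, as here, but inside a larger pattern backtracking could split a CRLF pair. The subject string is invented.

```python
import re

# CRLF is listed first so that it is consumed as one unit.
NEWLINE = re.compile(r"\r\n|\n|\x0b|\f|\r|\x85")

lines = NEWLINE.split("one\r\ntwo\nthree\x85four\rfive")
```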
It is possible to change the meaning of ‘\R’ by starting a pattern string with one of the following sequences:
(*BSR_ANYCRLF)   CR, LF, or CRLF only
(*BSR_UNICODE)   any Unicode newline sequence (the default)
Note that these special settings, which are not Perl-compatible, are recognized only at the very start of a pattern, and that they must be in upper case. If more than one of them is present, the last one is used. They can be combined with a change of newline convention, for example, a pattern can start with:
(*ANY)(*BSR_ANYCRLF)
Inside a character class, ‘\R’ matches the letter ‘R’.
Three additional escape sequences match characters with specific Unicode properties. When not in UTF-8 mode, these sequences are of course limited to testing characters whose codepoints are less than 256, but they do work in this mode. The extra escape sequences are:
\p{xx}   a character with the xx property
\P{xx}   a character without the xx property
\X       an extended Unicode sequence
The property names represented by xx above are limited to the Unicode script names, the general category properties, and ‘Any’, which matches any character (including newline). Other properties such as ‘InMusicalSymbols’ are not currently supported by PCRE. Note that ‘\P{Any}’ does not match any characters, so always causes a match failure.
Sets of Unicode characters are defined as belonging to certain scripts. A character from one of these sets can be matched using a script name. For example:
\p{Greek}
\P{Han}
Those that are not part of an identified script are lumped together as “Common.” The current list of scripts is:
Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
Each character has exactly one general category property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by including a circumflex between the opening brace and the property name. For example, ‘\p{^Lu}’ is the same as ‘\P{Lu}’.
If only one letter is specified with ‘\p’ or ‘\P’, it includes all the general category properties that start with that letter. In this case, in the absence of negation, the curly brackets in the escape sequence are optional; these two examples have the same effect:
\p{L}
\pL
The following general category property codes are supported:
C    Other
Cc   Control
Cf   Format
Cn   Unassigned
Co   Private use
Cs   Surrogate
L    Letter
Ll   Lower case letter
Lm   Modifier letter
Lo   Other letter
Lt   Title case letter
Lu   Upper case letter
M    Mark
Mc   Spacing mark
Me   Enclosing mark
Mn   Non-spacing mark
N    Number
Nd   Decimal number
Nl   Letter number
No   Other number
P    Punctuation
Pc   Connector punctuation
Pd   Dash punctuation
Pe   Close punctuation
Pf   Final punctuation
Pi   Initial punctuation
Po   Other punctuation
Ps   Open punctuation
S    Symbol
Sc   Currency symbol
Sk   Modifier symbol
Sm   Mathematical symbol
So   Other symbol
Z    Separator
Zl   Line separator
Zp   Paragraph separator
Zs   Space separator
The special property ‘L&’ is also supported: it matches a character that has the ‘Lu’, ‘Ll’, or ‘Lt’ property, in other words, a letter that is not classified as a modifier or “other.”
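The way two-letter codes, one-letter codes, and ‘L&’ relate can be sketched with Python's unicodedata module, which reports the same general category abbreviations. This is an illustration only: the function has_property is hypothetical, and Python's Unicode tables may be a different version from the ones PCRE uses.

```python
import unicodedata


def has_property(ch: str, prop: str) -> bool:
    """Emulate \\p{prop} for general category codes and the special L&."""
    category = unicodedata.category(ch)
    if prop == "L&":
        return category in ("Lu", "Ll", "Lt")
    if len(prop) == 1:
        # A single letter covers every category starting with that letter.
        return category.startswith(prop)
    return category == prop
```

For example, U+02B0 (a modifier letter, ‘Lm’) has the ‘L’ property but not ‘L&’.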
The ‘Cs’ (Surrogate) property applies only to characters in the range U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see RFC 3629) and so cannot be tested by PCRE.
The long synonyms for these properties that Perl supports (such as ‘\p{Letter}’) are not supported by PCRE, nor is it permitted to prefix any of these properties with ‘Is’.
No character that is in the Unicode table has the ‘Cn’ (unassigned) property. Instead, this property is assumed for any code point that is not in the Unicode table.
Specifying caseless matching does not affect these escape sequences. For example, ‘\p{Lu}’ always matches only upper case letters.
The ‘\X’ escape matches any number of Unicode characters that form an extended Unicode sequence. ‘\X’ is equivalent to
(?>\PM\pM*)
That is, it matches a character without the “mark” property, followed by zero or more characters with the “mark” property, and treats the sequence as an atomic group (see below). Characters with the “mark” property are typically accents that affect the preceding character. None of them have codepoints less than 256, so in non-UTF-8 mode ‘\X’ matches any one character.
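The behavior of ‘(?>\PM\pM*)’ can be sketched without a regex engine, again leaning on Python's unicodedata module for the “mark” test. The function names are hypothetical, and the Unicode tables may differ in version from PCRE's.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True if ch has a general category starting with M (the mark property)."""
    return unicodedata.category(ch).startswith("M")


def match_extended_sequence(subject: str, pos: int = 0):
    """Return the end offset of \\PM\\pM* at pos, or None if it cannot match."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return None                    # \PM needs one non-mark character
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1                       # \pM* takes every following mark
    return end
```

Because the loop never gives back a mark once it is consumed, the sketch has the same all-or-nothing behavior as the atomic group.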
Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as ‘\d’ and ‘\w’ do not use Unicode properties in PCRE.
The escape sequence ‘\K’, which is a Perl 5.10 feature, causes any previously matched characters not to be included in the final matched sequence. For example, the pattern:
foo\Kbar
matches ‘foobar’, but reports that it has matched ‘bar’. This feature is similar to a lookbehind assertion (described below). However, in this case, the part of the subject before the real match does not have to be of fixed length, as lookbehind assertions do. The use of ‘\K’ does not interfere with the setting of captured substrings. For example, when the pattern
(foo)\Kbar
matches ‘foobar’, the first substring is still set to ‘foo’.
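Python's re module does not support ‘\K’, so the sketch below approximates the reported result with an extra capturing group. This is an emulation chosen for illustration, not the PCRE feature itself; the subject string is invented.

```python
import re

# (foo)\Kbar in PCRE: the overall match is reported as 'bar',
# while the first captured substring is still 'foo'.
match = re.search(r"(foo)(bar)", "xfoobar")

reported = match.group(2)          # what \K would leave as the match
first_substring = match.group(1)   # capture set before the \K point
reported_span = match.span(2)
```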
The final use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are:
\b   matches at a word boundary
\B   matches when not at a word boundary
\A   matches at the start of the subject
\Z   matches at the end of the subject; also matches before a newline at the end of the subject
\z