Table Of Contents
Regular Expressions
Supported Syntax
Positional Operators
One-Character Operators
Character Class Operator
Subexpressions and Back References
Branching (Alternation) Operator
Repeating Operators
Stingy (Minimal) Matching
Lookahead
Unsupported Syntax
Regular Expressions
This section is based on the documentation for the GNU RegExp package (gnu.regexp).
A regular expression is a character string in which some characters have special meaning for pattern matching. Regular expressions have been in use since the early days of computing and provide a powerful, efficient way to parse, interpret, search, and replace text within an application.
Topics include:
•
Supported Syntax
•
Unsupported Syntax
Supported Syntax
Within a regular expression, certain characters have special meaning. They fall into the following categories:
•
Positional Operators
•
One-Character Operators
•
Character Class Operator
•
Subexpressions and Back References
•
Branching (Alternation) Operator
•
Repeating Operators
•
Stingy (Minimal) Matching
•
Lookahead
Positional Operators
•
^ matches at the beginning of a line.
•
$ matches at the end of a line.
•
\A matches the start of the entire string.
•
\Z matches the end of the entire string.
•
\b matches at a word break (Perl5 syntax only).
•
\B matches at any position that is not a word break (opposite of \b) (Perl5 syntax only).
•
\< matches at the start of a word (egrep syntax only).
•
\> matches at the end of a word (egrep syntax only).
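The positional operators can be tried out with a short sketch. It uses Python's standard re module as a stand-in, not gnu.regexp itself, so details may differ: Python needs the re.MULTILINE flag before ^ and $ match at line boundaries, and it does not support the egrep-style \< and \> operators, which are therefore left out. The sample strings are made up for illustration.

```python
import re

text = "first line\nsecond line"

# ^ and $ at line boundaries; Python's re needs re.MULTILINE for this behavior.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

# \A and \Z anchor to the whole string, regardless of line breaks.
string_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)
string_end = re.findall(r"\w+\Z", text, flags=re.MULTILINE)

# \b matches at a word break; \B matches where there is no word break.
words = "line inline lines line"
whole_words = [m.start() for m in re.finditer(r"\bline\b", words)]
inside_words = [m.start() for m in re.finditer(r"\Bline", words)]

print(line_starts, line_ends, string_start, string_end, whole_words, inside_words)
```

Here ^ and $ each find one match per line, while \A and \Z each find a single match for the whole string.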
One-Character Operators
•
. matches any single character.
•
\d matches any decimal digit.
•
\D matches any nondigit.
•
\n matches a newline character.
•
\r matches a return character.
•
\s matches any whitespace character.
•
\S matches any nonwhitespace character.
•
\t matches a horizontal tab character.
•
\w matches any word (alphanumeric) character.
•
\W matches any nonword (nonalphanumeric) character.
•
\x matches the literal character x, provided that \x is not one of the escape sequences listed above.
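As a rough illustration of the one-character operators, the following sketch again relies on Python's standard re module rather than gnu.regexp, so edge cases may differ (for instance, Python's \w also accepts the underscore, and . skips newlines unless re.DOTALL is set). The sample string is invented.

```python
import re

sample = "Order 66:\tship 3 crates\n"

digits = re.findall(r"\d", sample)                      # every decimal digit
leading_non_digits = re.match(r"\D+", sample).group()   # run of nondigits at the start
whitespace = re.findall(r"\s", sample)                  # spaces, tab, and newline
words = re.findall(r"\w+", sample)                      # runs of word characters
non_word = re.findall(r"\W", sample)                    # everything that is not a word character
tabs = re.findall(r"\t", sample)                        # horizontal tab only

# . matches any single character between "c" and "t".
any_char = re.findall(r"c.t", "cat cot c-t")

# A backslash before a special character makes it literal.
literal_dot = re.findall(r"\.", "v1.2")

print(digits, words, any_char, literal_dot)
```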
Character Class Operator
•
[abc] matches any character in the set a, b, or c.
•
[^abc] matches any character not in the set a, b, or c.
•
[a-z] matches any character in the range a to z, inclusive.
•
A leading or trailing dash is interpreted literally; for example, [a-] matches either a or a dash.
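The character class forms above can be sketched as follows, using Python's standard re module as a stand-in for gnu.regexp. The sample string is made up, and the POSIX named classes such as [:alnum:] are not shown because Python's re does not support them.

```python
import re

sample = "cab-bage 42"

in_set = re.findall(r"[abc]", sample)          # characters a, b, or c
not_in_set = re.findall(r"[^abc]", sample)     # anything except a, b, or c
in_range = re.findall(r"[a-z]+", sample)       # runs of lowercase letters

# A leading or trailing dash is literal, so the hyphenated word stays whole.
leading_dash = re.findall(r"[-a-z]+", sample)
trailing_dash = re.findall(r"[a-z-]+", sample)

print(in_set, not_in_set, in_range, leading_dash, trailing_dash)
```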
Within a character class expression, the following sequences have special meaning if the syntax bit RE_CHAR_CLASSES is on:
•
[:alnum:] Any alphanumeric character
•
[:alpha:] Any alphabetical character
•
[:blank:] A space or horizontal tab
•
[:cntrl:] A control character
•
[:digit:] A decimal digit
•
[:graph:] A nonspace, noncontrol character
•
[:lower:] A lowercase letter
•
[:print:] Same as graph, but also space and tab
•
[:punct:] A punctuation character
•
[:space:] Any whitespace character, including newline and return
•
[:upper:] An uppercase letter
•
[:xdigit:] A valid hexadecimal digit
Subexpressions and Back References
•
(abc) matches whatever the expression abc would match, and saves it as a subexpression. Also used for grouping.
•
(?:...) is a pure grouping operator; it does not save its contents.
•
(?#...) is an embedded comment, ignored by the engine.
•
\n, where n is a digit from 1 to 9, matches the same text that the nth subexpression matched.
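A small sketch of subexpressions and back references, written against Python's standard re module as an assumption (gnu.regexp exposes saved subexpressions through its own API, which is not shown here). The input strings are illustrative only.

```python
import re

# (...) groups and saves what it matched as a numbered subexpression.
saved = re.search(r"(\w+)@(\w+)", "user@host").groups()

# (?:...) groups without saving, so only the (c) subexpression is recorded.
non_capturing = re.search(r"(?:ab)+(c)", "ababc").groups()

# (?#...) is a comment and does not affect the match.
commented = re.search(r"a(?#ignored by the engine)b", "ab").group()

# \1 must match the same text the first subexpression matched,
# which makes it easy to spot an accidentally doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

print(saved, non_capturing, commented, doubled)
```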
Branching (Alternation) Operator
a|b matches whatever the expression a would match, or whatever the expression b would match.
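For illustration, here is alternation in Python's standard re module (a stand-in, not gnu.regexp). The word lists are hypothetical. The second pattern wraps the alternatives in a group so that only the middle letter varies.

```python
import re

# Either alternative may match at each position.
animals = re.findall(r"cat|dog", "cat, dog, bird")

# Grouping limits the scope of | to the single letter that differs.
candidates = ["gray", "grey", "groy"]
spellings = [w for w in candidates if re.fullmatch(r"gr(?:a|e)y", w)]

print(animals, spellings)
```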
Repeating Operators
These symbols operate on the previous atomic expression:
•
? matches the preceding expression or the null string.
•
* matches the null string or any number of repetitions of the preceding expression.
•
+ matches one or more repetitions of the preceding expression.
•
{m} matches exactly m repetitions of the preceding expression.
•
{m,n} matches between m and n repetitions of the preceding expression, inclusive.
•
{m,} matches m or more repetitions of the preceding expression.
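The repeating operators can be compared side by side in a short sketch. It assumes Python's standard re module instead of gnu.regexp, and the test strings are invented.

```python
import re

# ? : the preceding expression is optional.
optional = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")]

# * : zero or more repetitions; + : one or more.
star = [bool(re.fullmatch(r"ab*", w)) for w in ("a", "ab", "abbb")]
plus = [bool(re.fullmatch(r"ab+", w)) for w in ("a", "ab", "abbb")]

numbers = "1 22 333 4444 55555"

# {m}, {m,n}, and {m,} count repetitions of the preceding expression.
exactly_four = re.findall(r"\b\d{4}\b", numbers)
two_to_four = re.findall(r"\b\d{2,4}\b", numbers)
two_or_more = re.findall(r"\b\d{2,}\b", numbers)

print(optional, star, plus, exactly_four, two_to_four, two_or_more)
```

The \b anchors keep each count tied to a whole number; without them, {2,4} would also match the first four digits of 55555.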
Stingy (Minimal) Matching
If a repeating operator (above) is immediately followed by a question mark (?), the repeating operator stops at the smallest number of repetitions that can complete the rest of the match.
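The difference between ordinary and stingy matching is easiest to see on the same input. This is a minimal sketch using Python's standard re module, which is assumed to follow the same Perl5-style rule as described here. The markup string is made up.

```python
import re

markup = "<b>bold</b>"

# Without the question mark, .+ takes as many characters as it can.
greedy = re.search(r"<.+>", markup).group()

# With the question mark, .+? stops at the first point where > can match.
stingy = re.search(r"<.+?>", markup).group()
all_tags = re.findall(r"<.+?>", markup)

print(greedy, stingy, all_tags)
```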
Lookahead
Lookahead refers to the ability to match part of an expression without consuming any of the input text. There are two variations:
•
(?=foo) matches at any position where foo would match, but does not consume any characters of the input.
•
(?!foo) matches at any position where foo would not match, but does not consume any characters of the input.
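Both lookahead forms can be sketched as below, with Python's standard re module standing in for gnu.regexp. The sample text is hypothetical. Note that the looked-ahead text never appears in the match itself.

```python
import re

pairs = "key: value, other: thing"

# (?=:) requires a colon to follow, but the colon is not part of the match.
before_colon = re.findall(r"\w+(?=:)", pairs)

# (?!:) requires that a colon does NOT follow the whole word.
not_before_colon = re.findall(r"\b\w+\b(?!:)", pairs)

# The lookahead consumes nothing: the match ends right after "foo".
m = re.search(r"foo(?=bar)", "foobar")
matched_text, match_end = m.group(), m.end()

print(before_colon, not_before_colon, matched_text, match_end)
```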
Unsupported Syntax
Some flavors of regular expression utilities support additional escape sequences and constructs. The list below is not exhaustive. In the future, gnu.regexp might support some or all of these:
•
(?mods) inlined compilation/execution modifiers (Perl5)
•
\G end of previous match (Perl5)
•
[.symbol.] collating symbol in class expression (POSIX)
•
[=class=] equivalence class in class expression (POSIX)
•
s/foo/bar/ style expressions as in sed and awk (these can be accomplished through other means in the API).