
Pattern Class Reference

#include <Pattern.h>



Detailed Description

This Pattern class is very similar in functionality to Java's java.util.regex.Pattern class. It represents an immutable regular expression object. Rather than having a single object hold both the regular expression and the matching state, the two are split apart: the Matcher class represents the matching object.

The Pattern class works primarily off of "compiled" patterns. A typical instantiation of a regular expression looks like:

  Pattern * p = Pattern::compile("a*b");
  Matcher * m = p->createMatcher("aaaaaab");
  if (m->matches()) ...
  

However, if you do not need to use a pattern more than once, it is often fine to use Pattern's static methods instead. For example:

  if (Pattern::matches("a*b", "aaaab")) { ... }
  

This class does not currently support Unicode. The Unicode update for this class is coming soon.

This class is partially immutable. It is completely safe to call createMatcher concurrently in different threads, but the other functions (e.g. split) should not be called concurrently on the same Pattern.

Construct Matches
 
Characters
x The character x
\\ The character \
\0nn The character with octal ASCII value nn
\0nnn The character with octal ASCII value nnn
\xhh The character with hexadecimal ASCII value hh
\t A tab character
\r A carriage return character
\n A new-line character
 
Character Classes
[abc] Either a, b, or c
[^abc] Any character but a, b, or c
[a-zA-Z] Any character ranging from a thru z, or A thru Z
[^a-zA-Z] Any character except those ranging from a thru z, or A thru Z
[a\-z] Either a, -, or z
[a-z[A-Z]] Same as [a-zA-Z]
[a-z&&[g-i]] Any character in the intersection of a-z and g-i
[a-z&&[^g-i]] Any character in a-z and not in g-i
 
Predefined character classes
. Any character. Multiline matching must be compiled into the pattern for . to match a \r or a \n. Even if multiline matching is enabled, . will not match a \r\n, only a \r or a \n.
\d [0-9]
\D [^\d]
\s A whitespace character, such as a space, \t, \r, or \n
\S [^\s]
\w [a-zA-Z0-9_]
\W [^\w]
 
POSIX character classes
\p{Lower} [a-z]
\p{Upper} [A-Z]
\p{ASCII} [\x00-\x7F]
\p{Alpha} [a-zA-Z]
\p{Digit} [0-9]
\p{Alnum} [\w&&[^_]]
\p{Punct} [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
\p{XDigit} [a-fA-F0-9]
 
Boundary Matches
^ The beginning of a line. Also matches the beginning of input.
$ The end of a line. Also matches the end of input.
\b A word boundary
\B A non-word boundary
\A The beginning of input
\G The end of the previous match. Ensures that a "next" match will only happen if it begins with the character immediately following the end of the "current" match.
\Z The end of input. Will also match if there is a single trailing \r\n, a single trailing \r, or a single trailing \n.
\z The end of input
 
Greedy Quantifiers
x? x, either zero times or one time
x* x, zero or more times
x+ x, one or more times
x{n} x, exactly n times
x{n,} x, at least n times
x{,m} x, at most m times
x{n,m} x, at least n times and at most m times
 
Possessive Quantifiers
x?+ x, either zero times or one time
x*+ x, zero or more times
x++ x, one or more times
x{n}+ x, exactly n times
x{n,}+ x, at least n times
x{,m}+ x, at most m times
x{n,m}+ x, at least n times and at most m times
 
Reluctant Quantifiers
x?? x, either zero times or one time
x*? x, zero or more times
x+? x, one or more times
x{n}? x, exactly n times
x{n,}? x, at least n times
x{,m}? x, at most m times
x{n,m}? x, at least n times and at most m times
 
Operators
xy x then y
x|y x or y
(x) x as a capturing group
 
Quoting
\Q Nothing, but treat every character (including \) literally until a matching \E
\E Nothing, but ends its matching \Q
 
Special Constructs
(?:x) x, but not as a capturing group
(?=x) x, via positive lookahead. This means that the expression will match only if it is trailed by x. It will not "eat" any of the characters matched by x.
(?!x) x, via negative lookahead. This means that the expression will match only if it is not trailed by x. It will not "eat" any of the characters matched by x.
(?<=x) x, via positive lookbehind. x cannot contain any quantifiers.
(?<!x) x, via negative lookbehind. x cannot contain any quantifiers.
(?>x) x{1}+
 
Registered Expression Matching
{x} The registered pattern x
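
The greedy and reluctant rows in the table above carry identical descriptions, so a small demonstration of how they differ may help, together with the lookahead constructs. The sketch below is not written against this Pattern class: it uses std::regex from the C++ standard library (default ECMAScript grammar) as a stand-in, on the assumption that the two engines agree on these particular constructs. The helper firstMatch is hypothetical. std::regex has no possessive quantifiers, lookbehind, or (?>x), so those are not shown.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the first substring of `input` that `pattern` matches, or an empty
// string if there is no match. std::regex is a stand-in here, not this
// library's Pattern class.
std::string firstMatch(const std::string& pattern, const std::string& input) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(input, m, re) ? m.str(0) : std::string();
}

void quantifierAndLookaheadDemo() {
    // Greedy: .+ takes as much as it can, then backs off to the last '>'.
    assert(firstMatch("<.+>", "<a><b>") == "<a><b>");
    // Reluctant: .+? takes as little as it can, stopping at the first '>'.
    assert(firstMatch("<.+?>", "<a><b>") == "<a>");

    // The same contrast with a counted quantifier.
    assert(firstMatch("a{2,}", "aaaa") == "aaaa");
    assert(firstMatch("a{2,}?", "aaaa") == "aa");

    // Positive lookahead: "bar" must follow, but is not part of the match.
    assert(firstMatch("foo(?=bar)", "foobar") == "foo");
    assert(firstMatch("foo(?=bar)", "foobaz").empty());

    // Negative lookahead: "foo" must not be followed by "bar".
    assert(firstMatch("foo(?!bar)\\w+", "foobar foobaz") == "foobaz");
}
```

Calling quantifierAndLookaheadDemo() runs every check. With this library, the corresponding one-shot call would presumably be the static Pattern::findAll or Pattern::matches, but their exact results for these inputs are not verified here.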


Begin Text Extracted And Modified From java.util.regex.Pattern documentation

Backslashes, escapes, and quoting

The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that otherwise would be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash and \{ matches a left brace.

It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.

It is necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" uses an escape sequence that C++ does not define, and compilers typically warn about or reject it; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
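
To make the doubling rule concrete, the sketch below compares the characters that actually end up in three C++ string literals. It uses only the standard library. The function names are hypothetical, and the raw string literal form (C++11) is an alternative that the text above does not mention.

```cpp
#include <cassert>
#include <string>

// The compiler consumes one level of backslashes before any regex engine
// ever sees the string.
inline std::string compilerEscaped() { return "\b"; }     // 1 char: backspace (0x08)
inline std::string doubled()         { return "\\b"; }    // 2 chars: '\' and 'b'
inline std::string rawLiteral()      { return R"(\b)"; }  // 2 chars: '\' and 'b'

inline void checkLiterals() {
    assert(compilerEscaped().size() == 1 && compilerEscaped()[0] == '\x08');
    assert(doubled().size() == 2 && doubled()[0] == '\\' && doubled()[1] == 'b');
    // A raw string literal carries the same two characters without doubling.
    assert(doubled() == rawLiteral());
    // The pattern text that matches "(hello)", written both ways.
    assert(std::string("\\(hello\\)") == R"(\(hello\))");
}
```

Raw string literals keep longer patterns readable, since the text between R"( and )" reaches the regex engine unchanged.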

Character Classes

Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.

The precedence of character-class operators is as follows, from highest to lowest:

1     Literal escape    \x
2     Range             a-z
3     Grouping          [...]
4     Intersection      [a-z&&[aeiou]]
5     Union             [a-e][i-u]

Note that a different set of metacharacters is in effect inside a character class than outside it. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range-forming metacharacter.
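
The union and intersection operators described in this section can be pictured as ordinary set operations on characters. The sketch below models a character class as a sorted string of distinct characters. The helper names (makeRange, unionOf, intersectionOf, differenceOf) are hypothetical and are not this library's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>

// A character class is modeled as a sorted string of distinct characters.
std::string makeRange(char low, char high) {
    std::string out;
    for (int c = low; c <= high; ++c) out.push_back(static_cast<char>(c));
    return out;
}

// Union: every character that is in at least one operand.
std::string unionOf(const std::string& a, const std::string& b) {
    std::string out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}

// Intersection: every character that is in both operands.
std::string intersectionOf(const std::string& a, const std::string& b) {
    std::string out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Characters of a that are not in b, which is what a&&[^b] selects.
std::string differenceOf(const std::string& a, const std::string& b) {
    std::string out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(out));
    return out;
}

void classOperatorDemo() {
    const std::string lower = makeRange('a', 'z');
    assert(intersectionOf(lower, makeRange('g', 'i')) == "ghi");        // [a-z&&[g-i]]
    assert(intersectionOf(lower, "aeiou") == "aeiou");                  // [a-z&&[aeiou]]
    assert(differenceOf(lower, makeRange('g', 'i')).size() == 23);      // [a-z&&[^g-i]]
    assert(unionOf(makeRange('a', 'e'), makeRange('i', 'u')).size() == 18); // [a-e[i-u]]
}
```

The standard set algorithms require sorted inputs, which is why the model keeps each class sorted.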

Groups and capturing

Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

1     ((A)(B(C)))
2     (A)
3     (B(C))
4     (C)

Group zero always stands for the entire expression.

Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.

Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
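
The numbering rule (count opening parentheses from left to right, skipping groups that begin with (?) can be sketched as a short scan over the pattern text. The function capturingGroups is a hypothetical helper, not part of this library. It is deliberately simplified: it skips backslash-escaped characters, ignores character classes, and assumes balanced parentheses.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the text of each capturing group in group-number order
// (element 0 of the result is group 1).
std::vector<std::string> capturingGroups(const std::string& pattern) {
    std::vector<std::string> groups;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] == '\\') { ++i; continue; }   // skip an escaped character
        if (pattern[i] != '(') continue;
        if (i + 1 < pattern.size() && pattern[i + 1] == '?') continue;  // (?...) does not count

        // Find the matching close parenthesis by tracking nesting depth.
        int depth = 0;
        std::size_t j = i;
        for (; j < pattern.size(); ++j) {
            if (pattern[j] == '\\') { ++j; continue; }
            if (pattern[j] == '(') {
                ++depth;
            } else if (pattern[j] == ')' && --depth == 0) {
                break;
            }
        }
        groups.push_back(pattern.substr(i, j - i + 1));
    }
    return groups;
}

void groupNumberingDemo() {
    const std::vector<std::string> g = capturingGroups("((A)(B(C)))");
    assert(g.size() == 4);
    assert(g[0] == "((A)(B(C)))");  // group 1
    assert(g[1] == "(A)");          // group 2
    assert(g[2] == "(B(C))");       // group 3
    assert(g[3] == "(C)");          // group 4
    // A non-capturing group is skipped, so (b) is group 1 here.
    assert(capturingGroups("(?:a)(b)").size() == 1);
}
```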

Unicode support

Coming Soon.

Comparison to Perl 5

The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.

Perl constructs not supported by this class:

Constructs supported by this class but not by Perl:

Notable differences from Perl:

For a more precise description of the behavior of regular expression constructs, please see Mastering Regular Expressions, 2nd Edition, Jeffrey E. F. Friedl, O'Reilly and Associates, 2002.

End Text Extracted And Modified From java.util.regex.Pattern documentation


Author:
Jeffery Stuart
Since:
March 2003, Stable Since November 2004
Version:
1.04

A class used to represent "PERL 5"-ish regular expressions.

Definition at line 968 of file Pattern.h.


Public Member Functions

Matcher * createMatcher (const std::string &str)
std::vector< std::string > findAll (const std::string &str)
unsigned long getFlags () const
std::string getPattern () const
bool matches (const std::string &str)
std::string replace (const std::string &str, const std::string &replacementText)
std::vector< std::string > split (const std::string &str, const bool keepEmptys=0, const unsigned long limit=0)
 ~Pattern ()

Static Public Member Functions

static void clearPatternCache ()
static Pattern * compile (const std::string &pattern, const unsigned long mode=0)
static Pattern * compileAndKeep (const std::string &pattern, const unsigned long mode=0)
static std::vector< std::string > findAll (const std::string &pattern, const std::string &str, const unsigned long mode=0)
static std::pair< std::string, int > findNthMatch (const std::string &pattern, const std::string &str, const int matchNum, const unsigned long mode=0)
static bool matches (const std::string &pattern, const std::string &str, const unsigned long mode=0)
static bool registerPattern (const std::string &name, const std::string &pattern, const unsigned long mode=0)
static std::string replace (const std::string &pattern, const std::string &str, const std::string &replacementText, const unsigned long mode=0)
static std::vector< std::string > split (const std::string &pattern, const std::string &str, const bool keepEmptys=0, const unsigned long limit=0, const unsigned long mode=0)
static void unregisterPatterns ()

Static Public Attributes

static const unsigned long CASE_INSENSITIVE = 0x01
 We should match regardless of case.
static const unsigned long DOT_MATCHES_ALL = 0x04
 We should treat a . as [\x00-\x7F]
static const unsigned long LITERAL = 0x02
 We are implicitly quoted.
static const int MAX_QMATCH = 0x7FFFFFFF
 The absolute maximum number of matches a quantifier can match (0x7FFFFFFF).
static const int MIN_QMATCH = 0x00000000
 The absolute minimum number of matches a quantifier can match (0).
static const unsigned long MULTILINE_MATCHING = 0x08
static const unsigned long UNIX_LINE_MODE = 0x10
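
The mode constants above occupy distinct bits, which suggests they are meant to be combined with bitwise OR and passed as the mode argument. The page does not state this, so treat it as an assumption. The sketch below copies the listed values into a hypothetical namespace flags_sketch rather than including Pattern.h, and hasFlag is likewise a hypothetical helper.

```cpp
// Stand-in constants copying the values listed above. With the real header
// these would be Pattern::CASE_INSENSITIVE and so on.
namespace flags_sketch {

constexpr unsigned long CASE_INSENSITIVE   = 0x01;
constexpr unsigned long LITERAL            = 0x02;
constexpr unsigned long DOT_MATCHES_ALL    = 0x04;
constexpr unsigned long MULTILINE_MATCHING = 0x08;
constexpr unsigned long UNIX_LINE_MODE     = 0x10;

// True if `flag` is set in `mode`.
constexpr bool hasFlag(unsigned long mode, unsigned long flag) {
    return (mode & flag) != 0;
}

// Combine two options into one mode word.
constexpr unsigned long kMode = CASE_INSENSITIVE | MULTILINE_MATCHING;

static_assert(kMode == 0x09, "0x01 | 0x08");
static_assert(hasFlag(kMode, CASE_INSENSITIVE), "bit 0 is set");
static_assert(!hasFlag(kMode, LITERAL), "bit 1 is not set");

}  // namespace flags_sketch
```

Under that assumption, a call against the real class would look like Pattern::compile("a*b", Pattern::CASE_INSENSITIVE | Pattern::MULTILINE_MATCHING).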

Protected Member Functions

std::string classCreateRange (char low, char hi) const
std::string classIntersect (std::string s1, std::string s2) const
std::string classNegate (std::string s1) const
std::string classUnion (std::string s1, std::string s2) const
int getInt (int start, int end)
NFANode * parse (const bool inParen=0, const bool inOr=0, NFANode **end=NULL)
NFANode * parseBackref ()
NFANode * parseBehind (const bool pos, NFANode **end)
std::string parseClass ()
std::string parseEscape (bool &inv, bool &quo)
std::string parseHex ()
std::string parseOctal ()
std::string parsePosix ()
NFANode * parseQuote ()
NFANode * parseRegisteredPattern (NFANode **end)
NFANode * quantify (NFANode *newNode)
bool quantifyCurly (int &sNum, int &eNum)
NFANode * quantifyGroup (NFANode *start, NFANode *stop, const int gn)
void raiseError ()
NFANode * registerNode (NFANode *node)

Protected Attributes

int curInd
bool error
unsigned long flags
int groupCount
NFANode * head
Matcher * matcher
std::map< NFANode *, bool > nodes
int nonCapGroupCount
std::string pattern

Static Protected Attributes

static std::map< std::string, Pattern * > compiledPatterns
static std::map< std::string, std::pair< std::string, unsigned long > > registeredPatterns

Private Member Functions

 Pattern (const std::string &rhs)

Friends

class Matcher
class NFANode
class NFAQuantifierNode
