# List of Regular Expressions

## Regular Expression Operators

| Operator | Description |
| --- | --- |
| `\|` | Alternation. `A\|B` matches either A or B. |
| `*` | Match 0 or more times. Match as many times as possible. |
| `+` | Match 1 or more times. Match as many times as possible. |
| `?` | Match zero or one times. Prefer one. |
| `{n}` | Match exactly n times. |
| `{n,}` | Match at least n times. Match as many times as possible. |
| `{n,m}` | Match between n and m times. Match as many times as possible, but not more than m. |
| `*?` | Match 0 or more times. Match as few times as possible. |
| `+?` | Match 1 or more times. Match as few times as possible. |
| `??` | Match zero or one times. Prefer zero. |
| `{n}?` | Match exactly n times. |
| `{n,}?` | Match at least n times, but no more than required for an overall pattern match. |
| `{n,m}?` | Match between n and m times. Match as few times as possible, but not less than n. |
| `*+` | Match 0 or more times. Match as many times as possible when first encountered; do not retry with fewer even if the overall match fails (possessive match). |
| `++` | Match 1 or more times. Possessive match. |
| `?+` | Match zero or one times. Possessive match. |
| `{n}+` | Match exactly n times. |
| `{n,}+` | Match at least n times. Possessive match. |
| `{n,m}+` | Match between n and m times. Possessive match. |
| `( ...)` | Capturing parentheses. The range of input that matched the parenthesized subexpression is available after the match. |
| `(?: ...)` | Non-capturing parentheses. Groups the included pattern, but does not capture the matching text. Somewhat more efficient than capturing parentheses. |
| `(?> ...)` | Atomic-match parentheses. The first match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the “(?>”. |
| `(?# ...)` | Free-format comment `(?# comment )`. |
| `(?= ...)` | Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position. |
| `(?! ...)` | Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position. |
| `(?<= ...)` | Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no `*` or `+` operators). |
| `(?<! ...)` | Negative look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no `*` or `+` operators). |
| `(?<name>...)` | Named capture group. The angle brackets are literal; they appear in the pattern. |
| `(?ismwx-ismwx:...)` | Flag settings. Evaluate the parenthesized expression with the specified flags enabled or, after the `-`, disabled. |
| `(?ismwx-ismwx)` | Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, `(?i)` changes to a case insensitive match. |
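A few of the operators above can be tried out in a short sketch. It uses Python's standard `re` module as a stand-in, not ICU itself, so it only covers operators whose syntax and behavior the two engines share. Possessive quantifiers and atomic groups are left out because Python only added them in version 3.11, and named groups are left out because Python spells them `(?P<name>...)`. The sample strings are made up for illustration.

```python
import re

# Greedy vs. reluctant: same pattern shape, different amounts consumed.
text = "<b>bold</b> and <i>italic</i>"
greedy = re.search(r"<.*>", text).group()       # runs to the last '>'
reluctant = re.search(r"<.*?>", text).group()   # stops at the first '>'

# Bounded repetition: {2,4} prefers four repeats, {2,4}? prefers two.
most = re.match(r"a{2,4}", "aaaaa").group()
fewest = re.match(r"a{2,4}?", "aaaaa").group()

# Capturing vs. non-capturing parentheses: only (c) becomes a group.
groups = re.search(r"(?:ab)+(c)", "ababc").groups()

# Look-ahead and look-behind test nearby text without consuming it.
before_px = re.findall(r"\d+(?=px)", "10px 20em 30px")
after_dollar = re.findall(r"(?<=\$)\d+", "cost $15, qty 3")

# Scoped flag: only the parenthesized part is case insensitive.
scoped_ok = re.fullmatch(r"(?i:abc)def", "ABCdef") is not None
scoped_no = re.fullmatch(r"(?i:abc)def", "ABCDEF") is not None

print(greedy, reluctant, most, fewest, groups)
print(before_px, after_dollar, scoped_ok, scoped_no)
```

The look-around examples return only the digits: the `px` suffix and the `$` prefix are checked but never become part of the match.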

## Set Expressions (Character Classes)

| Example | Description |
| --- | --- |
| `[abc]` | Match any of the characters a, b or c. |
| `[^abc]` | Negation: match any character except a, b or c. |
| `[A-M]` | Range: match any character from A to M. The characters to include are determined by Unicode code point ordering. |
| `[\u0000-\U0010ffff]` | Range: match all characters. |
| `[\p{L}]` `[\p{Letter}]` `[\p{General_Category=Letter}]` | Characters with Unicode Category = Letter. All forms shown are equivalent. |
| `[\P{Letter}]` | Negated property (upper case P). Match everything except Letters. |
| `[\p{numeric_value=9}]` | Match all numbers with a numeric value of 9. Any Unicode property may be used in set expressions. |
| `[\p{Letter}&&\p{script=cyrillic}]` | Logical AND or intersection. Match the set of all Cyrillic letters. |
| `[\p{Letter}--\p{script=latin}]` | Subtraction. Match all non-Latin letters. |
| `[[a-z][A-Z][0-9]]` `[a-zA-Z0-9]` | Implicit logical OR or union of sets. The examples match ASCII letters and digits. The two forms are equivalent. |
| `[:script=Greek:]` | Alternate POSIX-like syntax for properties. Equivalent to `\p{script=Greek}`. |
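The basic set forms can be checked with a small sketch. It again uses Python's `re` module as a stand-in rather than ICU. Python's `re` does not support Unicode property syntax, the `&&` and `--` set operators, or nested sets, so only literal sets, negation, ranges, and the flat union form are shown. The sample strings are illustrative.

```python
import re

# Literal set and its negation.
in_set = re.findall(r"[abc]", "cabbage")
not_in_set = re.findall(r"[^abc]", "cabbage")

# Range: membership follows code point order, so O (after M) is excluded.
a_to_m = re.findall(r"[A-M]", "HELLO")

# Code point ordering also means [A-z] contains the punctuation that
# sits between 'Z' and 'a', such as the underscore.
underscore_in_range = re.fullmatch(r"[A-z]", "_") is not None

# Ranges are not limited to ASCII: Greek lambda lies between alpha and omega.
greek = re.fullmatch(r"[α-ω]", "λ") is not None

# Union written as a single flat set.
alnum = re.fullmatch(r"[a-zA-Z0-9]+", "Abc123") is not None

print(in_set, not_in_set, a_to_m, underscore_in_range, greek, alnum)
```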

## Case Insensitive Matching

Case insensitive matching is specified by the UREGEX_CASE_INSENSITIVE flag during pattern compilation, or by the (?i) flag within a pattern itself. Unicode case insensitive matching is complicated by the fact that changing the case of a string may change its length. See http://www.unicode.org/faq/casemap_charprop.html for more information on Unicode casing operations.

Full case-insensitive matching handles situations where the number of characters in equal strings may differ. “fußball” compares equal to “FUSSBALL”, for example.

Simple case insensitive matching operates one character at a time on the strings being compared. “fußball” does not compare equal to “FUSSBALL”.
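The difference between the two comparisons can be sketched in a few lines of Python. The function names `full_fold_equal` and `simple_fold_equal` are hypothetical helpers, not ICU APIs. `str.casefold()` applies full case folding; Python has no built-in simple case folding, so the second function only approximates it by insisting on a one-to-one pairing of characters.

```python
def full_fold_equal(a: str, b: str) -> bool:
    """Compare after full case folding; lengths may differ before folding."""
    return a.casefold() == b.casefold()


def simple_fold_equal(a: str, b: str) -> bool:
    """Approximate simple matching: compare one character at a time.

    Strings of different lengths can never be equal under this rule.
    """
    if len(a) != len(b):
        return False
    return all(x.casefold() == y.casefold() for x, y in zip(a, b))


print(full_fold_equal("fußball", "FUSSBALL"))    # full folding: ß folds to ss
print(simple_fold_equal("fußball", "FUSSBALL"))  # 7 vs. 8 characters
print(simple_fold_equal("Fussball", "FUSSBALL"))
```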

For ICU regular expression matching:

• Anything from a regular expression pattern that looks like a literal string (even of one character) will be matched against the text using full case folding. The pattern string and the matched text may be of different lengths.
• Any sequence that is composed by the matching engine from originally separate parts of the pattern will not match with the composition boundary within a case folding expansion of the text being matched.
• Matching of [set expressions] uses simple matching. A [set] will match exactly one code point from the text.

Examples:

• pattern “fussball” will match “fußball” or “fussball”.
• pattern “fu(s)(s)ball” or “fus{2}ball” will match “fussball” or “FUSSBALL” but not “fußball”.
• pattern “ß” will find occurrences of “ss” or “ß”.
• pattern “s+” will not find “ß”.

With these rules, a match or capturing sub-match can never begin or end in the interior of an input text character that expanded when case folded.
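The boundary rule can be illustrated for the literal-string case with a minimal sketch. This is not how ICU implements matching; `find_literal_full_fold` is a hypothetical helper that folds the text one character at a time, remembers where each original character starts in the folded text, and rejects any hit that begins or ends inside an expansion.

```python
def find_literal_full_fold(literal: str, text: str) -> list[tuple[int, int]]:
    """Return (start, end) index pairs in `text` where `literal` matches
    under full case folding, aligned to original character boundaries."""
    needle = literal.casefold()

    # Map each offset in the folded text that begins an original character
    # (or ends the text) back to an index in the original text.
    offset_to_index = {}
    folded_parts = []
    offset = 0
    for index, ch in enumerate(text):
        offset_to_index[offset] = index
        piece = ch.casefold()
        folded_parts.append(piece)
        offset += len(piece)
    offset_to_index[offset] = len(text)
    haystack = "".join(folded_parts)

    spans = []
    start = haystack.find(needle)
    while start != -1:
        end = start + len(needle)
        # Keep the hit only if both ends fall on original character boundaries.
        if start in offset_to_index and end in offset_to_index:
            spans.append((offset_to_index[start], offset_to_index[end]))
        start = haystack.find(needle, start + 1)
    return spans


print(find_literal_full_fold("fussball", "fußball"))
print(find_literal_full_fold("ß", "Fass"))
print(find_literal_full_fold("s", "ß"))
```

In the last call the folded text is “ss”, but every candidate hit for “s” would start or end in the middle of the expanded “ß”, so nothing is reported.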