List of Regular Expressions

Term

Representation/Use

Any character

The given character, unless it is a regular expression meta character. The list of meta characters follows in this table.

.

Any single character except a line break or a paragraph break. For example, the search term “sh.rt” matches both “shirt” and “short”.

^

The beginning of a paragraph or cell. Special objects such as empty fields or character-anchored frames at the beginning of a paragraph are ignored. Example: “^Peter” matches the word “Peter” only when it is the first word of a paragraph.

$

The end of a paragraph or cell. Special objects such as empty fields or character-anchored frames at the end of a paragraph are ignored. Example: “Peter$” matches only when the word “Peter” is the last word of a paragraph. Note that “Peter” cannot be followed by a period.

$ on its own matches the end of a paragraph. This way it is possible to search and replace paragraph breaks.

*

Zero or more of the regular expression term immediately preceding it. For example, “Ab*c” matches “Ac”, “Abc”, “Abbc”, “Abbbc”, and so on.

+

One or more of the regular expression term immediately preceding it. For example, “AX.+4” finds “AXx4”, but not “AX4”.

The longest possible string that matches this regular expression in a paragraph is always matched. If the paragraph contains the string “AX 4 AX4”, the entire passage is highlighted.
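
The greedy behaviour described above can be reproduced outside the Find & Replace dialog. The following is a minimal sketch that assumes Python's standard `re` module as a stand-in; it is a different engine from the one behind the dialog, but it treats “.” and “+” the same way for this pattern.

```python
import re

# Python's re module stands in for the Find box here (an assumption for
# illustration; it is not the engine used by the dialog itself).
pattern = re.compile(r"AX.+4")

# ".+" needs at least one character between "AX" and "4".
short_match = pattern.search("AX4")

# "+" is greedy: the match extends to the last "4" in the paragraph.
greedy_match = pattern.search("AX 4 AX4")

print(short_match)            # None
print(greedy_match.group())   # AX 4 AX4
```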

?

Zero or one of the regular expression term immediately preceding it. For example, “Texts?” matches “Text” and “Texts” and “x(ab|c)?y” finds “xy”, “xaby”, or “xcy”.

\

The special character that follows it is interpreted as a normal character and not as a regular expression meta character (except for the combinations “\n”, “\t”, “\b”, “\>” and “\<”). For example, “tree\.” matches “tree.”, not “treed” or “trees”.

\n

In the Find box, a line break that was inserted with the Shift+Enter key combination.

In the Replace box in Writer, a paragraph break, as entered with the Enter or Return key. It has no special meaning in Calc and is treated literally there.

To change line breaks into paragraph breaks, enter \n in both the Find and Replace boxes, and then perform a search and replace.

\t

A tab character. Can also be used in the Replace box.

\b

A word boundary. For example, “\bbook” matches “bookmark” and “book” but not “checkbook”, whereas “book\b” matches “checkbook” and “book” but not “bookmark”.

Note that this form replaces the obsolete forms “\>” (match end of word) and “\<” (match start of word), although they still work for now.
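
The word-boundary behaviour can be checked with a short script. This sketch assumes Python's `re` module, whose “\b” behaves the same way for these plain ASCII words; the word list is made up for illustration.

```python
import re

words = ["book", "bookmark", "checkbook"]

# \b at the start: "book" must begin a word.
starts_with_book = [w for w in words if re.search(r"\bbook", w)]

# \b at the end: "book" must end a word.
ends_with_book = [w for w in words if re.search(r"book\b", w)]

print(starts_with_book)  # ['book', 'bookmark']
print(ends_with_book)    # ['book', 'checkbook']
```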

^$

Finds an empty paragraph.

^.

Finds the first character of a paragraph.

& or $0

Adds the string that was found by the search criteria in the Find box to the term in the Replace box when you make a replacement.

For example, if you enter “window” in the Find box and “&frame” in the Replace box, the word “window” is replaced with “windowframe”.

You can also enter an “&” in the Replace box to modify the Attributes or the Format of the string found by the search criteria.

[…]

Any single occurrence of any one of the characters that are between the brackets. For example: “[abc123]” matches the characters ‘a’, ‘b’, ‘c’, ‘1’, ‘2’ and ‘3’. “[a-e]” matches single occurrences of the characters a through e, inclusive (the range must be specified with the character having the smallest Unicode code number first). “[a-eh-x]” matches any single occurrence of the characters that are in the ranges ‘a’ through ‘e’ and ‘h’ through ‘x’.

[^…]

Any single occurrence of a character, including Tab, Space and Line Break characters, that is not in the list of characters specified. Inclusive ranges are permitted. For example, “[^a-syz]” matches all characters not in the inclusive range ‘a’ through ‘s’ and not equal to ‘y’ or ‘z’.
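
As a quick illustration of bracket sets and their negation, the sketch below again assumes Python's `re` module, which uses the same bracket syntax for these simple cases. The sample strings are invented for the example.

```python
import re

# [a-e] matches one character in the inclusive range a through e.
in_range = re.findall(r"[a-e]", "badge")

# [^a-syz] matches any single character that is NOT a-s, y or z.
not_in_set = re.findall(r"[^a-syz]", "stuvwxyz")

# A negated set also matches whitespace such as Tab and Space.
whitespace_hits = re.findall(r"[^a-syz]", "a\tb c")

print(in_range)         # ['b', 'a', 'd', 'e']
print(not_in_set)       # ['t', 'u', 'v', 'w', 'x']
print(whitespace_hits)  # ['\t', ' ']
```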

\uXXXX

The character represented by the four-digit hexadecimal Unicode code (XXXX).

\UXXXXXXXX

The character represented by the eight-digit hexadecimal Unicode code (XXXXXXXX).

For certain symbol fonts, the symbol (glyph) that you see on screen may appear to correspond to a different Unicode code than the one actually used for it in the font. The Unicode codes can be viewed by choosing Insert – Special Character, or by using the Unicode conversion shortcut.

|

The infix operator delimiting alternatives. Matches the term preceding the “|” or the term following the “|”. For example, “this|that” matches occurrences of both “this” and “that”.

{N}

The post-fix repetition operator that specifies an exact number of occurrences (“N”) of the regular expression term immediately preceding it must be present for a match to occur. For example, “tre{2}” matches “tree”.

{N,M}

The post-fix repetition operator that specifies a range (minimum of “N” to a maximum of “M”) of occurrences of the regular expression term immediately preceding it that can be present for a match to occur. For example, “tre{1,2}” matches “tre” and “tree”.

{N,}

The post-fix repetition operator that specifies a range (minimum “N” to an unspecified maximum) of occurrences of the regular expression term immediately preceding it that can be present for a match to occur. (The maximum number of occurrences is limited only by the size of the document). For example, “tre{2,}” matches “tree”, “treee”, and “treeeee”.
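
The three repetition forms can be compared side by side. This is a minimal sketch assuming Python's `re` module; `whole_matches` is a hypothetical helper that tests whether an entire candidate string matches, whereas the Find box also finds matches inside longer text.

```python
import re

candidates = ["tr", "tre", "tree", "treee"]

def whole_matches(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

exactly_two = whole_matches(r"tre{2}")    # exactly two "e"
one_or_two = whole_matches(r"tre{1,2}")   # one or two "e"
two_or_more = whole_matches(r"tre{2,}")   # at least two "e"

print(exactly_two)   # ['tree']
print(one_or_two)    # ['tre', 'tree']
print(two_or_more)   # ['tree', 'treee']
```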

(…)

The grouping construct that serves three purposes.

  1. To enclose a set of ‘|’ alternatives. For example, the regular expression “b(oo|ac)k” matches both “book” and “back”.

  2. To group terms in a complex expression to be operated on by the post-fix operators “*”, “+” and “?”, along with the post-fix repetition operators. For example, the regular expression “a(bc)?d” matches both “ad” and “abcd” in a search; the regular expression “M(iss){2}ippi” matches “Mississippi”.

  3. To record the matched sub string inside the parentheses as a reference for later use in the Find box using the “\n” construct or in the Replace box using the “$n” construct, where n is the number of the group. The reference to the first match is represented by “\1” in the Find box and by “$1” in the Replace box. The reference to the second matched sub string is represented by “\2” and “$2” respectively, and so on.

For example, the regular expression “(890)7\1\1” matches “8907890890”.

With the regular expression “\b(fruit|truth)\b” in the Find box and the regular expression “$1ful” in the Replace box, occurrences of the words “fruit” and “truth” can be replaced with the words “fruitful” and “truthful” respectively without affecting the words “fruitfully” and “truthfully”.
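
Both uses of a recorded group can be tried in a few lines. The sketch assumes Python's `re` module; note that Python writes the replacement reference as “\1”, where the Replace box uses “$1”. The sample sentence is made up for illustration.

```python
import re

# A back-reference in the search pattern: \1 repeats whatever group 1 matched.
repeated = re.fullmatch(r"(890)7\1\1", "8907890890")

text = "A fruit of truth, spoken fruitfully and truthfully."

# Python uses \1 in the replacement where the Replace box uses $1.
replaced = re.sub(r"\b(fruit|truth)\b", r"\1ful", text)

print(repeated is not None)  # True
print(replaced)  # A fruitful of truthful, spoken fruitfully and truthfully.
```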

[:alpha:]

Represents an alphabetic character. Use [:alpha:]+ to find one or more of them.

[:digit:]

Represents a decimal digit. Use [:digit:]+ to find one or more of them.

[:alnum:]

Represents an alphanumeric character ([:alpha:] and [:digit:]).

[:space:]

Represents a space character (but not other whitespace characters).

[:print:]

Represents a printable character.

[:cntrl:]

Represents a nonprinting character.

[:lower:]

Represents a lowercase character if Match case is selected in Options.

[:upper:]

Represents an uppercase character if Match case is selected in Options.

 

For a full list of supported metacharacters and syntax, see the ICU Regular Expressions documentation.

Note that currently all named character class terms, [:alpha:] through [:upper:], must be enclosed in parentheses when used in a regular expression; see the examples that follow.

Regular expression terms can be combined to form complex and sophisticated regular expressions for searches, as shown in the following examples.

Examples

Expression

Meaning

^$

An empty paragraph.

^ specifies that the match must be at the start of a paragraph,

$ specifies that a paragraph mark or the end of a cell must follow the matched string.

^.

The first character of a paragraph.

^ specifies that the match must be at the start of a paragraph,

. specifies any single character.

e([:digit:])?

Matches “e” by itself or an “e” followed by one digit.

e specifies the character “e”,

[:digit:] specifies any decimal digit,

? specifies zero or one occurrences of [:digit:].

^([:digit:])$

Matches a paragraph or cell containing exactly one digit.

^ specifies that the match must be at the start of a paragraph,

[:digit:] specifies any decimal digit,

$ specifies that a paragraph mark or the end of a cell must follow the matched string.

^[:digit:]{3}$

Matches a paragraph or cell containing only a three-digit number.

^ specifies that the match must be at the start of a paragraph,

[:digit:] specifies any decimal digit,

{3} specifies that [:digit:] must occur three times,

$ specifies that a paragraph mark or the end of a cell must follow the matched string.

\bconst(itu|ruc)tion\b

Matches the words “constitution” and “construction” but not the word “constitutional.”

\b specifies that the match must begin at a word boundary,

const specifies the characters “const”,

( starts the group,

itu specifies the characters “itu”,

| specifies the alternative,

ruc specifies the characters “ruc”,

) ends the group,

tion specifies the characters “tion”,

\b specifies that the match must end at a word boundary.
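
The expressions in this table can be approximated in a script. The sketch below assumes Python's `re` module, which has no [:digit:] named class, so [0-9] is substituted for it; each string in `paragraphs` stands for one paragraph or cell, and all sample data is invented.

```python
import re

paragraphs = ["7", "123", "1234", "e5"]

# [0-9] replaces [:digit:], which Python's re does not support.
single_digit = [p for p in paragraphs if re.search(r"^([0-9])$", p)]
three_digits = [p for p in paragraphs if re.search(r"^[0-9]{3}$", p)]

# "e" alone, or "e" followed by one digit.
e_matches = [m.group() for m in re.finditer(r"e([0-9])?", "e5 e x")]

# A non-capturing group (?:...) makes findall return the whole words.
words = re.findall(r"\bconst(?:itu|ruc)tion\b",
                   "constitution, construction, constitutional")

print(single_digit)  # ['7']
print(three_digits)  # ['123']
print(e_matches)     # ['e5', 'e']
print(words)         # ['constitution', 'construction']
```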

Regular Expression Operators

Operator Description
| Alternation. A|B matches either A or B.
* Match 0 or more times. Match as many times as possible.
+ Match 1 or more times. Match as many times as possible.
? Match zero or one times. Prefer one.
{n} Match exactly n times
{n,} Match at least n times. Match as many times as possible.
{n,m} Match between n and m times. Match as many times as possible, but not more than m.
*? Match 0 or more times. Match as few times as possible.
+? Match 1 or more times. Match as few times as possible.
?? Match zero or one times. Prefer zero.
{n}? Match exactly n times.
{n,}? Match at least n times, but no more than required for an overall pattern match.
{n,m}? Match between n and m times. Match as few times as possible, but not less than n.
*+ Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match).
++ Match 1 or more times. Possessive match.
?+ Match zero or one times. Possessive match.
{n}+ Match exactly n times.
{n,}+ Match at least n times. Possessive Match.
{n,m}+ Match between n and m times. Possessive Match.
( ...) Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match.
(?: ...) Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses.
(?> ...) Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the “(?>”.
(?# ...) Free-format comment (?# comment ).
(?= ...) Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position.
(?! ...) Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position.
(?<= ...) Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.)
(?<! ...) Negative Look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.)
(?<name>...) Named capture group. The angle brackets are literal – they appear in the pattern.
(?ismwx-ismwx:...) Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled.
(?ismwx-ismwx) Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match.
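
A few of the operators above are easier to follow with concrete input. This is a sketch assuming Python's `re` module rather than ICU itself; the two engines agree on the operators shown, except that Python spells a named group as (?P<name>...) instead of (?<name>...). All sample strings are invented.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy "+" takes as much as possible; lazy "+?" takes as little as possible.
greedy = re.findall(r"<.+>", text)
lazy = re.findall(r"<.+?>", text)

# Look-ahead: digits only when followed by " EUR" (which is not consumed).
euro_amounts = re.findall(r"\d+(?= EUR)", "10 EUR, 20 USD, 30 EUR")

# Look-behind: digits only when preceded by a dollar sign.
dollar_amounts = re.findall(r"(?<=\$)\d+", "cost $15 or 20")

# Named groups (Python syntax) and the inline case-insensitive flag.
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "due 2024-05")
insensitive = re.search(r"(?i)peter", "PETER")

print(greedy)              # ['<b>bold</b> and <i>italic</i>']
print(lazy)                # ['<b>', '</b>', '<i>', '</i>']
print(euro_amounts)        # ['10', '30']
print(dollar_amounts)      # ['15']
print(date.group("year"))  # 2024
print(insensitive.group()) # PETER
```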

Set Expressions (Character Classes)

Example Description
[abc] Match any of the characters a, b or c.
[^abc] Negation – match any character except a, b or c.
[A-M] Range – match any character from A to M. The characters to include are determined by Unicode code point ordering.
[\u0000-\U0010ffff] Range – match all characters.
[\p{L}] [\p{Letter}] [\p{General_Category=Letter}] Characters with Unicode Category = Letter. All forms shown are equivalent.
[\P{Letter}] Negated property. (Upper case P) Match everything except Letters.
[\p{numeric_value=9}] Match all numbers with a numeric value of 9. Any Unicode Property may be used in set expressions.
[\p{Letter}&&\p{script=cyrillic}] Logical AND or intersection. Match the set of all Cyrillic letters.
[\p{Letter}--\p{script=latin}] Subtraction. Match all non-Latin letters.
[[a-z][A-Z][0-9]] [a-zA-Z0-9] Implicit Logical OR or Union of Sets. The examples match ASCII letters and digits. The two forms are equivalent.
[:script=Greek:] Alternate POSIX-like syntax for properties. Equivalent to \p{script=Greek}.
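
The intersection and subtraction rows can be illustrated without a regex engine. Python's `re` module supports neither \p{...} properties nor the && and -- set operators, so the sketch below approximates them with the standard `unicodedata` module. The helpers `is_letter` and `has_script` are hypothetical, and testing the character name is only a rough stand-in for the real script property.

```python
import unicodedata

def is_letter(ch):
    # General categories Lu, Ll, Lt, Lm and Lo together form "Letter".
    return unicodedata.category(ch).startswith("L")

def has_script(ch, script):
    # Rough approximation: unicodedata does not expose the script property,
    # so the official character name is inspected instead.
    return unicodedata.name(ch, "").startswith(script)

sample = "aД9λ-b"

# Intersection: letters that are also Cyrillic.
cyrillic_letters = [c for c in sample
                    if is_letter(c) and has_script(c, "CYRILLIC")]

# Subtraction: letters minus Latin letters.
non_latin_letters = [c for c in sample
                     if is_letter(c) and not has_script(c, "LATIN")]

print(cyrillic_letters)   # ['Д']
print(non_latin_letters)  # ['Д', 'λ']
```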

Case Insensitive Matching

Case insensitive matching is specified by the UREGEX_CASE_INSENSITIVE flag during pattern compilation, or by the (?i) flag within a pattern itself. Unicode case insensitive matching is complicated by the fact that changing the case of a string may change its length. See http://www.unicode.org/faq/casemap_charprop.html for more information on Unicode casing operations.

Full case-insensitive matching handles situations where the number of characters in equal strings may differ. “fußball” compares equal to “FUSSBALL”, for example.

Simple case insensitive matching operates one character at a time on the strings being compared. “fußball” does not compare equal to “FUSSBALL”.

For ICU regular expression matching,

  • Anything from a regular expression pattern that looks like a literal string (even of one character) will be matched against the text using full case folding. The pattern string and the matched text may be of different lengths.
  • Any sequence that is composed by the matching engine from originally separate parts of the pattern will not match with the composition boundary within a case folding expansion of the text being matched.
  • Matching of [set expressions] uses simple matching. A [set] will match exactly one code point from the text.

Examples:

  • pattern “fussball” will match “fußball” or “fussball”.
  • pattern “fu(s)(s)ball” or “fus{2}ball” will match “fussball” or “FUSSBALL” but not “fußball”.
  • pattern “ß” will find occurrences of “ss” or “ß”.
  • pattern “s+” will not find “ß”.

With these rules, a match or capturing sub-match can never begin or end in the interior of an input text character that expanded when case folded.
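
The difference between full and simple case-insensitive comparison can be seen on plain strings. This sketch uses Python's built-in `str.casefold()` for full case folding; `simple_equal` is a hypothetical helper that approximates simple matching by comparing one character at a time. It illustrates the two comparison styles only, not how the ICU matcher is implemented.

```python
def full_equal(a, b):
    # Full case folding may change the length: "ß" folds to "ss".
    return a.casefold() == b.casefold()

def simple_equal(a, b):
    # One character at a time, so the lengths must already agree.
    return len(a) == len(b) and all(
        x.lower() == y.lower() for x, y in zip(a, b)
    )

print(full_equal("fußball", "FUSSBALL"))     # True
print(simple_equal("fußball", "FUSSBALL"))   # False
print(simple_equal("fussball", "FUSSBALL"))  # True
```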
