Regular Expressions

A regular expression (sometimes abbreviated to regex) is a way for a computer user or programmer to express how a computer program should look for a specified pattern in text and then what the program is to do when each pattern match is found. For example, a very simple regular expression could tell a program to search for all text lines that contain the phrase “OS X 10.6” and then to print out each line in which a match is found or substitute another text sequence (for example, just “OS X”) where any match occurs. More complicated regular expressions can be designed to search for just about anything: email addresses, phone numbers, URLs, prices, part numbers, you name it. In fact, entire books have been written about regular expressions and how to use them (more on those in a moment). Regular expressions are an extremely powerful tool that can almost seem to have superhero powers!
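As a rough illustration of the search-and-substitute idea described above, here is a minimal sketch in Python. Python's `re` module is a different engine from the one Panorama uses, so this shows the concept rather than Panorama syntax, and the sample lines are made up for the example.

```python
import re

# Hypothetical sample text; any lines would do.
lines = [
    "Requires OS X 10.6 or later",
    "Tested on Windows 7",
    "OS X 10.6 users should update",
]

# The dot is escaped so it matches a literal period, not "any character".
pattern = re.compile(r"OS X 10\.6")

# Search: keep only the lines that contain the phrase.
matching = [line for line in lines if pattern.search(line)]

# Substitute: replace every match with the shorter text.
replaced = [pattern.sub("OS X", line) for line in matching]

print(matching)
print(replaced)
```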

xkcd 208

Learning About Regular Expressions

Regular expressions are a huge topic. Rather than re-invent the wheel here, we suggest that you refer to the huge library of learning material available both online and in books. A Google search for regular expressions will turn up hundreds of links. Here are some of the most useful we’ve found:

Regular Expressions Info (A very complete step-by-step tutorial)

Wikipedia (Basics and History of regular expressions)

Perl Regex Tutorial (The Perl language syntax is quite different from Panorama, but the actual regex format is nearly identical.)

We also highly recommend these two O’Reilly books:

Mastering Regular Expressions (Jeffrey E.F. Friedl)

Regular Expressions Cookbook (Jan Goyvaerts, Steven Levithan)

There are a number of regular expression software tools available for the Mac. Our favorite is RegExRX, which is available for $5 from the Mac App Store.

This software allows you to see the immediate results of a regular expression as you type. We often develop a regular expression using RegExRX and then transfer it to Panorama.

Regular Expression Syntax

The following tables are a reference to the syntax used by Panorama’s regular expression engine. The first table lists the character expressions that match patterns within a string. The second lists the pattern operators that specify how many times a pattern is matched, along with additional matching restrictions. A later table lists the flags that can be included in a pattern to control search behavior, such as matching over multiple lines.

Note: Panorama actually uses the regular expression engine supplied by Apple with OS X and iOS. This engine is part of ICU (International Components for Unicode), which was developed by IBM and made available under an open source license.

Regular Expression Metacharacters

This table describes the character sequences used to match characters within a string.

Character Expression	Description
`\a`	Match a `BELL`, `\u0007`
`\A`	Match at the beginning of the input. Differs from `^` in that `\A` will not match after a new line within the input.
`\b` (outside of a [Set])	Match if the current position is a word boundary. Boundaries occur at the transitions between word (`\w`) and non-word (`\W`) characters, with combining marks ignored.
`\b` (within a [Set])	Match a `BACKSPACE`, `\u0008`.
`\B`	Match if the current position is not a word boundary.
`\cX`	Match a control-X character
`\d`	Match any character with the Unicode General Category of Nd (Number, Decimal Digit.)
`\D`	Match any character that is not a decimal digit.
`\e`	Match an `ESCAPE`, `\u001B`.
`\E`	Terminates a `\Q ... \E` quoted sequence.
`\f`	Match a `FORM FEED`, `\u000C`.
`\G`	Match if the current position is at the end of the previous match.
`\n`	Match a `LINE FEED`, `\u000A`.
`\N{UNICODE CHARACTER NAME}`	Match the named character.
`\p{UNICODE PROPERTY NAME}`	Match any character with the specified Unicode Property.
`\P{UNICODE PROPERTY NAME}`	Match any character not having the specified Unicode Property.
`\Q`	Quotes all following characters until `\E`.
`\r`	Match a `CARRIAGE RETURN`, `\u000D`.
`\s`	Match a white space character. White space is defined as `[\t\n\f\r\p{Z}]`.
`\S`	Match a non-white space character.
`\t`	Match a `HORIZONTAL TABULATION`, `\u0009`.
`\uhhhh`	Match the character with the hex value `hhhh`.
`\Uhhhhhhhh`	Match the character with the hex value `hhhhhhhh`. Exactly eight hex digits must be provided, even though the largest Unicode code point is `\U0010ffff`.
`\w`	Match a word character. Word characters are `[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]`.
`\W`	Match a non-word character.
`\x{hhhh}`	Match the character with hex value `hhhh`. From one to six hex digits may be supplied.
`\xhh`	Match the character with two digit hex value `hh`.
`\X`	Match a Grapheme Cluster.
`\Z`	Match if the current position is at the end of input, but before the final line terminator, if one exists.
`\z`	Match if the current position is at the end of input.
`\n`	Back Reference. Match whatever the nth capturing group matched. Here n is a number, and it must be ≥ 1 and ≤ the total number of capture groups in the pattern.
`\0ooo`	Match an Octal character. `ooo` is from one to three octal digits. `0377` is the largest allowed Octal character. The leading zero is required; it distinguishes Octal constants from back references.
`[pattern]`	Match any one character from the pattern.
`.`	Match any character.
`^`	Match at the beginning of a line.
`$`	Match at the end of a line.
`\`	Quotes the following character. Characters that must be quoted to be treated as literals are `* ? + [ ( ) { } ^ $ | \ . /`
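A few of the entries above can be tried out in a short sketch. It uses Python's `re` module, which is not the ICU engine that Panorama uses; only metacharacters that behave the same way in both (`\d`, `\b`, `\w`, `\A`, `.`, `[pattern]` and backslash quoting) are shown, and the sample text is invented.

```python
import re

text = "Order 66 shipped on 2024-01-15 to cat@example.com; catalog updated."

# \d matches a decimal digit; {4} and {2} repeat it an exact number of times.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# \b is a word boundary, so the "cat" inside "catalog" is not matched.
whole_word = re.findall(r"\bcat\b", text)

# \w* extends the match over any following word characters.
cat_words = re.findall(r"\bcat\w*", text)

# \A anchors the match at the very beginning of the input.
starts_with_order = bool(re.search(r"\AOrder", text))

# An unquoted "." matches any character; "\." matches only a literal period.
any_char = re.findall(r"1.5", "1.5 and 125")
literal_dot = re.findall(r"1\.5", "1.5 and 125")

# [pattern] matches any one character from the set.
vowels = re.findall(r"[aeiou]", "regex")

print(dates, whole_word, cat_words, starts_with_order)
print(any_char, literal_dot, vowels)
```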

Regular Expression Operators

This table defines the regular expression operators.

Operator	Description
`|`	Alternation. `A|B` matches either `A` or `B`.
`*`	Match 0 or more times. Match as many times as possible.
`+`	Match 1 or more times. Match as many times as possible.
`?`	Match zero or one times. Prefer one.
`{n}`	Match exactly n times.
`{n,}`	Match at least n times. Match as many times as possible.
`{n,m}`	Match between n and m times. Match as many times as possible, but not more than m.
`*?`	Match 0 or more times. Match as few times as possible.
`+?`	Match 1 or more times. Match as few times as possible.
`??`	Match zero or one times. Prefer zero.
`{n}?`	Match exactly n times.
`{n,}?`	Match at least n times, but no more than required for an overall pattern match.
`{n,m}?`	Match between n and m times. Match as few times as possible, but not less than n.
`*+`	Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match).
`++`	Match 1 or more times. Possessive match.
`?+`	Match zero or one times. Possessive match.
`{n}+`	Match exactly n times.
`{n,}+`	Match at least n times. Possessive Match.
`{n,m}+`	Match between n and m times. Possessive Match.
`(...)`	Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match.
`(?:...)`	Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses.
`(?>...)`	Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the `"(?>"`.
`(?# ... )`	Free-format comment `(?# comment)`.
`(?= ... )`	Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position.
`(?! ... )`	Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position.
`(?<= ... )`	Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no `*` or `+` operators.)
`(?<! ... )`	Negative Look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no `*` or `+` operators.)
`(?ismwx-ismwx: ... )`	Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled. The flags are defined in Flag Options (see below).
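The difference between "as many times as possible" and "as few times as possible", and the way look-around assertions test text without consuming it, may be easier to see in a small sketch. It uses Python's `re` module as a stand-in, which is an assumption: it is not Panorama's engine, and possessive quantifiers and atomic groups are left out because older Python versions do not support them. The sample strings are invented.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as possible, so a single match runs from the
# first "<" to the last ">".
greedy = re.findall(r"<.*>", html)

# Reluctant: .*? takes as little as possible, so each tag matches on its own.
lazy = re.findall(r"<.*?>", html)

# Non-capturing (?:...) groups the alternation without creating a capture
# group, so the name is capture group 1.
name = re.search(r"(?:Mr|Ms)\. (\w+)", "Dear Ms. Lovelace").group(1)

# Look-ahead: word characters that are followed by "@"; the "@" is not consumed.
users = re.findall(r"\w+(?=@)", "mail alice@example.com or bob@example.org")

prices = "apples $3, pears $12, plums 7"

# Look-behind: digits preceded by "$"; the "$" is not part of the match.
dollars = re.findall(r"(?<=\$)\d+", prices)

# Negative look-behind: digits not preceded by "$" or by another digit.
plain = re.findall(r"(?<![$\d])\d+", prices)

print(greedy, lazy, name, users, dollars, plain)
```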

Template Matching Format

The regexreplace( function matches a regular expression and replaces any found matches with a template. The template may contain special characters as described in the table below.

Character	Description
`$n`	The text of capture group n will be substituted for `$n`. n must be greater than or equal to 0 and not greater than the number of capture groups. A `$` not followed by a digit has no special meaning, and will appear in the substitution text as itself, a `$`.
`\`	Treat the following character as a literal, suppressing any special meaning. Backslash escaping in substitution text is only required for `'$'` and `'\'`, but may be used on any other character without bad effects.

The replacement string is treated as a template, with $0 being replaced by the contents of the matched range, $1 by the contents of the first capture group, and so on. Additional digits beyond the maximum required to represent the number of capture groups will be treated as ordinary characters, as will a $ not followed by digits. Backslash will escape both $ and \.
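The template rules above can be sketched as a small function. This is an illustration written in Python, not Panorama's implementation: `expand_template` and `regex_replace_sketch` are hypothetical names, Python's `re` module stands in for the real engine, and raising an error for an out-of-range group number is an assumption of the sketch.

```python
import re

DIGITS = "0123456789"

def expand_template(template: str, match: re.Match) -> str:
    """Expand a $n-style template for one match (illustrative sketch)."""
    group_count = match.re.groups
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\" and i + 1 < len(template):
            # Backslash: copy the next character literally.
            out.append(template[i + 1])
            i += 2
        elif ch == "$" and i + 1 < len(template) and template[i + 1] in DIGITS:
            # Take the first digit, then keep taking digits only while the
            # number still refers to an existing capture group.
            n = int(template[i + 1])
            i += 2
            while i < len(template) and template[i] in DIGITS:
                candidate = n * 10 + int(template[i])
                if candidate > group_count:
                    break  # extra digits are ordinary characters
                n = candidate
                i += 1
            # Assumption: an out-of-range n raises IndexError here.
            out.append(match.group(n) or "")
        else:
            # Anything else, including a "$" not followed by a digit.
            out.append(ch)
            i += 1
    return "".join(out)

def regex_replace_sketch(text: str, pattern: str, template: str) -> str:
    """Hypothetical helper: replace every match using the template."""
    return re.sub(pattern, lambda m: expand_template(template, m), text)

swapped = regex_replace_sketch("Smith, John", r"(\w+), (\w+)", "$2 $1")
priced = regex_replace_sketch("cost 5", r"\d+", r"\$$0.00")
extra_digit = regex_replace_sketch("ab", r"(a)(b)", "$20")
print(swapped, priced, extra_digit)
```

In the last call the pattern has only two capture groups, so `$20` is read as `$2` followed by a literal `0`.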

Flag Options

The following flags control various aspects of regular expression matching. These flag values may be specified within the pattern using the (?ismx-ismx) pattern options (see above).

Flag (Pattern)	Description
`i`	If set, matching will take place in a case-insensitive manner.
`x`	If set, allow use of white space and `#comments` within patterns
`s`	If set, a `"."` in a pattern will match a line terminator in the input text. By default, it will not. Note that a carriage-return / line-feed pair in text behave as a single line terminator, and will match a single "." in a regular expression pattern.
`m`	Control the behavior of `"^"` and `"$"` in a pattern. By default these will only match at the start and end, respectively, of the input text. If this flag is set, `"^"` and `"$"` will also match at the start and end of each line within the input text.
`w`	Controls the behavior of `\b` in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either word or non-word, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.
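The effect of the `i`, `m`, `s` and `x` flags can be seen in a short sketch. It uses Python's `re` module, which accepts the same inline `(?i)`, `(?m)`, `(?s)` and `(?x)` syntax but is not the engine Panorama uses, so treat the results as illustrative. The Unicode word-boundary flag in the last row of the table has no direct counterpart in `re` and is not shown; the sample text is invented.

```python
import re

text = "First line\nsecond LINE\nthird line"

# i: case-insensitive matching.
any_case = re.findall(r"(?i)line", text)

# Without m, ^ matches only at the start of the input.
first_only = re.findall(r"^\w+", text)

# With m, ^ also matches at the start of each line.
each_line = re.findall(r"(?m)^\w+", text)

# Without s, "." does not match a line terminator, so this search fails.
no_dotall = re.search(r"First.*third", text)

# With s, "." matches line terminators too, so the search spans lines.
dotall = re.search(r"(?s)First.*third", text)

# x: white space and #comments inside the pattern are ignored.
phone = re.compile(r"""(?x)
    (\d{3})   # exchange
    -
    (\d{4})   # line number
""")
parts = phone.search("call 555-1234").groups()

print(any_case, first_only, each_line, no_dotall is None, dotall is not None, parts)
```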

ICU License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, provided that the above copyright notice(s) and this permission notice appear in all copies of the Software and that both the above copyright notice(s) and this permission notice appear in supporting documentation.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder.

All trademarks and registered trademarks mentioned herein are the property of their respective owners.

See Also

regexarray( -- applies a regular expression to a text value, then builds an array containing all of the substrings that match the regular expression (see Regular Expressions).
regexarrayexact( -- applies a regular expression to a text value, then builds an array containing all of the substrings that match the regular expression (see Regular Expressions).
regexliteral( -- adds \ characters to text as necessary so that it can be used as a literal in a regular expression.
regexmatch -- checks to see if the text on the left matches the regular expression on the right (see Regular Expressions).
regexmatchexact -- checks to see if the text on the left matches the regular expression on the right (see Regular Expressions).
regexreplace( -- replaces text with new text. The text to be replaced is determined by a regular expression.
regexreplaceexact( -- replaces text with new text. The text to be replaced is determined by a regular expression.
regexreplacefirst( -- replaces the first occurrence of a regular expression pattern with new text.
regexreplacefirstexact( -- replaces the first occurrence of a regular expression pattern with new text.

History

Version	Status	Notes
10.0	New	Regular expression support is new in this version