Regular Expression Elements

Wildcard Matching

. (dot)

The dot . matches any single character.

Example: b.g matches big, beg, and bag, but not bp or baag.

If you use the "multi-line" version of the regular expression syntax, the dot (.) also matches new-line characters. For example, .* can then match the whole file.
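The dot can be tried out in any regex engine. The sketch below uses Python's re module as a stand-in, which is an assumption and not the tool this document describes. Its re.DOTALL flag is assumed to correspond roughly to the "multi-line" syntax mentioned above.

```python
import re

# Python's `re` is only a stand-in here; the dot behaves the same way.
words = ["big", "beg", "bag", "bp", "baag"]
dot_matches = [w for w in words if re.fullmatch(r"b.g", w)]

text = "first line\nsecond line"
# Without a flag, the dot stops at the first new-line.
single_line = re.match(r".*", text).group(0)
# With re.DOTALL, the dot also matches new-lines, so .* spans everything.
multi_line = re.match(r".*", text, re.DOTALL).group(0)

print(dot_matches)
print(repr(single_line))
print(repr(multi_line))
```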

Matching the Beginning or End of a Line

^ and $

The caret ^ matches the beginning of a line when the caret appears as the first character in the search pattern.

Example: ^Hello matches only if Hello appears at the beginning of a line.

The dollar sign $ matches the end of a line.

Example: TRUE$ matches only if TRUE appears at the very end of a line.
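As a rough illustration of the two anchors, here is a small Python sketch. Python is an assumption, not the syntax documented here, and it needs the re.MULTILINE flag before ^ and $ anchor at every line instead of only at the ends of the whole string.

```python
import re

text = "Hello world\nsay Hello\nflag = TRUE\nTRUE story"

# re.MULTILINE makes ^ and $ apply to each line of the text.
# Only the Hello at the start of the first line qualifies.
starts = [m.start() for m in re.finditer(r"^Hello", text, re.MULTILINE)]
# Only the TRUE at the end of the third line qualifies.
ends = re.findall(r"TRUE$", text, re.MULTILINE)

print(starts)
print(ends)
```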

Matching a Tab or Space

\t
\s
\w

\t matches a single tab character.

Example: \tint abc; matches a tab character followed by int abc;.

\s matches a single space character.

Example: \sif matches a space character followed by if.

\w matches a single white space character. In other words, \w matches either a tab or space character.

Example: \wwhile matches either a tab or space character, followed by while.
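These three elements do not carry over unchanged to every regex dialect. In Python's re module, used here only as an assumed point of comparison, \s matches any whitespace character and \w matches a word character. The sketch therefore spells out approximate equivalents by hand, and the constant names are hypothetical.

```python
import re

# Assumed Python equivalents of the elements described above.
TAB = r"\t"              # a single tab, spelled the same way
SPACE = r" "             # a single space (Python's \s is broader)
TAB_OR_SPACE = r"[ \t]"  # tab or space (Python's \w means "word character")

tab_hit = re.search(TAB + r"int abc;", "\tint abc;") is not None
space_hit = re.search(SPACE + r"if", "else if (x)") is not None
ws_hits = [s for s in ["\twhile", " while", "xwhile"]
           if re.fullmatch(TAB_OR_SPACE + r"while", s)]

print(tab_hit, space_hit, ws_hits)
```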

Matching 0, 1, or More Occurrences

* and +

* matches zero or more occurrences of the preceding character. The fewest possible occurrences of a pattern will satisfy the match.

Example: a*b will match b, ab, aab, aaab, aaaab, and so on.

+ matches one or more occurrences of the preceding character.

Example: a+b will match ab, aab, aaab, aaaab, and so on, but not just b.
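The difference between * and + can be checked with a short Python sketch (Python is an assumption, chosen only for illustration). It uses re.fullmatch, which tests the entire string, so the result does not depend on whether the engine prefers the fewest or the most occurrences.

```python
import re

candidates = ["b", "ab", "aab", "aaab", "a", "ba"]

# a*b: zero or more a characters, then b.
star = [s for s in candidates if re.fullmatch(r"a*b", s)]
# a+b: at least one a, then b, so a lone "b" is rejected.
plus = [s for s in candidates if re.fullmatch(r"a+b", s)]

print(star)
print(plus)
```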

Matching Any in a Set of Characters

[ .. ]

When a list of characters is enclosed in square brackets [..], any single character in that set will be matched.

Example: [abc] matches a, b, or c, but not d.

When a caret ^ appears at the beginning of the set, the match succeeds only if the character is not in the set.

Example: [^abc] matches d, e, or f, but not a, b, or c.

Sets can conveniently be described with a range. A range is specified by two characters separated by a dash, such as [a-z]. The beginning character must have a lower ASCII value than the ending character.

Example: [a-z] matches any character in the range a through z, but not A or 1 or 2.

Sets can contain multiple ranges.

Example 1: [a-zA-Z] matches any alphabetic character.

Example 2: [^a-zA-Z0-9] matches any non-alphanumeric character.
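The set examples above can be tried character by character. This is a minimal sketch in Python, which is assumed to treat sets, negated sets, and ranges the same way as described here.

```python
import re

# Plain set: only a and b from this sample are in [abc].
in_set = [c for c in "abd" if re.fullmatch(r"[abc]", c)]
# Negated set: everything except a, b, and c.
not_in_set = [c for c in "abcdef" if re.fullmatch(r"[^abc]", c)]
# Range: lowercase letters only.
lower = [c for c in "azA12" if re.fullmatch(r"[a-z]", c)]
# Multiple ranges, negated: the non-alphanumeric characters.
non_alnum = re.findall(r"[^a-zA-Z0-9]", "a-b_c!")

print(in_set, not_in_set, lower, non_alnum)
```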

Matching a Line Break

\n

\n matches a new-line, or line break. Use it when you want to match an end-of-line within a larger pattern.

Example: dog\ncat matches dog, followed by a line break, followed by cat.
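To show a line break matched inside a larger pattern, here is a short Python sketch (Python is assumed only for illustration). The sample text is made up for the example.

```python
import re

text = "hot dog\ncat nap\ndog cat"

# Only the "dog" that ends line 1, followed by the "cat" that starts
# line 2, matches. "dog cat" on line 3 has a space, not a line break.
match = re.search(r"dog\ncat", text)
span = match.span()
count = len(re.findall(r"dog\ncat", text))

print(span, count)
```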

Regular Expression Groups

\( and \)

Parts of a regular expression can be isolated by enclosing them with \( and \), thereby forming a group. Groups are useful for extracting part of a match to be used in a replacement pattern. Each group in a pattern is assigned a number, starting with 1, from left to right.

Example: abc\(xyz\) matches abcxyz. xyz is considered group #1.

This is not all that useful unless we are using the Replace command. The replace string can contain group references in the form \<number>. Each time a group reference is encountered in the replacement pattern, it means "substitute the group value from the matched pattern".

Example 1: replace \(abc\)\(xyz\) with \2\1. This replaces the matched string abcxyz with the contents of group #2 (xyz), followed by the contents of group #1 (abc). So abcxyz is replaced with xyzabc. This is still not too amazing. See the next example.

Example 2: replace \(\w+\)\(.*\)ing with \1\2ed. This replaces words ending in ing with the same word ending in ed. Your English teacher would not be too happy.
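The two replacements can be reproduced in Python as a rough sketch. Two assumptions apply: Python writes groups with plain parentheses, and [ \t] stands in for this document's \w (tab or space), because Python's own \w means something else. The \1 and \2 references in the replacement string work the same way.

```python
import re

# Example 1: swap the two groups.
swapped = re.sub(r"(abc)(xyz)", r"\2\1", "abcxyz")

# Example 2: group 1 is the leading white space, group 2 is the text
# before "ing"; the replacement puts "ed" where "ing" was.
changed = re.sub(r"([ \t]+)(.*)ing", r"\1\2ed", "keep walking")

print(swapped)
print(changed)
```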

Overriding Regular Expression Characters

\ (backslash)

A backslash character \ preceding a meta-character overrides its special meaning. The backslash itself is not part of the text that is matched.

Example: a\*b matches a*b literally. The * character does not mean "match 0 or more occurrences".
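Escaping can be checked the same way in Python, which is assumed here only as an illustration. The re.escape helper belongs to Python's standard library and is not part of the syntax described in this document; it adds the backslashes for you.

```python
import re

# The escaped star matches a literal "*" and nothing else.
literal_hit = re.fullmatch(r"a\*b", "a*b") is not None
# It no longer means "zero or more a", so "aab" is rejected.
repeat_miss = re.fullmatch(r"a\*b", "aab") is None

# re.escape builds an escaped pattern from plain text.
escaped = re.escape("a*b")
escaped_hit = re.fullmatch(escaped, "a*b") is not None

print(literal_hit, repeat_miss, escaped_hit)
```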