Hazel Docs: Regular Expressions

A regular expression ("RE" or "regex") is a string of characters intended to match some bit of text. While a regex can be the actual text you want to match (e.g. "bat" matches "bat"), the power of regular expressions is their ability to describe what you want to match (e.g. "b[aeiou]t" matches "bat" or "bet" or "bit" and so on).

Atoms

Certain characters within a regular expression will match something other than what they usually represent. As in our wonderful real world, the basic element of a regular expression is an "atom."

The simplest atom is any single character, such as one letter or number.

A dot (`.') matches any single character.

A caret (`^') matches the beginning of a line.

A dollar sign (`$') matches the end of a line.

The caret and dollar sign don't match the particular character at the beginning or end of the line, but only those positions. They're sometimes called "anchors" because they anchor your expression to the beginning or end of a line. For example, if you wanted to find all sentences which begin with "In", you'd use the regex ^In to match "In the beginning, god created a turtle." but not "I am in a cup of soup."
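To see these atoms at work, here is a minimal sketch in Python. Python's re module is not Hazel's engine; it is used here only because it treats the dot, caret, and dollar sign the same way. The matches helper is a made-up name for this example, and the re.IGNORECASE flag is an assumption added to imitate Hazel's case-insensitive matching.

```python
import re

def matches(pattern, text):
    """Return True if pattern matches anywhere in text, ignoring case.

    Hypothetical helper for illustration; not part of Hazel.
    """
    return re.search(pattern, text, re.IGNORECASE) is not None

# A dot stands in for any single character.
print(matches("b.t", "bat"))   # True
print(matches("b.t", "b9t"))   # True
print(matches("b.t", "bt"))    # False (nothing between b and t)

# A caret anchors the match to the beginning of the line.
print(matches("^In", "In the beginning, god created a turtle."))  # True
print(matches("^In", "I am in a cup of soup."))                   # False

# A dollar sign anchors the match to the end of the line.
print(matches("soup$", "I am in a cup of soup"))  # True
print(matches("soup$", "soup of the day"))        # False
```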

A character set is a sequence of characters surrounded by left and right square brackets (`[]'). Any one of the characters inside the set will match. If you'd like some range of letters or numbers to match, you can use a sequence such as [1-6] (matching 1, 2, 3, 4, 5, or 6) or [a-c] (matching a, b, or c).

If a character set begins with a caret (`^'), then the set matches anything except the characters within it. For example, [^aeiou] will match any single character which isn't a vowel. Note that this includes digits, spaces, and punctuation, not just consonants.

To include a `]' in a character set, make it the first character (possibly following a `^'). To include a minus sign (`-'), make it the first or last character. All other characters are literal, including the backslash and the limiters discussed later!
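The rules for character sets can be tried out with a short Python sketch. Again, Python's re module is only a stand-in for Hazel's engine, the found helper is a made-up name, and re.IGNORECASE is an assumption added to imitate Hazel. One difference to keep in mind: Python treats a backslash inside a set as an escape character, so the sketch steers clear of backslashes.

```python
import re

def found(pattern, text):
    """Return every non-overlapping match of pattern in text, ignoring case.

    Hypothetical helper for illustration; not part of Hazel.
    """
    return re.findall(pattern, text, re.IGNORECASE)

# Any one character from the set will match.
print(found("b[aeiou]t", "bat bet bit bot but byt"))
# ['bat', 'bet', 'bit', 'bot', 'but']

# A range inside a set.
print(found("[1-6]", "0123456789"))   # ['1', '2', '3', '4', '5', '6']

# A leading caret negates the set: anything except these characters.
print(found("[^aeiou]", "a1e-z"))     # ['1', '-', 'z']

# A closing bracket placed first in the set is taken literally.
print(found("[]x]", "a]x"))           # [']', 'x']

# A minus sign placed last in the set is taken literally.
print(found("[a-]", "a-b"))           # ['a', '-']
```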

Character Classes

Within a character set, you can specify a character class. A class is written as its name surrounded by colons, with the whole thing enclosed in its own pair of square brackets.

[:blank:] A space or tab.
[:space:] Any whitespace (newlines included.)
[:alpha:] Any letter.
[:digit:] Any digit.
[:alnum:] Any letter or digit.
[:punct:] Anything printable which isn't something above.
[:<:] The beginning of a word.
[:>:] The end of a word.

Note that these are only valid within a character set, which means they must be within that other set of square brackets. For example, if you wanted to match any alphanumeric character, you would use [[:alnum:]] (or [[:alpha:][:digit:]], or eschew the character classes altogether and use [a-z0-9]).

To match anything which isn't whitespace (whitespace includes spaces, tabs, and any newline characters), you could use [^[:space:]].

Word matches are exceptionally useful. For example, let's say you want to find the word "art" in a string. If you used art alone, you'd also match all the tarts and farts. Using [[:<:]]art ensures that you only get words which begin with "art." To match only the whole word and nothing longer, close it off as well: [[:<:]]art[[:>:]].
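Python's re module does not accept the bracketed class names or the [[:<:]] and [[:>:]] word anchors, so the sketch below substitutes rough Python counterparts, and every one of those substitutions is an assumption made for illustration. \b stands in for a word edge (it matches at either end of a word, where Hazel has a separate anchor for each), \S stands in for [^[:space:]], and [a-z0-9] with case ignored stands in for [[:alnum:]]. The words_matching helper is a made-up name.

```python
import re

def words_matching(pattern, text):
    """Return the space-separated words of text that contain a match.

    Hypothetical helper for illustration; not part of Hazel.
    """
    return [w for w in text.split() if re.search(pattern, w, re.IGNORECASE)]

text = "art tarts farts artists"

# Bare "art" is found inside every one of these words.
print(words_matching("art", text))        # ['art', 'tarts', 'farts', 'artists']

# Anchored to the start of a word (Hazel: [[:<:]]art).
print(words_matching(r"\bart", text))     # ['art', 'artists']

# Anchored at both ends (Hazel: [[:<:]]art[[:>:]]).
print(words_matching(r"\bart\b", text))   # ['art']

# Anything which isn't whitespace (Hazel: [^[:space:]]).
print(re.findall(r"\S", "a b\tc\n"))      # ['a', 'b', 'c']

# Any letter or digit (Hazel: [[:alnum:]]).
print(re.findall("[a-z0-9]", "A-1 b!", re.IGNORECASE))  # ['A', '1', 'b']
```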

Grouping

If you want to match any of several strings, you can surround them all in parentheses and separate them with pipe characters (`|'). For example, (dog|cat) will match either "dog" or "cat".

Whenever you surround something in parentheses, it is remembered by the regex engine. In Hazel, this means you can access the first parenthesized match with %HZV_RE1, the second with %HZV_RE2, and so on.
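Here is a small Python sketch of grouping and remembered matches. It is only an analogy: group(1) and group(2) play the role the text gives to %HZV_RE1 and %HZV_RE2, and how Hazel itself fills those variables is not shown here. The pattern and sample sentence are invented for the example, and re.IGNORECASE is an assumption added to imitate Hazel.

```python
import re

# Two parenthesized groups, each offering a choice of strings.
pattern = "(dog|cat) chases (ball|mouse)"
m = re.search(pattern, "My CAT chases mouse toys", re.IGNORECASE)

# Each group remembers the text it matched, numbered from the left.
print(m.group(1))   # CAT    (compare %HZV_RE1)
print(m.group(2))   # mouse  (compare %HZV_RE2)

# When neither alternative is present, there is no match at all.
print(re.search(pattern, "My bird chases mouse toys", re.IGNORECASE))  # None
```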

Limiters

An atom may be followed by special characters which describe how many instances of the atom will result in a successful match.

A question mark (`?') means that either zero or one instance will match.

A plus sign (`+') means that one or more of the atoms will match.

An asterisk (`*') means that zero or more of the atoms will match.

A curly-bracketed "boundary" of {x}, {x,y}, or {x,} accepts exactly `x' matches, between `x' and `y' matches inclusive, or `x' or more matches, respectively.

If you know exactly how many instances of an atom you want matched, be specific. Don't use the asterisk when you actually want at least one match. An asterisk left unchecked will gobble up the rest of your string, preventing what might have been a more precise match on later characters.
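The limiters, and the way an unchecked asterisk gobbles up text, can be sketched in Python. As before, Python's re module is only a stand-in for Hazel's engine, the full helper is a made-up name, and the sample patterns are invented for the example.

```python
import re

def full(pattern, text):
    """Return True if the whole of text matches pattern, ignoring case.

    Hypothetical helper for illustration; not part of Hazel.
    """
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

# ? accepts zero or one instance of the atom before it.
print(full("colou?r", "color"), full("colou?r", "colour"))   # True True
print(full("colou?r", "colouur"))                            # False

# + accepts one or more; * accepts zero or more.
print(full("go+al", "goooal"), full("go+al", "gal"))         # True False
print(full("go*al", "gal"))                                  # True

# {x}, {x,y}, and {x,} count instances explicitly.
print(full("[0-9]{5}", "90210"), full("[0-9]{5}", "9021"))        # True False
print(full("[0-9]{2,4}", "123"), full("[0-9]{2,4}", "12345"))     # True False
print(full("[0-9]{2,}", "1234567"))                               # True

# An unchecked asterisk runs all the way to the last '>' it can find.
print(re.search("<(.*)>", "<b>bold</b>").group(1))       # b>bold</b

# Being specific about what may repeat gives the more precise match.
print(re.search("<([^>]+)>", "<b>bold</b>").group(1))    # b
```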

Specs

Hazel's regular expression engine uses code from Henry Spencer's regex package, Copyright 1992, 1993, 1994, 1997 Henry Spencer. All rights reserved.

Hazel's regular expressions are not sensitive to case. That is, lowercase letters match both themselves and their uppercase counterparts. She uses "extended" regular expressions exclusively.
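Case insensitivity can be pictured with one last Python sketch. Python is case-sensitive unless told otherwise, so the re.IGNORECASE flag is used here as an assumed stand-in for Hazel's built-in behavior; the sample strings are invented for the example.

```python
import re

# With case ignored, lowercase letters also match their uppercase counterparts.
insensitive = re.compile("hazel", re.IGNORECASE)
print(insensitive.search("HAZEL") is not None)        # True
print(insensitive.search("Hazel Docs") is not None)   # True

# The same holds for ranges inside a character set.
print(re.findall("[a-c]", "ABCD", re.IGNORECASE))     # ['A', 'B', 'C']

# For contrast, a case-sensitive search misses the uppercase text.
print(re.search("hazel", "HAZEL") is not None)        # False
```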

For documentation on Hazel elements which use regular expressions, see entries for Hazel-Choice, Hazel-Regex and Hazel-Subst.

