Regular Expression HOWTO

Introduction
Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the “re” module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like.

You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.
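As a rough sketch of those four kinds of question, the snippet below runs them against a made-up sample string. The text and the patterns are illustrative assumptions, not part of the original HOWTO; the functions used (re.match, re.search, re.sub, re.split) are standard members of the re module.

```python
import re

# Invented sample data, used only for illustration
text = "Contact: alice@example.com, bob@example.org"

# "Does this string match the pattern?" match() only tries at the start of the string.
starts_with_contact = re.match(r"Contact", text) is not None

# "Is there a match anywhere in this string?" search() scans the whole string.
first_at = re.search(r"@", text)

# Modify a string: replace the user part of each address.
redacted = re.sub(r"\w+@", "***@", text)

# Split a string apart on a comma followed by optional whitespace.
parts = re.split(r",\s*", text)

print(starts_with_contact)   # True
print(first_at.start())      # 14
print(redacted)              # Contact: ***@example.com, ***@example.org
print(parts)                 # ['Contact: alice@example.com', 'bob@example.org']
```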

In actual programs, the most common style is to store the match object in a variable, and then check if it was None. This usually looks like:

# Import the regular expression module
import re

# Compile the regular expression pattern and store it in a variable
pattern = re.compile( ... )

# Try to match the pattern at the start of the string and store the result,
# which is either a match object or None
match = pattern.match('string goes here')

# A match object is always truthy and None is falsy, so this tests whether a match was found
if match:
    # If a match was found, print the matched string using the group() method of the match object
    print('Match found: ', match.group())
else:
    # If no match was found, print a message indicating so
    print('No match')
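The pattern in the template above is left elided, so it cannot be run as written. A minimal runnable variant might look like the following; the pattern `[a-z]+` and the helper name `describe` are assumptions chosen purely for illustration.

```python
import re

# Hypothetical pattern: one or more lowercase ASCII letters
pattern = re.compile(r"[a-z]+")

def describe(text):
    """Return a message saying whether the pattern matches at the start of text."""
    match = pattern.match(text)
    if match:
        return "Match found: " + match.group()
    return "No match"

print(describe("tempo"))    # Match found: tempo
print(describe("123abc"))   # No match, because match() only tries at position 0
```

Note that the second call reports no match even though the string contains lowercase letters: match() checks only the beginning of the string, whereas search() would find "abc".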

Two pattern methods return all of the matches for a pattern. findall() returns a list of matching strings:

# Import the regular expression module
import re

# Compile a regular expression pattern that matches one or more digits
pattern = re.compile(r'\d+')

# Use the findall() method to search for all matches of the pattern in the given string
matches = pattern.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')

# Print the list of matching strings
print(matches)

# Output: ['12', '11', '10']

The r prefix, which makes the literal a raw string, is needed in this example because escape sequences in a normal “cooked” string literal that are not recognized by Python (as opposed to those recognized by regular expressions) now result in a DeprecationWarning (a SyntaxWarning as of Python 3.12) and will eventually become a SyntaxError. See The Backslash Plague for more information.

findall() has to create the entire list before it can be returned as the result.

The finditer() method returns a sequence of match object instances as an iterator:

# Import the regular expression module
import re

# Compile a regular expression pattern that matches one or more digits
pattern = re.compile(r'\d+')

# Use the finditer() method to search for all matches in the given string
# and return an iterator of match objects
matches = pattern.finditer('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')

# Loop through the iterator and print the start and end indices of each match
for match in matches:
    print(match.span())

# Output (each span is a (start, end) pair; the end index is exclusive):
# (0, 2)   - the first match, '12', occupies indices 0 to 1
# (22, 24) - the second match, '11', occupies indices 22 to 23
# (40, 42) - the third match, '10', occupies indices 40 to 41

Module-Level Functions
You don't have to create a pattern object and call its methods; the re module also provides top-level functions called match(), search(), findall(), sub(), and so forth. These functions take the same arguments as the corresponding pattern method with the RE string added as the first argument, and return the same kind of result; match() and search(), for example, still return either None or a match object instance.
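The equivalence can be sketched as follows. The sample strings and the `From\s+` pattern are illustrative assumptions; the point is only that the module-level call takes the RE string as an extra first argument and behaves like the method on a compiled pattern.

```python
import re

text = "From author@example.com"   # invented sample input

# Pattern-object style
compiled = re.compile(r"From\s+")
m1 = compiled.match(text)

# Module-level style: same call, with the RE string as the first argument
m2 = re.match(r"From\s+", text)

print(m1.span(), m2.span())                    # (0, 5) (0, 5)

# A failed match still returns None
print(re.match(r"From\s+", "Fromage amk"))     # None

# Other methods have module-level counterparts too
print(re.findall(r"\d+", "3 blind mice, 2 turtle doves"))   # ['3', '2']
```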

Compilation Flags
Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names, a long name such as IGNORECASE and a short, one-letter form such as I. (If you're familiar with Perl's pattern modifiers, the one-letter forms use the same letters; the short form of re.VERBOSE is re.X, for example.) Multiple flags can be specified by bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.
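A small sketch of combining flags, assuming a made-up three-line string: IGNORECASE makes the match case-insensitive, and MULTILINE lets ^ match at the start of every line rather than only at the start of the string.

```python
import re

text = "first line\nSecond line\nsecond thought"   # invented sample input

# Combine two flags with bitwise OR; re.I | re.M would be the short spelling
pattern = re.compile(r"^second", re.IGNORECASE | re.MULTILINE)
hits = pattern.findall(text)

# Without flags, ^ matches only at the very start of the string, and case matters
no_flags = re.findall(r"^second", text)

print(hits)                        # ['Second', 'second']
print(no_flags)                    # []
print(re.I == re.IGNORECASE)       # True: the short and long names are the same flag
```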

Using re.VERBOSE
By now you've probably noticed that regular expressions are a very compact notation, but they're not terribly readable. REs of moderate complexity can become lengthy collections of backslashes, parentheses, and metacharacters, making them difficult to read and understand. For such REs, specifying the re.VERBOSE flag when compiling the regular expression can be helpful, because it allows you to format the regular expression more clearly.

The re.VERBOSE flag has several effects. Whitespace in the regular expression that isn't inside a character class is ignored. This means that an expression such as dog | cat is equivalent to the less readable dog|cat, but [a b] will still match the characters ‘a’, ‘b’, or a space. In addition, you can also put comments inside a RE; comments extend from a # character to the next newline. When used with triple-quoted strings, this enables REs to be formatted more neatly:

# Import the regular expression module
import re

# Compile the regular expression pattern with verbose mode
# The verbose mode allows for comments and whitespace to be added for readability
# The pattern matches a header name and its corresponding value
pat = re.compile(r"""
    \s*                 # Skip leading whitespace
    (?P<header>[^:]+)   # Header name
    \s* :               # Whitespace, and a colon
    (?P<value>.*?)      # The header's value -- *? used to lose the following trailing whitespace
    \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

This is far more readable than:

# Import the regular expression module
import re

# Create a pattern object using the compile() function
# The pattern will match a string that starts with any number of whitespace characters,
# followed by a group named "header" that contains any characters except for a colon,
# followed by any number of whitespace characters,
# followed by a colon,
# followed by a group named "value" that lazily captures the rest of the line,
# followed by optional trailing whitespace up to the end of the string
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

In the first version, the triple-quoted string lets us spread the regular expression over several lines and add comments and whitespace. Because the re.VERBOSE flag is set, the extra whitespace and the comments don’t change what the pattern matches.
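One way to check that the two spellings behave the same is to run both against a sample line, as in the sketch below. The header line is an invented example, and the variable names `verbose_pat` and `compact_pat` are introduced here only for illustration.

```python
import re

# Verbose spelling, as in the example above
verbose_pat = re.compile(r"""
    \s*                 # Skip leading whitespace
    (?P<header>[^:]+)   # Header name
    \s* :               # Whitespace, and a colon
    (?P<value>.*?)      # The header's value, matched lazily
    \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

# Compact spelling of the same pattern
compact_pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

line = "  Subject: Hello world   "   # invented sample header line

m_verbose = verbose_pat.match(line)
m_compact = compact_pat.match(line)

print(m_verbose.group("header"))                       # Subject
print(repr(m_verbose.group("value")))                  # ' Hello world'
print(m_verbose.groupdict() == m_compact.groupdict())  # True
```

The captured value keeps the space that follows the colon, because the pattern skips whitespace before the colon but not after it; calling .strip() on the value is one simple way to drop it.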
