Keyboard Walking – Backreference Wildcards, revisited

Alright, here’s something that’ll make your keyboard dance with joy: Keyboard Walking! No, not the kind where you press every key on your keyboard in a row (although that can be fun too). We’re talking about using backreference wildcards to walk through regular expressions and match patterns.

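Now, if you’ve ever used grep or sed before, you might have noticed those little backslash sequences in some regexes: \d, \w, etc. These often get called “backreference” wildcards, but strictly speaking they are shorthand character classes: \d stands for a digit and \w for a word character. A real backreference is \1, \2, and so on, which matches again whatever an earlier parenthesized group captured. Also note that \d comes from Perl-style regexes: plain grep and grep -E don’t portably understand it, so there you need [0-9] (or grep -P, if you have GNU grep). For example, let’s say we want to match lines that contain a phone number like (555) 1234: the literal (555), a space, and then four digits. We can use a digit class with the {4} quantifier to match those four digits: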

bash
# Search input.txt for a phone number in the format (555) 1234.
# -E selects extended regular expressions, where \( and \) match literal
# parentheses and {4} is a repetition count.
# The space between the area code and the number is matched literally.
# [0-9]{4} matches four digits; grep -E does not portably understand \d.

grep -E '\(555\) [0-9]{4}' input.txt
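To see the difference between a digit class and a true backreference, here is a minimal sketch. The sample lines and variable names are made up for illustration, and the patterns use [0-9] so they should behave the same in any POSIX grep.

```bash
# Hypothetical sample data, not from any real file.
phone_sample='call (555) 1234 today
call 555-1234 today'

doubled_sample='order 4071 shipped
code 1223 accepted'

# Literal "(555)", a space, then four digits (ERE: \( is a literal paren).
phone_hits=$(printf '%s\n' "$phone_sample" | grep -E '\(555\) [0-9]{4}')

# A true backreference (BRE): \1 must repeat whatever \([0-9]\) captured,
# so this finds lines containing the same digit twice in a row.
doubled_hits=$(printf '%s\n' "$doubled_sample" | grep '\([0-9]\)\1')

printf '%s\n' "$phone_hits"
printf '%s\n' "$doubled_hits"
```

The first grep only cares what kind of character appears; the second cares that the same character appears again, which is what makes \1 a backreference.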

But what if we want to do something a little more complex? Let’s say we have a list of email addresses and we only want to match the ones that end in “.com” or “.org”. We can use character classes again, but this time with some parentheses for grouping and alternation:

grep -E '(.*)@([[:alnum:]_-]+)\.(com|org)$' input.txt

This regex matches any run of zero or more characters (the first group), then an at sign, then one or more word characters or hyphens (the second group), then a literal period, then “com” or “org” (the third group), and finally the end of the line. Pretty cool, right?
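The groups are also where real backreferences come in. As a rough sketch (the addresses and variable names below are invented for illustration), sed can refer back to the second group with \2 to pull out just the domain name:

```bash
# Hypothetical sample addresses.
emails='alice@example.com
bob@example.net
carol@my-site.org'

# Same shape as the regex above, written with POSIX classes for portability.
pattern='(.*)@([[:alnum:]_-]+)\.(com|org)$'

# Keep only the lines that end in .com or .org.
matches=$(printf '%s\n' "$emails" | grep -E "$pattern")

# In the replacement, \2 refers back to whatever the second group captured.
# -n with the p flag prints only the lines where the substitution happened.
domains=$(printf '%s\n' "$emails" | sed -E -n "s/$pattern/\2/p")

printf '%s\n' "$matches"
printf '%s\n' "$domains"
```

Storing the pattern in a variable keeps grep and sed in sync, at the cost of having to avoid the / delimiter inside the pattern.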

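But what if we want to match email addresses that have multiple periods in them? For example, an address whose domain has several dot-separated parts, like one ending in “.co.uk”. We can modify our regex to allow for more than one period by letting the second group repeat its dot-separated labels with a quantifier, and by adding co\.uk to the list of endings: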

grep -E '(.*)@([[:alnum:]_-]+(\.[[:alnum:]_-]+)*)\.(com|org|co\.uk)$' input.txt
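A quick way to sanity-check a multi-label pattern like this is to feed it a few made-up addresses. This is only a sketch: the sample data and variable names are assumptions, and it uses POSIX classes with grep -E.

```bash
# Hypothetical sample addresses.
addresses='dave@mail.example.co.uk
erin@example.org
frank@example.uk'

# One label, then zero or more ".label" repeats, then one of the endings.
multi_pattern='(.*)@([[:alnum:]_-]+(\.[[:alnum:]_-]+)*)\.(com|org|co\.uk)$'

multi_hits=$(printf '%s\n' "$addresses" | grep -E "$multi_pattern")

printf '%s\n' "$multi_hits"
```

The address ending in plain “.uk” is left out because none of the listed endings match it.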

Now we can match email addresses that end in “.com”, “.org”, or “.co.uk”. And if you want to get really fancy, you can put a quantifier on a whole group too! For example:

grep -E '(https?://)?([[:alnum:]_-]+\.)+(com|org)$' input.txt

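This regex matches URLs that optionally start with “http” or “https” followed by a colon and two slashes (the first group), followed by one or more labels made of word characters and hyphens, each ending in a period (the second group, repeated with +), followed by “com” or “org” at the end of the line. The first group is optional because we use the ? quantifier to match zero or one occurrences. Because there is no ^ anchor, the match can also start partway through a line.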
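Here is a small sketch of how the optional group and the missing start anchor play out. The URLs and variable names are made up, and the anchored variant is an illustrative tweak, not part of the original regex.

```bash
# Hypothetical sample URLs.
urls='https://www.example.com
example.org
https://example.com/page
ftp://files.example.com'

url_pattern='(https?://)?([[:alnum:]_-]+\.)+(com|org)$'

# Unanchored: the ftp line still matches, starting at "files.example.com".
loose_hits=$(printf '%s\n' "$urls" | grep -E "$url_pattern")

# Anchored with ^: the whole line must be an optional scheme plus a host.
strict_hits=$(printf '%s\n' "$urls" | grep -E "^$url_pattern")

printf '%s\n' "$loose_hits"
printf '%s\n' "$strict_hits"
```

The line with a trailing path is rejected by both versions because $ requires “com” or “org” to be the last thing on the line.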

It’s like a dance party in your terminal, but without all the sweat and awkwardness. Give it a try and let us know what crazy regexes you come up with!

SICORPS