For our TrustED Conf 2021 VR World Tour, we heard from Andrew Patterson, a Senior Data and DevOps Engineer with over a decade of experience in software engineering, on how Regular Expressions can be used for powerful pattern matching, including in Search Console and Google Analytics.
What are Regular Expressions?
Regular Expressions are a syntax used to define a search pattern. They are commonly used in “find” and “find and replace” operations. Where a plain search only works with literal text, regular expressions extend this to let you search for text that matches a pattern instead.
Want to find the email addresses in some text? You could try:
\b[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}\b
But there are many ways to do this!
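As a minimal sketch of that pattern in action, here it is in Python's built-in re module. The sample text and addresses are made up for illustration, and Python's flavour may differ slightly from the tool you end up using.

```python
import re

# The email pattern from above, as a raw string so the backslashes survive.
EMAIL_PATTERN = re.compile(r"\b[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}\b")

# Hypothetical sample text, invented for this example.
sample = "Contact alice@example.com or bob.smith+news@mail.example.org today."

# findall returns every non-overlapping match as a list of strings.
found = EMAIL_PATTERN.findall(sample)
print(found)
```

Both made-up addresses should come back, without the surrounding words.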
And I care because …
You can do cool stuff with Regular Expressions! Both Google Analytics and Google Search Console use them. You’ll find them in plenty of other tools too:
- Text editors like Notepad++, EditPad
- Google Docs, Sheets
They are also found extensively in programming languages.
Here be dragons
Because people seldom agree on anything, there are different versions of regular expressions.
- Not all RegEx are equal
- POSIX and Perl
- Basic and Extended (and Simple)
- Different tools will use different versions, so best to check the reference
- GA and GSC use Google’s RE2 syntax
Regular Expressions (fun)damentals
Syntax
No need to worry about remembering all of the Regular Expression syntax below. The key is to understand the concepts – generally, I’ll look up the reference for the tool I’m using, as there can be differences between them.
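- . – is a wild card, and will match any single character (usually not the new line character, unless a modifier changes this)
- ? – is existential: does it exist or not? It makes the preceding element optional
- * and + – will eat as much as they can: * matches zero or more of the preceding element, + matches one or more
- {n,m} – matches between n and m repetitions. You can omit one of these: {n,} means n or more, and {,m} sets an upper bound of m (not every tool accepts the {,m} form)
- | – has an identity crisis: cat|dog, is it a cat or a dog? It matches either alternative
- ( ) – grouping is interesting, it has two purposes: it can affect the order of operations, but it is also used to extract sections of matching text
- [ ] – just wants to hug everyone: it matches any one of the characters inside the brackets
- ^ and $ – anchor to the start and end, handy if you know how the text you’re looking for starts or ends
- \ – allows you to put any of these special characters in as literal characters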
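To make the list above concrete, here is a small sketch using Python's re module. The helper name matches and all of the sample strings are invented for this example, and other tools may behave a little differently.

```python
import re

def matches(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

# . matches any single character (except a new line by default)
dot = matches(r"c.t", "cat cot c-t ct")

# ? makes the preceding element optional
optional = matches(r"colou?r", "color colour")

# * is zero or more, + is one or more
star = matches(r"ab*", "a ab abbb")
plus = matches(r"ab+", "a ab abbb")

# {n,m} bounds the number of repetitions (here: 2 or 3 digits)
bounded = matches(r"\b\d{2,3}\b", "1 22 333 4444")

# | matches either alternative
either = matches(r"cat|dog", "cat, dog, bird")

# [ ] matches any one of the enclosed characters
hug = matches(r"gr[ae]y", "gray grey groy")

# ^ and $ anchor to the start and end of the text
anchored = bool(re.search(r"^Hello.*world$", "Hello, big world"))

# \ turns a special character into a literal one
literal = matches(r"3\.14", "3.14 3514")
```

Note how the unescaped . would happily match the 5 in 3514, while the escaped \. only matches a real full stop.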
The syntax of regular expressions is built from characters that form expression elements. These elements can represent:
- The possible characters
- A quantifier of how many of these characters are allowed to match
- Grouping to define scope and precedence
- Anchoring to the start or end
- A Boolean “or” operation
- An escape character, to allow for special characters to be made literal
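Grouping is the element that tends to surprise people, so here is a short Python sketch of its two jobs: changing precedence and extracting part of a match. The sample strings are made up for illustration.

```python
import re

# Without a group, | splits the whole pattern: "gra" OR "ey".
ungrouped = [m.group() for m in re.finditer(r"gra|ey", "gray grey")]

# With a group, | only applies inside the brackets: "gr", then a or e, then "y".
grouped = [m.group() for m in re.finditer(r"gr(a|e)y", "gray grey")]

# A group also captures the text it matched, so it can be pulled out.
# Hypothetical text: extract the year and month from a date.
m = re.search(r"(\d{4})-(\d{2})", "Published 2021-05")
year, month = m.group(1), m.group(2)
```

group() with no argument gives the whole match, while group(1) and group(2) give the text captured by the first and second pair of brackets.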
Character classes
Character classes use the escape syntax to allow for quick definitions of ranges of character values. We go through some of the commonly used ones.
It’s good to note that not all tools support the same character classes. POSIX is a little different: its classes look like [:digit:] and can only be used within bracket expressions.
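As a rough illustration, here are a few of the escape-style classes as Python's re module understands them. The sample text is invented, and the exact set of classes a tool supports is something to check in its reference.

```python
import re

# Hypothetical sample text.
text = "Order 66 shipped on 2021-05-04 to user_42"

# \d is a digit
digits = re.findall(r"\d+", text)

# \w is a letter, digit or underscore
words = re.findall(r"\w+", text)

# \s is whitespace (spaces, tabs, new lines)
parts = re.split(r"\s+", "a  b\tc\nd")

# The upper-case version is the opposite: \D is anything that is not a digit
only_digits = re.sub(r"\D", "", text)
```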
Modifiers
Modifiers are used to change how a regular expression is run against the target text. Some tools allow for a wide range of these, others will provide some of them as checkbox items, while others won’t have this functionality. We take a look at some of the common modifiers and how they change the operation of the regular expression.
Single line – This modifies the . to also match the new line character. You can think of it as treating all the text as a single line, since abc.*xyz will then match the whole text even if there are only 3 characters per line.
Modifiers are generally found in programming languages and more advanced tools, where they are appended to the end of the expression with some additional syntax, like /[a-z]+/gi
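Python does not use the /pattern/flags form, so this sketch shows the same ideas with its own spelling: flags passed as an extra argument, or written inline at the start of the pattern. The sample text follows the abc.*xyz example, with 3 characters per line.

```python
import re

text = "abc\ndef\nxyz"

# By default . stops at a new line, so the pattern cannot span lines.
default = re.search(r"abc.*xyz", text)

# re.DOTALL is Python's name for the single line modifier.
single_line = re.search(r"abc.*xyz", text, re.DOTALL)

# The same modifier written inline at the start of the pattern.
inline = re.search(r"(?s)abc.*xyz", text)

# re.IGNORECASE plays the role of the i in /[a-z]+/gi;
# findall already returns every match, which is what the g asks for.
words = re.findall(r"[a-z]+", "Search Console", re.IGNORECASE)
```

Without the modifier there is no match at all. With it, the match covers all three lines.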
There are also things like assertions (lookahead, lookbehind, conditionals, etc.) and the ability to specify characters using octal or hexadecimal. You won’t need these much.
Lazy and greedy
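Regular expressions can be lazy or greedy. Take that sentence as the text to search, with this pattern:
^R.*y
Lazy (stops at the first y): Regular expressions can be lazy
Greedy (runs on to the last y): Regular expressions can be lazy or greedy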
By default, regular expressions are greedy!
The question mark has another purpose: placed after a quantifier, it makes that quantifier lazy. ^R.*?y will use lazy matching.
In some tools, you can also switch to lazy matching globally using a modifier, like U (for ungreedy).
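Here is the same comparison as a short Python sketch. It sticks to the explicit ? form rather than a global modifier, since modifier support varies between tools.

```python
import re

text = "Regular expressions can be lazy or greedy"

# Greedy: .* grabs as much as it can, so the match runs to the last y.
greedy = re.search(r"^R.*y", text).group()

# Lazy: .*? grabs as little as it can, so the match stops at the first y.
lazy = re.search(r"^R.*?y", text).group()

print(greedy)
print(lazy)
```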
Walkthrough
Let’s use the example from earlier.
\b[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}\b
What will match in the text below?
“Some example email addresses are [email protected], [email protected]. Some more examples include: [email protected], [email protected] and really%this(is)@your.example.nz”
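One way to check your answer is to run the pattern yourself. This Python sketch tries the last address from the text, plus a made-up variant without the brackets for comparison.

```python
import re

EMAIL_PATTERN = re.compile(r"\b[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}\b")

# The last address from the text above, exactly as written.
with_brackets = EMAIL_PATTERN.search("really%this(is)@your.example.nz")

# Hypothetical variant without the brackets, for comparison.
without_brackets = EMAIL_PATTERN.search("really%this@your.example.nz")

print(with_brackets)
print(without_brackets.group())
```

The bracketed address does not match at all: the character just before the @ is a ), which is not in [\w.%+-], and the pattern needs at least one allowed character right before the @. The % is allowed, so the variant without brackets matches in full.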
Recently used Regular Expressions
I have used regular expressions a lot in programming and in various tools. Most recently I’ve used them in finding and extracting tokens in URLs, finding specific lines in CSV files, in Google Search Console, and in Google Analytics.
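To give a flavour of those first two uses, here is a rough Python sketch. The URL, the token parameter name and the CSV rows are all hypothetical, invented purely for illustration.

```python
import re

# Hypothetical URL with a token in the query string.
url = "https://example.com/reset?user=42&token=abc123XYZ&lang=en"

# Find "token=" after a ? or &, then capture everything up to the next & or #.
m = re.search(r"[?&]token=([^&#]+)", url)
token = m.group(1) if m else None

# Hypothetical CSV lines: keep rows whose second column ends in .nz
csv_lines = [
    "1,example.nz,active",
    "2,example.com,active",
    "3,shop.example.nz,paused",
]
nz_rows = [line for line in csv_lines if re.search(r"^[^,]*,[^,]*\.nz,", line)]
```

Using [^,]* rather than .* keeps each part of the pattern inside a single column, which avoids the greedy matching surprises described earlier.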
Testing it out
Regular Expression testing sites:
Here are just some useful regular expression testing sites, which you can utilise to test some examples, and build a regular expression.