· engineering  · 16 min read

Regex: Getting to grips with Regular Expressions

A complete rundown on regular expressions

Regular expressions (regex) can initially seem like cryptic magic runes - a Hollywood-esque undecipherable combination of numbers, letters, and symbols that somehow does something.

Yet the logic behind them is of course quite understandable, and once you’ve mastered the rules, regex enables you to tackle complex text manipulation tasks with elegance and efficiency.

Still, many developers shy away from learning it, resorting to workarounds or copying code snippets, instead of taking the opportunity to decipher the beautiful mess that is a regular expression.

But, like most things that appear challenging, taking the time to get to grips with it removes the mystery, and allows you to harness the power of regular expressions.

What are regular expressions?

Regular expressions (regex) have been around since the 1950s. Back then mathematicians were trying to describe the underlying structure of languages. Their work led to the development of regular expressions, whose use in computing blossomed in the late 1960s with text editors and compilers needing to search (and replace) specific patterns within text.

Early implementations varied considerably, and we still see slight differences in implementation today. These variations can be broadly grouped into two main categories.

The flavors of regex

Understanding these different “flavors” is important when working with various tools and programming languages, as the way they handle regexes might differ slightly.

POSIX-based: Standardized and typically simpler/faster implementations found in Unix tools like grep, sed, and awk.
Perl-like: Inspired by Perl’s powerful features, this group offers more advanced capabilities but can be slower. Examples include PCRE (used in PHP and Apache) and Tcl’s implementation used in PostgreSQL.

Even within these groups, there are variations. For instance, MySQL historically used a POSIX-style implementation (versions from 8.0.4 onwards use the ICU library instead), while ECMAScript (JavaScript) has its own built-in syntax. Command-line tools like grep typically use the POSIX flavor as well.

Both flavors share the core concepts of regular expressions, like matching characters, repetitions, and capturing substrings, and basic patterns like [a-z] (lowercase letters) or [0-9] (digits) will usually work the same way across both. Shorthands such as \d come from the Perl family; strict POSIX tools use bracket classes like [[:digit:]] instead.

As a general rule of thumb, POSIX (extended) patterns can usually be used in a Perl-like engine, but not necessarily vice versa, as the Perl-like flavors offer more advanced features such as lookarounds (matching based on surrounding text), lazy quantifiers, non-capturing and named groups, and shorthand character classes (\d, \w, \s).

Using RegEx (in TypeScript)

The JavaScript regex implementation (and thus TypeScript's) belongs to the Perl-like family, so most of the features that developers experienced with other languages (like PHP) are familiar with are available: backreferences, lookaheads and, since ES2018, lookbehinds and named groups. It does lack a few PCRE features, such as atomic groups, possessive quantifiers, and POSIX bracket classes like [[:alpha:]].
Third-party libraries can fill some of these gaps, but they introduce additional dependencies and considerations. Generally speaking, sticking to core JS regex features is the best strategy.

| Feature | PCRE-like regex (PHP, Apache, Nginx, R etc.) | JavaScript/TypeScript regex |
| --- | --- | --- |
| Core functionality | Yes | Yes |
| Backreferences | Yes | Yes |
| Lookarounds | Yes | Yes (lookbehind since ES2018) |
| Custom character classes | Yes | Yes (no POSIX bracket classes such as [[:alpha:]]) |
| Portability | Lower (variations exist) | Higher (within JS engines) |

Writing a pattern in JavaScript:
A pattern is essentially a string of characters, and there are three ways to write one in JavaScript: as a plain string, as a regex literal, or with the RegExp constructor.

const pattern1 = `cat`; // a plain string, not yet a RegExp
const pattern2 = /cat/i; // a regex literal with the "i" (case-insensitive) flag
const pattern3 = new RegExp('cat'); // the RegExp constructor
const pattern4 = new RegExp(pattern2.source, 'i'); // the constructor, reusing pattern2's source with the "i" flag

The constructor allows for more dynamic and reusable patterns than a literal, because the source can be built at runtime. Flags are passed as a second argument, rather than written after the pattern. An invalid source is only detected at runtime, when the constructor throws a SyntaxError.
Regex literals are concise and readable, and recent TypeScript versions (5.5 and later) check their syntax at compile time. Both literals and constructor calls produce a value of type RegExp.
String literals are not validated or typed as patterns in any way.
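
As a rough sketch of the difference, the snippet below compiles a pattern from a string at runtime and falls back to null when the source is not valid regex syntax. The helper name compilePattern is an assumption made for illustration; it is not part of any library.

```typescript
// Hypothetical helper: compile a pattern string, returning null when it is invalid.
function compilePattern(source: string, flags = ''): RegExp | null {
  try {
    return new RegExp(source, flags); // throws SyntaxError on invalid syntax or flags
  } catch (error) {
    if (error instanceof SyntaxError) {
      return null;
    }
    throw error; // anything else is unexpected, so let it propagate
  }
}

const literalPattern: RegExp = /cat/i; // a literal is already typed as RegExp
const dynamicPattern = compilePattern('c[au]t', 'i'); // typed as RegExp | null
const brokenPattern = compilePattern('c[au'); // unterminated character class

console.log(literalPattern.test('Cat')); // true
console.log(dynamicPattern !== null && dynamicPattern.test('CUT')); // true
console.log(brokenPattern); // null
```

Returning null rather than throwing is only one possible design; it simply forces the caller to handle the invalid case explicitly.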

Security considerations

Regular expressions can introduce security vulnerabilities if not used carefully. Most security risks relate to accepting user input to construct a pattern. These risks range from denial-of-service attacks (known as ReDoS, where a pattern with nested or overlapping quantifiers such as (a+)+ is forced into catastrophic backtracking) to injection of unintended pattern syntax or exposing unintended data.

Always be wary of, escape, and validate any user input, but especially if it’s going to be used in a regex pattern (or a function like split which can take a pattern or a string). Where possible, limit the ability for users to provide patterns.

Vulnerabilities can also occur when using patterns with user input. Regex patterns are not foolproof and can often contain subtle edge cases. Here are some best practices to mitigate these risks:

  • Escape User Input: ALWAYS escape any special characters within user input before processing them with regex to prevent unintended interpretations.
  • Thorough Testing: Test your regex patterns with various inputs, including potential edge cases and malicious attempts.
  • Validate Matches: Validate that any matches from the regex parsing are in the expected format and don’t contain unexpected content.
  • Error Handling: Implement good error handling in case unexpected data causes a function to crash.
  • Sanitization: Consider sanitizing user input (removing potentially harmful characters) before processing it with regex for an extra layer of security.
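
A minimal sketch of the first point, assuming the user input should be matched as literal text: every metacharacter is escaped before the string is handed to the RegExp constructor. The function names escapeRegExp and countOccurrences are illustrative, not built-ins.

```typescript
// Escape every character that has a special meaning in a JavaScript pattern.
function escapeRegExp(input: string): string {
  return input.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // "$&" is the matched character
}

// Hypothetical helper: count literal occurrences of user-supplied text.
function countOccurrences(haystack: string, userInput: string): number {
  const safePattern = new RegExp(escapeRegExp(userInput), 'g');
  const matches = haystack.match(safePattern);
  return matches ? matches.length : 0;
}

console.log(escapeRegExp('1+1=2?')); // 1\+1=2\?
console.log(countOccurrences('a.b a.b axb', 'a.b')); // 2
```

Without the escaping step, the "." in the user input would act as a wildcard and the second call would also count "axb".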

Using patterns

Here are some examples of using regular expressions for various tasks in TypeScript. Be wary of over-engineering, though: sometimes simpler string manipulation methods might be the better solution. Write code to be read, not to impress.

  • Validation:

    function isValidEmail(email: string): boolean {
      const emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/; // Basic email format
      return emailRegex.test(email);
    }
    const email1 = 'john.doe@example.com'; // Valid
    const email2 = 'invalid_email'; // Invalid
     
    console.log(isValidEmail(email1)); // true
    console.log(isValidEmail(email2)); // false
  • Searching:

    const text = 'The quick brown fox jumps over the lazy dog';
    const searchTerm = 'fox';
    const searchRegex = new RegExp(searchTerm, 'gi'); // "g" for global, "i" for case-insensitive
     
    const searchResult = searchRegex.exec(text); // Returns the next match or null (with the "g" flag, exec resumes from searchRegex.lastIndex)
     
    if (searchResult) {
      console.log('Found:', searchResult[0]); // "fox" (matched text)
      console.log('Index:', searchResult.index); // 16 (starting index)
    } else {
      console.log('Search term not found');
    }
  • Replacing:

    const text = 'The quick brown fox jumps over the lazy dog';
    const replaceTerm = 'fox';
    const replaceWith = 'cat';
    const replaceRegex = new RegExp(replaceTerm, 'g');
     
    const replacedText = text.replace(replaceRegex, replaceWith);
    console.log(replacedText); // "The quick brown cat jumps over the lazy dog"
  • Counting Occurrences:

    const text = 'The weather is sunny today. It is a sunny day.';
    const countRegex = /\bsunny\b/g; // "\b" for word boundary
     
    const matches = text.match(countRegex);
    console.log('Number of occurrences:', matches?.length ?? 0); // 2
  • Splitting:

    const text = 'apple-banana-cherry-orange';
    const splitRegex = /-/;
     
    const fruits = text.split(splitRegex);
    console.log(fruits); // ["apple", "banana", "cherry", "orange"]
  • Capturing:

    const text = 'My name is John Doe and my ID is 123456';
     
    // Regex pattern with capturing groups
    const nameRegex = /name is (.*?) and my ID is (\d+)/;
     
    const match = nameRegex.exec(text);
     
    if (match) {
      // Access captured groups
      const name = match[1]; // Captured group 1 (name)
      const id = parseInt(match[2], 10); // Captured group 2 (ID) - convert to number
      console.log('Extracted name:', name);
      console.log('Extracted ID:', id);
    } else {
      console.log('No match found');
    }

It’s worth noting that TypeScript has no notion of what a pattern captures: every captured group is typed as a string, whatever it matched. To get stronger types, you’ll need to write a custom function that converts the captures (for example parsing your numbers), or, less ideally, use type assertions.
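
One way such a custom function might look is sketched below. The PersonRecord interface and the parsePerson name are assumptions introduced for this example; the pattern is the one from the capturing example.

```typescript
interface PersonRecord {
  fullName: string;
  id: number;
}

// Hypothetical parser: returns a typed object, or null when the text does not match.
function parsePerson(input: string): PersonRecord | null {
  const match = /name is (.*?) and my ID is (\d+)/.exec(input);
  if (!match) {
    return null;
  }
  // Both captures are plain strings; the ID is converted explicitly.
  return { fullName: match[1], id: parseInt(match[2], 10) };
}

const person = parsePerson('My name is John Doe and my ID is 123456');
console.log(person); // { fullName: 'John Doe', id: 123456 }
console.log(parsePerson('nothing to see here')); // null
```

Callers then work with PersonRecord | null and never touch the raw match array.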

A note on UTF-8 and Unicode

When working with characters beyond the basic ASCII set, there are a few potential pitfalls.

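JavaScript strings are sequences of UTF-16 code units, so characters outside the Basic Multilingual Plane (many emoji, for example) occupy two code units, and a pattern without the u flag treats them as two separate characters. Word boundaries (\b) are defined in terms of the ASCII-only \w, so they may also give unexpected results around accented letters. Some engines, such as Perl and Java, offer \b{g} to match boundaries between “grapheme clusters” (user-perceived characters), but JavaScript does not support it.

Consider using the unicode flag (u) to explicitly enable Unicode support. It makes the engine work on whole code points and enables property escapes such as \p{L} (any letter). Note that even with the flag, \w remains ASCII-only and patterns like [a-z] will not match characters with diacritics (for example à, ä etc), so use \p{L} when you need to match letters in any script. Alternatively consider using libraries specifically designed for working with Unicode and regex.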
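
The short sketch below illustrates these differences. The sample strings are arbitrary, and the non-ASCII characters are written as escapes so the result does not depend on how the source file is encoded.

```typescript
const asciiLetters = /^[a-z]+$/i;   // ASCII letters only
const asciiWord = /^\w+$/u;         // \w stays ASCII-only, even with the u flag
const unicodeLetters = /^\p{L}+$/u; // any Unicode letter (property escapes need the u flag)

const cafe = 'caf\u00e9';  // "café"
const emoji = '\u{1F600}'; // a single emoji, stored as two UTF-16 code units

console.log(asciiLetters.test(cafe));   // false
console.log(asciiWord.test(cafe));      // false
console.log(unicodeLetters.test(cafe)); // true

// Without the u flag "." matches one code unit, so the emoji is not "one character".
console.log(/^.$/.test(emoji));  // false
console.log(/^.$/u.test(emoji)); // true
```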

And of course always test, test, test.

Performance

Straightforward regex patterns can be very efficient for text manipulation, but it’s easy to introduce complexity. Complex regex patterns with many nested elements, backreferences, or lookarounds can take longer to process. It’s also important to note that longer texts are more resource-intensive, and the cost doesn’t always grow linearly: in a backtracking engine, patterns with nested or ambiguous quantifiers can cause processing time to grow exponentially with string length.

  • Keep your patterns as simple as possible.
  • Consider alternative string manipulation techniques (e.g. string splitting), as these may be more performant.
  • Cache results if possible, and pre-compile and reuse the regex object if it’s being used repeatedly.
  • Test and profile: measure how long your regex operations take with real-world data to identify bottlenecks. Consider profiling tools to pinpoint areas for improvement.
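
As an illustration of the first and third points, the sketch below compiles a pattern once and reuses it, then compares a nested quantifier with its simpler equivalent. The names ISO_DATE and countIsoDates are made up for the example, and the input is kept deliberately short so the slow pattern still finishes instantly.

```typescript
// Compiled once and reused for every call (no "g" flag, so test() keeps no state).
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

function countIsoDates(values: string[]): number {
  return values.filter((value) => ISO_DATE.test(value)).length;
}

// (a+)+ accepts exactly the same strings as a+, but on input that almost
// matches, a backtracking engine tries many ways to split the run of "a"s.
const nestedPattern = /^(a+)+$/;
const flatPattern = /^a+$/;
const nearMiss = 'a'.repeat(12) + '!';

console.log(countIsoDates(['2024-01-31', 'not a date', '1999-12-01'])); // 2
console.log(nestedPattern.test('aaa'), flatPattern.test('aaa')); // true true
console.log(nestedPattern.test(nearMiss), flatPattern.test(nearMiss)); // false false
```

Both patterns give the same answers; the difference only shows up in how much work the engine does as the near-miss input grows, which is why profiling with realistic data matters.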

Basic elements of patterns

The most basic of patterns are simply strings of characters, for example cat will match wherever it appears in a string.

  • Metacharacters: characters with a special meaning in a regex pattern, such as . ^ $ ? ( ) [ ] + and others. They form the basis of all but the simplest patterns. To use them literally we need to escape them, usually with a backslash: \. matches a literal full stop.

  • Character Classes: a set of characters that can be matched. You can specify a range of characters (e.g., a-z for lowercase letters) or list individual characters within square brackets ([]).

    const pattern1 = /[aeiou]/; // Matches any vowel (a, e, i, o, u)
    const pattern2 = /[^aeiou]/; // Matches any character except vowels (using negation)
    const pattern3 = /[a-z0-9]/; // Matches any lowercase letter or number
  • Alternation: Use the pipe symbol (|) to specify alternative patterns to match. The engine will try each option in sequence until it finds a match.

    const pattern1 = /red|blue|green/; // Matches "red", "blue", or "green"
    const pattern2 = /ca(t|n)/; // Matches "cat" or "can"
  • Quantifiers: These specify how many times the preceding pattern can be matched. Common quantifiers are:

    • ? Match zero or one time (makes a pattern optional).
    • * Match zero or more times.
    • + Match one or more times.
    • {n} Match exactly n times.
    • {n,} Match n or more times.
    • {n,m} Match at least n and at most m times.
    const pattern1 = /colou?r/; // Matches "color" or "colour" (optional "u")
    const pattern2 = /ab{2}c/; // Matches "abbc" (exactly two "b" characters)
    const pattern3 = /fo{1,3}d/; // Matches "fod", "food", or "foood" (1 to 3 "o" characters)

    Quantifiers are greedy by default, and we can make them lazy by adding a ? symbol after them, which changes how much text they capture. We’ll revisit this further on.

  • Anchors: specify where a pattern must match within the string. Commonly used anchors are:

    • ^ - the beginning of a string (or line with the m flag) - this anchor usually appears at the beginning of the pattern.

    • $ - the end of the string (or line with the m flag) - this anchor usually appears at the end of the pattern.

    • \b - a word boundary (between a word and non-word character or the beginning/end of the string)

      note: the circumflex (^) has a context-dependent interpretation: it is an anchor outside a character class, but a negation when it is the first character inside one.

    const pattern1 = /^The/; // Matches strings that start with "The" (e.g., "The quick brown fox")
    const pattern2 = /fox$/; // Matches strings that end with "fox" (e.g., "The quick brown fox")
    const pattern3 = /\bfox\b/; // Matches the word "fox" (not "foxes" or "firefox")
  • Escaping: earlier we mentioned escaping individual characters with a backslash. Escaping has a second meaning too in regex - to denote an escape sequence. An escape sequence, for example \d, does not escape the literal character d but rather stands for a predefined set of characters (in this case any digit).
    Some common escape sequences are:

    • \d: Matches any single digit (0-9)
    • \w: Matches any word character (alphanumeric and underscore)
    • \s: Matches any whitespace character (space, tab, newline, etc.)
    • \b: Matches a word boundary
    • .: Matches any character except line terminators (unless the s flag is set). Not technically an escape sequence (it doesn’t need a backslash), but behaves like one.
    const pattern1 = /\d{3}-\d{4}/; // Matches phone numbers in the format XXX-XXXX (3 digits followed by "-" and 4 digits)
    const pattern2 = /\b\w+\s+\w+\b/; // Matches two consecutive words (separated by whitespace)
  • Grouping: Using parentheses (...) we can create a subpattern, which is useful both for readability and for applying alternation or a quantifier to several characters at once.
    By default a group is ‘capturing’ - that is, the matched text is captured, numbered according to the order of the opening parentheses, and available later.
    A non-capturing group is made by adding ?: immediately after the opening parenthesis, e.g. (?:...).
    A related construct is (?!...) - this is a “negative lookahead” assertion, which captures nothing and means a match must NOT be followed by the subpattern in the parentheses.

    const pattern1 = /(Mr|Mrs)\. (\w+)/; // Captures title (Mr or Mrs) and name (\w+)
    const pattern2 = /(?:Mr|Mrs)\. (\w+)/; // Does not capture title (Mr or Mrs), only name (\w+)
    const pattern3 = /\d{3}(?!\d)/; // Matches three digits that are not followed by another digit (negative lookahead, nothing captured)
  • Flags: These change the behavior of the regex engine.
    The following common flags are supported in most implementations, but be aware that there are more.

    • u: Enable Unicode mode (matching on whole code points, plus \p{...} property escapes).
    • i: Case insensitive
    • g: Global - by default the first match only is returned. If you wanted to match every occurrence, you would need this flag.
    • m: Multiline - The ^ and $ anchors will match the start and end of each line, rather than the whole string.

    We can of course combine multiple flags, for example /cat/gi.

  • Greedy vs. Lazy Quantifiers:
    Having looked at grouping and capturing, we need to return briefly to quantifiers.
    Quantifiers tell us how often a pattern can appear, but they can also be greedy or lazy in terms of how much of the text they capture, that is, they match either the longest or the shortest string possible.
    By default, quantifiers are greedy; we make them lazy by following them with a ?.
    Note: {n} matches a fixed number of repetitions, so adding ? to it ({n}?), whilst valid, will have no discernible effect.

    Greedy can be useful when we want to capture everything that matches, but lazy can be more precise when you only need the first occurrence or a specific part of the match. Depending on the input, lazy can also be more performant, as the engine may need to backtrack less.

    Understanding greedy and lazy is crucial for both performance, and getting the expected results.

    Example 1
    Consider the string <div>Content inside a div</div><div>Another div</div>

    • Greedy: <div>.*<\/div> - This will capture everything from the first <div> to the last </div> in a single match, (the whole string in this case).
    • Lazy: <div>.*?<\/div> - This will have two matches, one for each <div> element, as one would expect.

    Example 2
    Consider this string: This is the first sentence. This is the second sentence. And this, the third.

    • Greedy: /.*\./g - This will again match the whole string.
    • Lazy: /.*?\./g - This matches as little text as possible, stopping at the first period it encounters each time. As a result, it matches each sentence individually.
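
Both examples can be checked with a few lines of TypeScript. This is only a sketch using the sample strings from the text; the variable names are arbitrary.

```typescript
const markup = '<div>Content inside a div</div><div>Another div</div>';

const greedyDivs = markup.match(/<div>.*<\/div>/g) ?? [];
const lazyDivs = markup.match(/<div>.*?<\/div>/g) ?? [];

console.log(greedyDivs.length); // 1 (the whole string)
console.log(lazyDivs); // [ '<div>Content inside a div</div>', '<div>Another div</div>' ]

const sentences = 'This is the first sentence. This is the second sentence. And this, the third.';

const greedySentences = sentences.match(/.*\./g) ?? [];
const lazySentences = sentences.match(/.*?\./g) ?? [];

console.log(greedySentences.length); // 1
console.log(lazySentences.length); // 3
```

Note that the second and third lazy matches start with the space that follows the previous period, so a trim() may be needed before using them.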

More advanced usage

The basics covered above should be applicable whatever implementation of regular expressions you’re using.
Some of the more advanced features we will be covering next might not be possible with your implementation, especially if it is more POSIX-like than Perl-like.
In PHP, the PCRE (Perl Compatible Regular Expressions) extension provides support for these.

  • Lookarounds allow us to consider the characters before or after our pattern.
    For example, we may wish to match any numbers immediately followed by a currency code (lookahead), or preceded by a currency symbol (lookbehind).
    Both come in a negative form too: a negative lookahead (e.g. numbers not followed by a currency code) and a negative lookbehind (e.g. numbers not preceded by a currency symbol). Lookbehinds are less widely supported than lookaheads: some engines do not implement them at all, and others, such as PCRE, restrict the length of the pattern inside them.

    | Lookaround Type | Notation | Description |
    | --- | --- | --- |
    | Positive Lookahead | (?=pattern) | Matches if followed by a specific pattern (not included in match) |
    | Negative Lookahead | (?!pattern) | Matches if not followed by a specific pattern (not included in match) |
    | Positive Lookbehind | (?<=pattern) | Matches if preceded by a specific pattern (not included in match). Not supported by all engines |
    | Negative Lookbehind | (?<!pattern) | Matches if not preceded by a specific pattern (not included in match). Not supported by all engines |

    Some examples:

    /\d+(?=[A-Z]{3})/ // One or more digits followed by three uppercase letters (positive lookahead, 3000AUD will match 3000)
    /red(?!dit)/  // Match "red" not followed by "dit" (negative lookahead, will match redcoat but not reddit)
    /(?<=[\$€£])\d+/  // Match one or more digits preceded by a currency symbol (positive lookbehind, $3000 will match 3000)
  • More on grouping:
    Non-capturing groups (?:...) can slightly improve performance, because the engine does not need to store the matched text. They should not be confused with atomic groups, written (?>...) in engines such as PCRE, Perl, Java and .NET: there the enclosed pattern forms an atomic unit during matching, hence the name “atomic grouping”, and the engine cannot backtrack into the group once it has matched. While not always necessary, atomic grouping can improve performance in specific scenarios by preventing the regex engine from exploring unnecessary backtracks. JavaScript does not support atomic groups.

    Naming Groups: To give a capturing group a specific name, you can use the syntax (?<name>...) where name is the desired name, and the content within parentheses is the pattern you want to capture. It’s not available in all engines, but when it is, it improves readability and makes accessing the matches easier.

    const pattern1 = /(?:Mr|Mrs|Ms|Miss)\. (\w+)/; // Does not capture title; the group only scopes the alternation
    const pattern2 = /(?<title>Mr|Mrs)\. (?<name>\w+)/; // Captures title and name into named groups (ES2018+; Python and PCRE also accept the (?P<name>...) spelling, which JavaScript rejects)
  • Backreferences: allow us to refer back to text captured earlier in the pattern. This allows us to match repetitions and perform various validations in a single pattern. A backreference is simply written as \n, where n is the number of the captured group (\1, \2 and so on).

    Example - confirming password repetition: Given this json,

    {"password": "1234secure", "confirmation": "1234secure"}
    

    The pattern:

    /"password": "(\w+)",\s*"confirmation": "\1"/
    

    Would only match if password and confirmation were the same

    Here’s how it works:

    1. "password": " - matches the literal key, the colon, and the opening quote of the value.
    2. (\w+) - captures one or more word characters as the password in group 1.
    3. ", - matches the closing quote and the comma separating the key-value pairs.
    4. \s* - matches zero or more whitespace characters.
    5. "confirmation": " - matches the literal key, the colon, and the opening quote of the value.
    6. \1" - the backreference followed by the closing quote. It checks that the value is exactly the same text that was captured by group 1 (\w+).

    (obviously this example is an oversimplification, it requires your JSON to be very specific)

    Note: Backreferences are supported in many but not all engines.
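
To tie the advanced features together, here is a hedged sketch of lookarounds, named groups and backreferences running in TypeScript. It assumes an ES2018-capable engine (a current Node.js or browser); the variable names and sample strings are invented for the example.

```typescript
// Lookarounds: assert on the neighbouring text without including it in the match.
const amountBeforeCode = /\d+(?=[A-Z]{3})/;    // digits followed by a currency code
const amountAfterSymbol = /(?<=[$€£])\d+/;     // digits preceded by a currency symbol
const amountWithoutSymbol = /(?<![$€£\d])\d+/; // digits preceded by neither a symbol nor a digit

console.log('3000AUD'.match(amountBeforeCode)?.[0]); // 3000
console.log('$3000'.match(amountAfterSymbol)?.[0]);  // 3000
console.log(amountWithoutSymbol.test('$3000'));      // false
console.log(amountWithoutSymbol.test('3000'));       // true

// Named groups: captures are available by name on match.groups.
const titledName = /(?<title>Mr|Mrs)\. (?<surname>\w+)/;
const titled = titledName.exec('Dear Mrs. Smith');
console.log(titled?.groups?.title, titled?.groups?.surname); // Mrs Smith

// Backreferences: \1 (or \k<name> for a named group) must repeat the captured text.
const passwordPair = /"password": "(\w+)",\s*"confirmation": "\1"/;
const repeatedWord = /\b(?<word>\w+) \k<word>\b/;

console.log(passwordPair.test('{"password": "1234secure", "confirmation": "1234secure"}')); // true
console.log(passwordPair.test('{"password": "1234secure", "confirmation": "1234"}'));       // false
console.log(repeatedWord.test('it was the the best')); // true
```

The \d inside the negative lookbehind matters: without it, the engine could start the match at the second digit of $3000, which is preceded by a digit rather than a symbol.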

Debugging & Testing regex

Debugging and testing regular expressions can be difficult due to their complexity and subtlety. However, several tools and techniques can help you ensure your regex patterns work correctly.

  • Regex101
    An excellent online regex tester that allows you to input your regex pattern and test strings for various engines. It provides real-time feedback, highlights matches, explains the components of your regex, and offers a library of patterns.
  • regexr
    Another excellent online tool, with very similar functionality.
  • Debuggex
    An online pattern debugging tool, with an interesting way of visualizing your patterns.
  • Browser Developer Tools
    Most modern web browsers come with built-in developer tools that include a console where you can test regex patterns directly. This can be especially useful for debugging regex in your JavaScript code.

Conclusion

Regular expressions are a powerful tool for text manipulation, allowing us to perform complex searches, replacements, and validations efficiently.

Despite its initially intimidating appearance, regex is an essential skill to master, whether you’re validating user input, parsing data, or performing advanced text processing.

Take the time to practice and experiment with regex patterns, and you’ll soon find them an invaluable part of your programming toolkit.

James Babington

About James Babington

A cloud architect and engineer with a wealth of experience across AWS, web development, and security, James enjoys writing about the technical challenges and solutions he's encountered, but most of all he loves it when a plan comes together and it all just works.
