String Manipulation part 2: Regular Expressions

Regular Expressions in Python

Regular Expressions (also known as REs, regexes or regex patterns) are a small and highly specialised programming language used to search for patterns in text strings. Regular Expressions allow you to specify patterns or rules that you want to match in a string. The patterns you define can be for matching:
1) email addresses
2) IP addresses
3) Dates
4) ID numbers
5) Post codes

The same goes for anything else that follows a predictable pattern.

The built-in re module makes regular expressions possible in Python.

Matching Characters

A regular expression pattern is made up of ordinary characters, such as a to z and 0 through 9, and special characters called metacharacters. Most letters and characters simply match themselves. If we were looking to find the word ‘dog’ in a string, we would simply search for ‘dog’. Metacharacters allow us to search for special sequences in the text.

The most common metacharacter is the backslash, \. The backslash can be followed by another character to represent a set of characters. Some examples:

  • \d Matches any digit from 0 to 9.
  • \w Matches any word character: a letter, a digit or the underscore. For plain ASCII text this is equivalent to a-z, A-Z, 0-9 and _.
  • \s Matches any whitespace character such as spaces, tabs and newlines.
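
Here is a small sketch of the three sequences above in action. The sample string and the variable names are made up for illustration; only re.search() and the match object methods come from the re module.

```python
import re

# Hypothetical sample text, chosen so that each sequence has something to match.
sample = "Order 66 shipped on day_3"

# \d matches a single digit; re.search() stops at the first one it finds.
first_digit = re.search(r"\d", sample).group()

# \w matches letters, digits and the underscore, so "_3" is matched by \w\w.
word = re.search(r"day\w\w", sample).group()

# \s matches a whitespace character; here it is the space right after "Order".
first_space = re.search(r"\s", sample)

print(first_digit)          # 6
print(word)                 # day_3
print(first_space.start())  # 5
```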

Other metacharacters include the following:

  • [] Creates a set of characters. The set matches any one of the characters listed inside it.
  • () Groups characters together.
  • . Matches any character except a newline.
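
To make these three metacharacters a little more concrete, here is a minimal sketch. The strings and variable names are invented for the example.

```python
import re

# [] builds a set: [aeiou] matches any single vowel.
vowel = re.search(r"[aeiou]", "rhythm and blues").group()

# . stands in for any single character (except a newline by default).
dot = re.search(r"c.t", "a cat, a cot, a c-t").group()

# () groups characters so that part of the match can be retrieved on its own.
grouped = re.search(r"(ab)c", "xxabcxx")

print(vowel)             # a
print(dot)               # cat
print(grouped.group())   # abc
print(grouped.group(1))  # ab
```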

Modifiers

Modifiers (more commonly called quantifiers) allow you to specify how many times a particular character or group of characters should be matched.

+ Matches the preceding character or group 1 or more times.
* Matches the preceding character or group 0 or more times.
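
The difference between + and * only shows up when the character is missing altogether. A quick sketch, using made-up strings and variable names:

```python
import re

# + needs at least one 'o', so a string with no 'o' at all does not match.
plus_none = re.search(r"go+gle", "ggle")
plus_many = re.search(r"go+gle", "gooogle")

# * is happy with zero occurrences, so the same string now matches.
star_zero = re.search(r"go*gle", "ggle")

print(plus_none)          # None
print(plus_many.group())  # gooogle
print(star_zero.group())  # ggle
```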

Searching for patterns

There are a couple of ways to find a pattern within a string. One way is to use the re.search() function. re.search() scans through the string and returns a match for the first location where the pattern matches. It takes two arguments: a pattern to search for and a string to scan through. Another way is to use the re.match() function. re.match() does not scan through the entire string; it only checks for a match at the beginning of the string.
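
The difference between the two functions is easiest to see side by side. This is only a sketch; the text and variable names are chosen for illustration.

```python
import re

text = "The quick brown fox"

# re.search() looks anywhere in the string.
found = re.search(r"quick", text)

# re.match() only looks at the very beginning, so 'quick' is not found...
not_at_start = re.match(r"quick", text)

# ...but 'The' is, because the string starts with it.
at_start = re.match(r"The", text)

print(found.group(), found.start())  # quick 4
print(not_at_start)                  # None
print(at_start.group())              # The
```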

Let’s work on some examples to explain the concepts introduced above.

To start, let’s write a pattern to search for a word that starts with the letter ‘q’.

>>> import re
>>> test = 'The quick brown fox jumps over the lazy dog'
>>> match = re.search(r'q\w*', test)
>>> match.group()
'quick'

In line 1, we import the re module that enables us to use Python’s Regular Expression engine.
In the next statement, we create a string that contains some words. We then call the re.search() function and supply the regex pattern and the test string as arguments. r'q\w*' is the regular expression. The r tells Python to create a raw string out of the characters that follow. Raw string notation tells Python not to handle backslashes in any special way, so the \w reaches the regular expression engine unchanged.

The ‘q’ in q\w* matches a literal letter ‘q’. \w* then matches any word characters that follow the ‘q’. The * is a modifier to \w that causes the regex to match word characters 0 or more times. In simple English, 'q\w*' means “match a ‘q’ followed by any number of word characters”. Note that the pattern matches the first ‘q’ wherever it appears, even in the middle of a word; in our test string the first ‘q’ happens to start the word ‘quick’.
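
If you only want words that start with ‘q’, one option is the \b sequence, which matches at a word boundary. \b is not covered elsewhere in this article, so treat the following as a sketch; the sentence and variable names are made up.

```python
import re

sentence = "an antique quilt"

# Without \b, the first 'q' found is the one inside 'antique'.
anywhere = re.search(r"q\w*", sentence).group()

# \b only matches at the edge of a word, so the 'q' must start a word.
word_start = re.search(r"\bq\w*", sentence).group()

print(anywhere)    # que
print(word_start)  # quilt
```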

The match = re.search(r'q\w*', test) expression returns a Match Object if the search is successful. If a match is not found, None is returned. match.group() returns the string matched by the regular expression.
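
Because a failed search returns None, calling .group() on the result straight away can fail. A minimal sketch of checking first, where first_q_word is a hypothetical helper name and not part of the re module:

```python
import re

def first_q_word(text):
    """Return the first 'q' followed by word characters, or None if there is none."""
    match = re.search(r"q\w*", text)
    if match is None:
        # No 'q' anywhere in the text, so there is no group to return.
        return None
    return match.group()

print(first_q_word("The quick brown fox"))  # quick
print(first_q_word("The lazy dog"))         # None
```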

Grouping

You can group sets of characters together.

>>> import re
>>> email = 'ndlovu.v@example.com'
>>> match = re.search(r"([\w.]+)@([\w.]+)", email)
>>> match.group()
'ndlovu.v@example.com'
>>> match.group(1)
'ndlovu.v'
>>> match.group(2)
'example.com'


To group text, place it between parentheses. [\w.]+ matches one or more word characters or dots, which covers both the username and the domain in this example. The parentheses around the patterns create the groups. match.group() returns the whole match, while match.group(1) and match.group(2) return the first and second groups. The groups make it easy to break up the email address into its parts, namely the username ‘ndlovu.v’ and the domain ‘example.com’.
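
If you want all the groups at once, the match object also has a groups() method that returns them as a tuple. A short sketch reusing the email example; the variable names username and domain are my own.

```python
import re

email = "ndlovu.v@example.com"
match = re.search(r"([\w.]+)@([\w.]+)", email)

# groups() returns every captured group as a tuple, in order.
username, domain = match.groups()

print(username)  # ndlovu.v
print(domain)    # example.com
```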

Replacing Text

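Regular expressions allow you to do more than just match and extract substrings. You can replace text as well with re.sub(), which takes a pattern, a replacement and the string to work on, and returns a new string:

>>> import re
>>> text = "Hello World"
>>> re.sub('World', 'Universe', text)
'Hello Universe'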
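
re.sub() becomes more interesting when the first argument is a real pattern rather than a fixed word. A small sketch with made-up strings and variable names:

```python
import re

# Replace every run of digits with a '#'.
masked = re.sub(r"\d+", "#", "Room 101, floor 3")

# Collapse every run of whitespace into a single space.
tidy = re.sub(r"\s+", " ", "too    many   spaces")

print(masked)  # Room #, floor #
print(tidy)    # too many spaces
```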

Conclusion

Regular Expressions are a very powerful and useful tool for wrangling text. They are not recommended for simple, fixed strings, where ordinary string methods such as str.replace() or the in operator are easier to read. This article only touched on a few features of regex. I have a lot more to learn about them myself.
