Regular Expressions in Python
Regular Expressions (also known as REs, regexes or regex patterns) are a small and highly specialised programming language used to search for patterns in text strings. Regular Expressions allow you to specify patterns or rules that you want to match in a string. The patterns you define can be for matching:
1) email addresses
2) IP addresses
3) dates
4) ID numbers
5) post codes, or anything else that follows a predictable pattern.
The built-in re module makes regular expressions possible in Python.
Matching Characters
A regular expression pattern is made up of ordinary characters such as a to z and 0 through 9 and special characters called meta characters. Most letters and characters will simply match themselves. If we were looking to find the word ‘dog’ in a string we would simply search for ‘dog’. Meta characters allow us to search for special sequences in the text.
The most common metacharacter is the backslash, \. The backslash can be followed by another character to represent a set of characters. Some examples:
\d matches any digit from 0 to 9.
\w matches any word character: a letter, a digit or an underscore. For ASCII text this is equivalent to a-z, A-Z, 0-9 and _.
\s matches any whitespace character, such as spaces, tabs and newlines.
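As a small illustration of the three sequences above, the sketch below lists what each one picks out of a sample string. It uses re.findall(), which this article does not otherwise cover, purely to collect every match. The sample strings are made up.

```python
import re

sample = "Order 66 shipped to unit_7 on day 3"

# \d matches one digit at a time
digits = re.findall(r"\d", sample)

# \w matches one letter, digit or underscore at a time
word_chars = re.findall(r"\w", "a_1 !")

# \s matches one whitespace character at a time
spaces = re.findall(r"\s", "a b\tc\n")

print(digits)      # ['6', '6', '7', '3']
print(word_chars)  # ['a', '_', '1']
print(spaces)      # [' ', '\t', '\n']
```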
Other meta characters include the following:
[] creates a set of characters and matches any one character from the set.
() groups part of a pattern together.
. matches any single character except a newline.
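A rough sketch of these three meta characters in use. The sample strings are invented for illustration, and re.findall() is used only to list every match.

```python
import re

# [aeiou] is a set: it matches any one of the listed characters
vowels = re.findall(r"[aeiou]", "regex")

# . matches any single character except a newline
dot = re.search(r"d.g", "dig dog dug").group()

# ( ) groups part of a pattern so each part can be retrieved on its own
grouped = re.search(r"(\d\d):(\d\d)", "Lunch at 12:30")

print(vowels)                              # ['e', 'e']
print(dot)                                 # dig
print(grouped.group(1), grouped.group(2))  # 12 30
```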
Modifiers
Modifiers allow you to specify how many times a particular character or group of characters should be matched.
+ matches the preceding character or group 1 or more times.
* matches the preceding character or group 0 or more times.
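The difference between the two modifiers is easiest to see side by side. This is a minimal sketch on a made-up string, again using re.findall() only to list all the matches.

```python
import re

text = "ct cat caat"

# a+ needs at least one 'a', so "ct" is skipped
one_or_more = re.findall(r"ca+t", text)

# a* also allows zero 'a' characters, so "ct" matches too
zero_or_more = re.findall(r"ca*t", text)

print(one_or_more)   # ['cat', 'caat']
print(zero_or_more)  # ['ct', 'cat', 'caat']
```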
Searching for patterns
There are a couple of ways to find a pattern within a string. One way is to use the re.search() function. re.search() scans through the string and returns a Match Object for the first location where the pattern matches. It takes two arguments: a pattern to search for and a string to scan through. Another way is to use the re.match() function. re.match() does not scan through the entire string; it only checks for a match at the beginning of the string.
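To make the difference concrete, here is a short sketch that runs both functions on the same made-up sentence.

```python
import re

test = "The quick brown fox"

found = re.search(r"quick", test)    # scans the whole string
at_start = re.match(r"quick", test)  # only tries the start of the string
the_start = re.match(r"The", test)   # succeeds because "The" is at the start

print(found.group())      # quick
print(at_start)           # None
print(the_start.group())  # The
```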
Let’s work on some examples to explain the concepts introduced above.
To start, let’s write a pattern to search for a word that starts with the letter ‘q’.
>>> import re
>>> test = 'The quick brown fox jumps over the lazy dog'
>>> match = re.search(r'q\w*', test)
>>> match.group()
'quick'
In line 1, we import the re module that enables us to use Python’s Regular Expression engine.
In the next statement, we create a string that contains some words. We then call the re.search() function and supply the regex pattern and test string as arguments. r'q\w*' is the regular expression. The r prefix tells Python to create a raw string out of the characters that follow. In a raw string, Python does not handle backslashes in any special way, so \w reaches the regex engine unchanged.
The ‘q’ in q\w* matches the literal letter ‘q’. \w* matches any word characters that follow the ‘q’. The * is a modifier to \w that causes the regex to match word characters 0 or more times. In simple English, 'q\w*' means “match a ‘q’ followed by any number of word characters”. Because re.search() returns the first match, the result here is ‘quick’.
The match = re.search(r'q\w*', test) statement assigns a Match Object to match if the search is successful. If a match is not found, None is returned. match.group() returns the string matched by the regular expression.
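Since a failed search returns None, calling group() on the result without checking would raise an AttributeError. One possible way to guard against that is sketched below. The extra patterns are made up to show a mid-word match and a failed, case-sensitive match.

```python
import re

test = "The quick brown fox jumps over the lazy dog"

results = {}
for pattern in (r"q\w*", r"z\w*", r"Q\w*"):
    match = re.search(pattern, test)
    # Only call group() when a Match Object was actually returned
    results[pattern] = match.group() if match is not None else None

print(results[r"q\w*"])  # quick
print(results[r"z\w*"])  # zy   (the match starts in the middle of "lazy")
print(results[r"Q\w*"])  # None (matching is case-sensitive by default)
```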
Grouping
You can group sets of characters together.
>>> import re
>>> email = 'ndlovu.v@example.com'
>>> match = re.search(r"([\w.]+)@([\w.]+)", email)
>>> match.group()
'ndlovu.v@example.com'
>>> match.group(1)
'ndlovu.v'
>>> match.group(2)
'example.com'
To group text, place the pattern between parentheses. ([\w.]+) matches one or more word characters or dots, which covers the characters that make up the username in this example. The parentheses around the patterns create the groups. The groups make it easy to break up the email address into its parts, namely the username (ndlovu.v) and the domain (example.com). match.group(1) returns the first group and match.group(2) returns the second.
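When all the groups are needed at once, the Match Object's groups() method returns them together as a tuple. The sketch below reuses the email pattern from this section. The pattern is a simplification for illustration and is not meant as a full email validator.

```python
import re

email = "ndlovu.v@example.com"
match = re.search(r"([\w.]+)@([\w.]+)", email)

username, domain = None, None
if match is not None:
    # groups() returns every captured group, in order, as a tuple
    username, domain = match.groups()

print(username)  # ndlovu.v
print(domain)    # example.com
```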
Replacing Text
Regular expressions allow you to do more than just match and extract substrings. You can replace strings as well:
>>> import re
>>> text = "Hello World"
>>> re.sub('World', 'Universe', text)
'Hello Universe'
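The first argument to re.sub() is itself a regular expression, so it can do more than swap one fixed word for another. A brief sketch on made-up strings follows. The second call uses the \1 and \2 backreferences, which refer to the groups captured by the pattern.

```python
import re

text = "Call 555 1234 or 555 9876"

# Replace every run of digits with a placeholder
masked = re.sub(r"\d+", "#", text)

# \1 and \2 in the replacement stand for the first and second group
swapped = re.sub(r"(\d+)/(\d+)", r"\2/\1", "Due 31/12")

print(masked)   # Call # # or # #
print(swapped)  # Due 12/31
```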
Conclusion
Regular Expressions are a very powerful and useful tool for wrangling text. For simple tasks on fixed strings, such as checking whether one string contains another, ordinary string methods are usually simpler than a regular expression. This article only touched on a few features of regex. I have a lot more to learn about them myself.
Thanks!