A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are popularly known as regex or regexp.
Usually, such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation.
Large-scale text processing in data science projects requires manipulation of textual data. Regular expression processing is supported by many programming languages, including Python. Python's standard library provides the re module for this purpose.
Since most of the functions defined in the re module work with raw strings, let us first understand what raw strings are.
Raw Strings
Regular expressions use the backslash character (\) to indicate special forms or to allow special characters to be used without invoking their special meaning. Python, on the other hand, uses the same character as the escape character in string literals. To avoid this clash, regular expression patterns are usually written in Python's raw string notation.
A string becomes a raw string if it is prefixed with r or R before the opening quotation mark. Hence 'Hello' is a normal string, whereas r'Hello' is a raw string.
>>> normal = "Hello"
>>> print(normal)
Hello
>>> raw = r"Hello"
>>> print(raw)
Hello
In normal circumstances, there is no difference between the two. However, when an escape sequence is embedded in the string, the normal string interprets it, whereas the raw string leaves the backslash and the character after it untouched.
>>> normal = "Hello\nWorld"
>>> print(normal)
Hello
World
>>> raw = r"Hello\nWorld"
>>> print(raw)
Hello\nWorld
In the above example, when the normal string is printed, the escape sequence '\n' is processed to introduce a newline. However, because of the raw string prefix 'r', the escape sequence is not translated and is printed exactly as written.
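The difference matters most for regular expressions that use sequences such as \b. The following is a small sketch, with illustrative variable names that are not part of the original text, showing how the same pattern behaves with and without the r prefix:

```python
import re

# In a normal string "\b" is the backspace character (0x08);
# in a raw string the backslash is kept, so re sees a word boundary.
normal_pattern = "\bcat\b"
raw_pattern = r"\bcat\b"

text = "a cat sat"

normal_result = re.search(normal_pattern, text)  # searches for backspace + "cat" + backspace
raw_result = re.search(raw_pattern, text)        # searches for the whole word "cat"

print(len("\n"), len(r"\n"))  # 1 2
print(normal_result)          # None
print(raw_result.group())     # cat
```

The normal string silently changes the pattern, which is why raw strings are the usual choice for regex patterns.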
Metacharacters
Most letters and characters simply match themselves. However, some characters are special metacharacters and don't match themselves. Metacharacters have a special meaning, similar to * in a wildcard pattern.
Here's a complete list of the metacharacters −
. ^ $ * + ? { } [ ] \ | ( )
The square brackets [ and ] indicate a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be given by separating two characters with a '-'.
Sr.No. | Metacharacters & Description |
---|---|
1 | [abc] Matches any of the characters a, b, or c. |
2 | [a-c] Uses a range to express the same set of characters. |
3 | [a-z] Matches only lowercase letters. |
4 | [0-9] Matches only digits. |
5 | '^' complements the character set in []. [^5] will match any character except '5'. |
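As a quick illustration of the sets in the table above, here is a minimal sketch using re.findall(); the sample strings are made up for this example:

```python
import re

text = "Room 5B, floor 12"

lowercase = re.findall(r"[a-z]", text)   # every lowercase letter, one at a time
digits = re.findall(r"[0-9]", text)      # every digit, one at a time
not_five = re.findall(r"[^5]", "1525")   # every character except '5'

print(lowercase)  # ['o', 'o', 'm', 'f', 'l', 'o', 'o', 'r']
print(digits)     # ['5', '1', '2']
print(not_five)   # ['1', '2']
```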
\ is an escaping metacharacter. When followed by various characters, it forms various special sequences. If you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
Predefined sets of characters represented by such special sequences beginning with '\' are listed below −
Sr.No. | Metacharacters & Description |
---|---|
1 | \d Matches any decimal digit; for ASCII text this is equivalent to the class [0-9]. |
2 | \D Matches any non-digit character; this is equivalent to the class [^0-9]. |
3 | \s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]. |
4 | \S Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v]. |
5 | \w Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_]. |
6 | \W Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_]. |
7 | . Matches any single character except newline '\n'. |
8 | ? Matches 0 or 1 occurrence of the pattern to its left. |
9 | + Matches 1 or more occurrences of the pattern to its left. |
10 | * Matches 0 or more occurrences of the pattern to its left. |
11 | \b Matches the boundary between a word and a non-word character; \B is the opposite of \b. |
12 | [..] Matches any single character in the square brackets, and [^..] matches any single character not in the square brackets. |
13 | \ It is used to escape special characters, for example \. to match a period or \+ to match a plus sign. |
14 | {n,m} Matches at least n and at most m occurrences of the preceding pattern. |
15 | a|b Matches either a or b. |
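The special sequences and quantifiers listed above can be combined freely. A minimal sketch, assuming a made-up sample sentence:

```python
import re

text = "Order 66 shipped on 2024-05-17 to box 7"

numbers = re.findall(r"\d+", text)              # runs of one or more digits
date = re.search(r"\d{4}-\d{2}-\d{2}", text)    # exact repetition counts
two_to_four = re.findall(r"\b\d{2,4}\b", text)  # whole numbers of 2 to 4 digits
spellings = re.findall(r"colou?r", "color colour")  # the 'u' is optional

print(numbers)       # ['66', '2024', '05', '17', '7']
print(date.group())  # 2024-05-17
print(two_to_four)   # ['66', '2024', '05', '17']
print(spellings)     # ['color', 'colour']
```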
Python's re module provides useful functions for finding a match, searching for a pattern, substituting a matched string with another string, and so on.
The re.match() Function
This function attempts to match the RE pattern at the start of the string, with optional flags. Following is the syntax for this function −
re.match(pattern, string, flags=0)
Here is the description of the parameters −
Sr.No. | Parameter & Description |
---|---|
1 | pattern This is the regular expression to be matched. |
2 | string This is the string, which would be searched to match the pattern at the beginning of the string. |
3 | flags You can specify different flags using bitwise OR (|). These are modifiers, which are listed in the flags table below. |
The re.match() function returns a match object on success, None on failure. A match object instance contains information about the match: where it starts and ends, the substring it matched, etc.
The match object's start() method returns the starting position of the match in the string, and end() returns the position just past its end.
If the pattern is not found, re.match() returns None instead of a match object.
We use the group(num) or groups() method of the match object to get the matched expression.
Sr.No. | Match Object Methods & Description |
---|---|
1 | group(num=0) This method returns the entire match (or the specific subgroup num) |
2 | groups() This method returns all matching subgroups in a tuple (empty if there weren't any) |
Example
import re
line = "Cats are smarter than dogs"
# match() succeeds only if the pattern matches at the very start of the string
matchObj = re.match(r'Cats', line)
print(matchObj.start(), matchObj.end())
print("matchObj.group() : ", matchObj.group())
It will produce the following output −
0 4
matchObj.group() :  Cats
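The example above has no subgroups, so groups() would return an empty tuple. A minimal sketch of group(num) and groups() with parenthesized subgroups, using a pattern chosen only for illustration:

```python
import re

line = "Cats are smarter than dogs"

# Three parenthesized subgroups capture the three words of interest.
match_obj = re.match(r"(\w+) are (\w+) than (\w+)", line)

if match_obj:
    print(match_obj.group())    # Cats are smarter than dogs
    print(match_obj.group(1))   # Cats
    print(match_obj.group(2))   # smarter
    print(match_obj.groups())   # ('Cats', 'smarter', 'dogs')
```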
The re.search() Function
This function searches for the first occurrence of the RE pattern within the string, with optional flags. Following is the syntax for this function −
re.search(pattern, string, flags=0)
Here is the description of the parameters −
Sr.No. | Parameter & Description |
---|---|
1 | pattern This is the regular expression to be matched. |
2 | string This is the string, which would be searched to match the pattern anywhere in the string. |
3 | flags You can specify different flags using bitwise OR (|). These are modifiers, which are listed in the flags table below. |
The re.search() function returns a match object on success, None on failure. We use the group(num) or groups() method of the match object to get the matched expression.
Sr.No. | Match Object Methods & Description |
---|---|
1 | group(num=0) This method returns the entire match (or the specific subgroup num) |
2 | groups() This method returns all matching subgroups in a tuple (empty if there weren't any) |
Example
import re
line = "Cats are smarter than dogs"
# search() scans the whole string for the first match
matchObj = re.search(r'than', line)
print(matchObj.start(), matchObj.end())
print("matchObj.group() : ", matchObj.group())
It will produce the following output −
17 21
matchObj.group() :  than
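Because re.search() returns None on failure, calling start() or group() on the result without a check raises an AttributeError. A minimal sketch of a guard, where find_word is a hypothetical helper written for this example and not part of the re module:

```python
import re

def find_word(word, text):
    """Return the (start, end) span of the first occurrence of word, or None."""
    # re.escape() makes sure any metacharacters in word are matched literally.
    match_obj = re.search(re.escape(word), text)
    if match_obj is None:
        return None
    return match_obj.span()

line = "Cats are smarter than dogs"
print(find_word("than", line))    # (17, 21)
print(find_word("birds", line))   # None
```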
Matching Vs Searching
Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).
Example
import re
line = "Cats are smarter than dogs"
# match() is anchored at the start of the string, so "dogs" is not found
matchObj = re.match(r'dogs', line, re.M|re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")
# search() scans the whole string
searchObj = re.search(r'dogs', line, re.M|re.I)
if searchObj:
    print("search --> searchObj.group() : ", searchObj.group())
else:
    print("Nothing found!!")
When the above code is executed, it produces the following output −
No match!!
search --> searchObj.group() :  dogs
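Note that the re.M flag does not make match() try the start of every line; it only changes what ^ and $ mean. A small sketch of this, using a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# match() still looks only at position 0, even with re.M.
at_start = re.match(r"second", text, re.M)

# search() with ^ and re.M finds the pattern at the start of the second line.
line_start = re.search(r"^second", text, re.M)

print(at_start)            # None
print(line_start.start())  # 11
```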
The re.findall() Function
The findall() function returns all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.
Syntax
re.findall(pattern, string, flags=0)
Parameters
Sr.No. | Parameter & Description |
---|---|
1 | pattern This is the regular expression to be matched. |
2 | string This is the string, which would be searched to match the pattern anywhere in the string. |
3 | flags You can specify different flags using bitwise OR (|). These are modifiers, which are listed in the flags table below. |
Example
import re
string = "Simple is better than complex."
obj = re.findall(r"ple", string)
print(obj)
It will produce the following output −
['ple', 'ple']
The following code obtains the list of words in a sentence with the help of the findall() function. Because \w* can also match zero characters, the result contains empty strings between the words.
import re
string = "Simple is better than complex."
obj = re.findall(r"\w*", string)
print(obj)
It will produce the following output −
['Simple', '', 'is', '', 'better', '', 'than', '', 'complex', '', '']
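Two variations are worth sketching here: using \w+ avoids the empty matches, and a pattern with more than one group makes findall() return tuples instead of strings. The key=value sample string is an assumption made for this example:

```python
import re

string = "Simple is better than complex."

# \w+ requires at least one character, so no empty strings are returned.
words = re.findall(r"\w+", string)

# With two groups in the pattern, each result is a tuple of the group contents.
pairs = re.findall(r"(\w+)=(\d+)", "width=80 height=24")

print(words)  # ['Simple', 'is', 'better', 'than', 'complex']
print(pairs)  # [('width', '80'), ('height', '24')]
```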
The re.sub() Function
One of the most important functions in the re module is sub().
Syntax
re.sub(pattern, repl, string, count=0, flags=0)
This function replaces occurrences of the RE pattern in string with repl. All occurrences are substituted unless count is provided, in which case at most count occurrences are replaced. The function returns the modified string.
Example
import re
phone = "2004-959-559 # This is Phone Number"
# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)
# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)
It will produce the following output −
Phone Num :  2004-959-559
Phone Num :  2004959559
Example
The following example uses the sub() function to substitute all occurrences of the word is with was −
import re
string = "Simple is better than complex. Complex is better than complicated."
obj = re.sub(r'is', r'was', string)
print(obj)
It will produce the following output −
Simple was better than complex. Complex was better than complicated.
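The count argument, backreferences in repl, and a function as repl are not shown above. A minimal sketch of all three follows; the date string and the seven-letter threshold are assumptions made for illustration:

```python
import re

string = "Simple is better than complex. Complex is better than complicated."

# count=1 replaces only the first occurrence.
first_only = re.sub(r"\bis\b", "was", string, count=1)

# \1, \2 and \3 in the replacement refer to the captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-05-17")

# repl may also be a function that receives each match object.
shouted = re.sub(r"\b\w{7,}\b", lambda m: m.group().upper(), string)

print(first_only)  # Simple was better than complex. Complex is better than complicated.
print(swapped)     # 17/05/2024
print(shouted)     # Simple is better than COMPLEX. COMPLEX is better than COMPLICATED.
```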
The re.compile() Function
The compile() function compiles a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods.
Syntax
re.compile(pattern, flags=0)
Flags
Sr.No. | Modifier & Description |
---|---|
1 | re.I Performs case-insensitive matching. |
2 | re.L Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B). In Python 3 it can be used only with bytes patterns. |
3 | re.M Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string). |
4 | re.S Makes a period (dot) match any character, including a newline. |
5 | re.U Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B. In Python 3 this is already the default for str patterns. |
6 | re.X Permits "cuter" regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker. |
The sequence −
prog = re.compile(pattern)
result = prog.match(string)
is equivalent to −
result = re.match(pattern, string)
But using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
Example
import re
string = "Simple is better than complex. Complex is better than complicated."
pattern = re.compile(r'is')
obj = pattern.match(string)   # None: the string does not start with "is"
obj = pattern.search(string)
print(obj.start(), obj.end())
obj = pattern.findall(string)
print(obj)
obj = pattern.sub(r'was', string)
print(obj)
It will produce the following output −
7 9
['is', 'is']
Simple was better than complex. Complex was better than complicated.
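Flags can be passed to compile() once and are then remembered by the compiled object. A minimal sketch, with a list of sample lines assumed for illustration:

```python
import re

# re.I is stored in the compiled pattern, so every later call is case-insensitive.
pattern = re.compile(r"\bcomplex\b", re.I)

lines = [
    "Simple is better than complex.",
    "Complex is better than complicated.",
    "Flat is better than nested.",
]

# Reuse the same compiled object for every line.
hits = [line for line in lines if pattern.search(line)]

print(hits)
# ['Simple is better than complex.', 'Complex is better than complicated.']
```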
The re.finditer() Function
This function returns an iterator yielding match objects over all non-overlapping matches for the RE pattern in string.
Syntax
re.finditer(pattern, string, flags=0)
Example
import re
string = "Simple is better than complex. Complex is better than complicated."
pattern = re.compile(r'is')
iterator = pattern.finditer(string)
# Each item yielded by the iterator is a match object
for match in iterator:
    print(match.span())
It will produce the following output −
(7, 9)
(39, 41)
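Since finditer() yields full match objects, the position and the matched text are available together. A short sketch that collects both, using an illustrative sentence:

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Collect (start, end, matched text) for every word ending in "ly".
found = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+ly\b", text)]

for start, end, word in found:
    print(start, end, word)
# 7 16 carefully
# 40 47 quickly
```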
Use Cases of Python Regex
Finding all Adverbs
findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if a writer wanted to find all of the adverbs in some text, they might use findall() in the following manner −
import re
text = "He was carefully disguised but captured quickly by police."
# \w+ly\b: a run of word characters ending in "ly" at a word boundary
obj = re.findall(r"\w+ly\b", text)
print(obj)
It will produce the following output −
['carefully', 'quickly']
Finding words starting with vowels
import re
text = 'Errors should never pass silently. Unless explicitly silenced.'
# \b[aeiouAEIOU]\w+: a word boundary, a vowel, then the rest of the word
obj = re.findall(r'\b[aeiouAEIOU]\w+', text)
print(obj)
It will produce the following output −
['Errors', 'Unless', 'explicitly']
Regular Expression Modifiers: Option Flags
The functions of the re module accept an optional flags argument that controls various aspects of matching. You can provide multiple modifiers by combining them with bitwise OR (|), as shown previously. Each modifier is one of these −
Sr.No. | Modifier & Description |
---|---|
1 | re.I Performs case-insensitive matching. |
2 | re.L Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B). In Python 3 it can be used only with bytes patterns. |
3 | re.M Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string). |
4 | re.S Makes a period (dot) match any character, including a newline. |
5 | re.U Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B. In Python 3 this is already the default for str patterns. |
6 | re.X Permits "cuter" regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker. |
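The flags above can be sketched in action as follows. The patterns and sample strings are assumptions chosen for illustration:

```python
import re

# re.X: whitespace and # comments inside the pattern are ignored.
number = re.compile(r"""
    \d+   # integer part
    \.    # decimal point
    \d*   # fractional digits
""", re.X)
decimal = number.search("pi is roughly 3.14159").group()

# re.I | re.M: case-insensitive, and ^ matches at the start of every line.
starts = re.findall(r"^python", "Python\npython\nPYTHON rocks", re.I | re.M)

# re.S: the dot also matches a newline.
without_s = re.search(r"a.b", "a\nb")
with_s = re.search(r"a.b", "a\nb", re.S)

print(decimal)               # 3.14159
print(starts)                # ['Python', 'python', 'PYTHON']
print(without_s)             # None
print(with_s is not None)    # True
```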
Regular Expression Patterns
Except for the metacharacters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can escape a metacharacter by preceding it with a backslash.
The following table lists the regular expression syntax that is available in Python −
Sr.No. | Pattern & Description |
---|---|
1 | ^ Matches the beginning of the string; with re.M, also the beginning of each line. |
2 | $ Matches the end of the string (or just before a trailing newline); with re.M, also the end of each line. |
3 | . Matches any single character except newline. Using the re.S option allows it to match newline as well. |
4 | […] Matches any single character in brackets. |
5 | [^…] Matches any single character not in brackets. |
6 | re* Matches 0 or more occurrences of the preceding expression. |
7 | re+ Matches 1 or more occurrences of the preceding expression. |
8 | re? Matches 0 or 1 occurrence of the preceding expression. |
9 | re{n} Matches exactly n occurrences of the preceding expression. |
10 | re{n,} Matches n or more occurrences of the preceding expression. |
11 | re{n,m} Matches at least n and at most m occurrences of the preceding expression. |
12 | a|b Matches either a or b. |
13 | (re) Groups regular expressions and remembers matched text. |
14 | (?imx) Turns on the i, m, or x options. In Python these inline flags apply to the whole pattern and should be placed at its start. |
15 | (?-imx) Turns off the i, m, or x options. Python's re module does not accept this bare form; use (?-imx:re) instead. |
16 | (?:re) Groups regular expressions without remembering matched text. |
17 | (?imx:re) Turns on the i, m, or x options within the parentheses (Python 3.6 and later). |
18 | (?-imx:re) Turns off the i, m, or x options within the parentheses (Python 3.6 and later). |
19 | (?#…) Comment. |
20 | (?=re) Lookahead: specifies a position using a pattern, without consuming any characters. |
21 | (?!re) Negative lookahead: specifies a position using pattern negation, without consuming any characters. |
22 | (?>re) Matches an independent pattern without backtracking (Python 3.11 and later). |
23 | \w Matches word characters. |
24 | \W Matches nonword characters. |
25 | \s Matches whitespace. Equivalent to [ \t\n\r\f\v]. |
26 | \S Matches nonwhitespace. |
27 | \d Matches digits. Equivalent to [0-9] for ASCII text. |
28 | \D Matches nondigits. |
29 | \A Matches beginning of string. |
30 | \Z Matches only at the end of the string. |
31 | \z Matches end of string in Perl-style engines; older Python versions reject it, so prefer \Z. |
32 | \G Matches the point where the last match finished. Not supported by Python's re module. |
33 | \b Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets. |
34 | \B Matches nonword boundaries. |
35 | \n, \t, etc. Matches newlines, carriage returns, tabs, etc. |
36 | \1…\9 Matches nth grouped subexpression. |
37 | \10 Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code. |
Regular Expression Examples
Literal characters
Sr.No. | Example & Description |
---|---|
1 | python Match “python”. |
Character classes
Sr.No. | Example & Description |
---|---|
1 | [Pp]ython Match “Python” or “python” |
2 | rub[ye] Match “ruby” or “rube” |
3 | [aeiou] Match any one lowercase vowel |
4 | [0-9] Match any digit; same as [0123456789] |
5 | [a-z] Match any lowercase ASCII letter |
6 | [A-Z] Match any uppercase ASCII letter |
7 | [a-zA-Z0-9] Match any of the above |
8 | [^aeiou] Match anything other than a lowercase vowel |
9 | [^0-9] Match anything other than a digit |
Special Character Classes
Sr.No. | Example & Description |
---|---|
1 | . Match any character except newline |
2 | \d Match a digit: [0-9] |
3 | \D Match a nondigit: [^0-9] |
4 | \s Match a whitespace character: [ \t\r\n\f] |
5 | \S Match nonwhitespace: [^ \t\r\n\f] |
6 | \w Match a single word character: [A-Za-z0-9_] |
7 | \W Match a nonword character: [^A-Za-z0-9_] |
Repetition Cases
Sr.No. | Example & Description |
---|---|
1 | ruby? Match “rub” or “ruby”: the y is optional |
2 | ruby* Match “rub” plus 0 or more ys |
3 | ruby+ Match “rub” plus 1 or more ys |
4 | \d{3} Match exactly 3 digits |
5 | \d{3,} Match 3 or more digits |
6 | \d{3,5} Match 3, 4, or 5 digits |
Nongreedy repetition
This matches the smallest number of repetitions −
Sr.No. | Example & Description |
---|---|
1 | <.*> Greedy repetition: matches “<python>perl>” |
2 | <.*?> Nongreedy: matches “<python>” in “<python>perl>” |
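The two rows above can be checked directly. A minimal sketch using the same sample string:

```python
import re

text = "<python>perl>"

# Greedy: .* grabs as much as possible, then backs off to the last '>'.
greedy = re.search(r"<.*>", text).group()

# Nongreedy: .*? grabs as little as possible, stopping at the first '>'.
lazy = re.search(r"<.*?>", text).group()

print(greedy)  # <python>perl>
print(lazy)    # <python>
```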
Grouping with Parentheses
Sr.No. | Example & Description |
---|---|
1 | \D\d+ No group: + repeats \d |
2 | (\D\d)+ Grouped: + repeats \D\d pair |
3 | ([Pp]ython(, )?)+ Match “Python”, “Python, python, python”, etc. |
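The effect of the parentheses on what + repeats can be sketched as follows, with a made-up input string:

```python
import re

text = "a1b2c3"

# Without a group, + applies only to \d: one nondigit, then digits.
no_group = re.search(r"\D\d+", text).group()

# With a group, + repeats the whole nondigit-digit pair.
grouped_match = re.search(r"(\D\d)+", text)

print(no_group)                # a1
print(grouped_match.group())   # a1b2c3
print(grouped_match.group(1))  # c3  (a repeated group keeps its last repetition)
```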
Backreferences
This matches a previously matched group again −
Sr.No. | Example & Description |
---|---|
1 | ([Pp])ython&\1ails Match python&pails or Python&Pails |
2 | (['"])[^\1]*\1 Single or double-quoted string. \1 matches whatever the 1st group matched. \2 matches whatever the 2nd group matched, etc. Note that in Python \1 inside [] is not a backreference, so (['"]).*?\1 is the more reliable form. |
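A minimal sketch of backreferences in Python follows. The sample sentences are assumptions, and the quoted-string pattern uses a nongreedy .*? rather than a negated set:

```python
import re

# \1 must repeat exactly what group 1 matched: this finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "Paris in the the spring")

# The closing quote must be the same character as the opening quote.
quoted = re.findall(r"""(['"])(.*?)\1""", "He said \"hi\" and 'bye'.")

print(doubled.group())   # the the
print(doubled.group(1))  # the
print(quoted)            # [('"', 'hi'), ("'", 'bye')]
```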
Alternatives
Sr.No. | Example & Description |
---|---|
1 | python|perl Match “python” or “perl” |
2 | rub(y|le) Match “ruby” or “ruble” |
3 | Python(!+|\?) “Python” followed by one or more ! or one ? |
Anchors
Anchors specify the match position.
Sr.No. | Example & Description |
---|---|
1 | ^Python Match “Python” at the start of a string or internal line |
2 | Python$ Match “Python” at the end of a string or line |
3 | \APython Match “Python” at the start of a string |
4 | Python\Z Match “Python” at the end of a string |
5 | \bPython\b Match “Python” at a word boundary |
6 | \brub\B \B is nonword boundary: match “rub” in “rube” and “ruby” but not alone |
7 | Python(?=!) Match “Python”, if followed by an exclamation point. |
8 | Python(?!!) Match “Python”, if not followed by an exclamation point. |
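Several of the anchors above can be sketched together. The sample strings are assumptions made for this example:

```python
import re

two_lines = "Python\nPython"

# With re.M, ^ matches at the start of every line; \A only at the start of the string.
caret_hits = re.findall(r"^Python", two_lines, re.M)
string_start_hits = re.findall(r"\APython", two_lines, re.M)

# \B requires that "rub" is followed by another word character.
in_ruby = re.search(r"\brub\B", "ruby")
alone = re.search(r"\brub\B", "rub")

# Lookaheads test what follows without consuming it.
sample = "Python! Python? Python!"
followed = re.findall(r"Python(?=!)", sample)
not_followed = re.findall(r"Python(?!!)", sample)

print(len(caret_hits), len(string_start_hits))  # 2 1
print(in_ruby.group(), alone)                   # rub None
print(len(followed), len(not_followed))         # 2 1
```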
Special Syntax with Parentheses
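Sr.No. | Example & Description |
---|---|
1 | R(?#comment) Matches “R”. All the rest is a comment |
2 | R(?i)uby Case-insensitive while matching “uby” (Perl-style; Python expects a global flag such as (?i) at the start of the pattern, so prefer the next form) |
3 | R(?i:uby) Case-insensitive while matching “uby” only |
4 | rub(?:y|le) Group only without creating \1 backreference |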
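The difference between a capturing and a non-capturing group is easiest to see with findall(), which returns group contents when the pattern has a capturing group. A minimal sketch, assuming Python 3.6 or later for the scoped (?i:...) form:

```python
import re

text = "ruby ruble"

# A capturing group makes findall() return only what the group matched.
capturing = re.findall(r"rub(y|le)", text)

# A non-capturing group makes findall() return the whole match.
non_capturing = re.findall(r"rub(?:y|le)", text)

# (?i:...) is case-insensitive only inside the parentheses.
scoped_upper = re.search(r"R(?i:uby)", "RUBY")
scoped_lower = re.search(r"R(?i:uby)", "rUBY")   # the leading R is still case-sensitive

print(capturing)             # ['y', 'le']
print(non_capturing)         # ['ruby', 'ruble']
print(scoped_upper.group())  # RUBY
print(scoped_lower)          # None
```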