U. Regex - JulTob/Python GitHub Wiki
The regular expression library re must be imported into your program before you can use it. The simplest use of the regular expression library is the search() function. The following program demonstrates a trivial use of the search function.
# Search for lines that contain 'From:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)
We open the file, loop through each line, and use the regular expression search() to only print out lines that contain the string "From:". This program does not use the real power of regular expressions, since we could have just as easily used line.find() to accomplish the same result.
The power of regular expressions comes when we add special characters to the search string that allow us to control more precisely which lines match. Adding these special characters to our regular expression allows us to do sophisticated matching and extraction while writing very little code. For example, the caret character (^) is used in regular expressions to match the beginning of a line. We could change our program to only match lines where "From:" is at the beginning of the line as follows:
# Search for lines that start with 'From:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)
Now we will only match lines that start with the string "From:". This is still a very simple example that we could have done equivalently with the startswith() method from the string library. But it serves to introduce the notion that regular expressions contain special action characters that give us more control as to what will match the regular expression.
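The equivalence with the string methods mentioned above can be checked on a few in-memory lines. This is a minimal sketch: `sample_lines` is made-up data standing in for the contents of mbox-short.txt, and the variable names are only illustrative.

```python
import re

# Hypothetical sample lines standing in for the contents of mbox-short.txt.
sample_lines = [
    "From: alice@example.com",
    "X-Note: forwarded From: bob@example.com",
    "Subject: hello",
]

# Lines containing 'From:' anywhere: unanchored re.search vs. str.find.
contains_regex = [s for s in sample_lines if re.search('From:', s)]
contains_find = [s for s in sample_lines if s.find('From:') != -1]

# Lines starting with 'From:': the caret anchor vs. str.startswith.
starts_regex = [s for s in sample_lines if re.search('^From:', s)]
starts_method = [s for s in sample_lines if s.startswith('From:')]

print(contains_regex == contains_find)  # True
print(starts_regex == starts_method)    # True
print(starts_regex)                     # ['From: alice@example.com']
```

For fixed strings like these, the string methods are the simpler choice; the regular expression only pays off once the pattern needs special characters.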
Notation | Matches | Example | Gets |
---|---|---|---|
literal | Literal string | `foo` | "foo" |
`re1\|re2` | re1 or re2 | `foo\|bar` | "foo", "bar" |
`.` | Any single character | `b.b` | "bab", "bbb", "bcb"... |
`^` | Start of string | `^Dear` | Dear, Dearest, Dearly... |
`$` | End of string | `/bin/*sh$`, `\.jpg$` | `\.jpg$`: names of JPG images |
`^a$` | The whole string is exactly a | `^Text$` | "Text" |
`+` | 1 or more of the preceding item | `[a-z]+\.com` | |
`?` | 0 or 1 of the preceding item | `goo?` | "go", "goo" |
`{N}` | Exactly N occurrences | `[0-9]{3}` | |
`{N,M}` | N to M occurrences | `[0-9]{5,9}` | |
`[...]` | Any single character from the set | `[aeiou]`, `b[aei]t` | `b[aei]t`: bat, bet, bit |
`[..x-y..]` | Any character in the range x to y | `[0-9]`, `[A-Za-z]` | |
`[^...]` | Any single character not in the set | `[^aeiou]`, `[^A-Za-z0-9_]` | |
`(*\|+\|?\|{})?` | Non-greedy version of the preceding quantifier | `.*?[a-z]` | |
`(...)` | Subgroup | `([0-9]{3})?`, `f(oo\|u)bar` | |
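A few of the notations from the table can be tried directly in Python. The snippet below is an illustrative sketch; the test strings are invented for the example and are not taken from the original page.

```python
import re

# Alternation: either branch may match.
alternation = re.findall(r'foo|bar', 'foo bar baz')      # ['foo', 'bar']

# Dot: any single character.
dot = re.findall(r'b.b', 'bab bbb bcb bd')               # ['bab', 'bbb', 'bcb']

# Anchors: ^ ties the match to the start, $ to the end.
starts_with_dear = bool(re.search(r'^Dear', 'Dearest Anna'))   # True
dear_in_middle = bool(re.search(r'^Dear', 'My Dear Anna'))     # False
ends_with_jpg = bool(re.search(r'\.jpg$', 'holiday.jpg'))      # True

# Quantifiers: {N} for an exact count, ? for zero or one.
three_digits = [s for s in ['12', '123', '1234'] if re.fullmatch(r'[0-9]{3}', s)]
optional_o = [s for s in ['g', 'go', 'goo', 'gooo'] if re.fullmatch(r'goo?', s)]

# Character sets and negated sets.
b_words = re.findall(r'b[aei]t', 'bat bet bit but')      # ['bat', 'bet', 'bit']
consonants = re.findall(r'[^aeiou]', 'regex')            # ['r', 'g', 'x']

# Greedy vs. non-greedy: .* takes as much as possible, .*? as little.
greedy = re.search(r'<.*>', '<a><b>').group()            # '<a><b>'
lazy = re.search(r'<.*?>', '<a><b>').group()             # '<a>'

# Subgroup: the parenthesised part can be read back on its own.
inner = re.fullmatch(r'f(oo|u)bar', 'fubar').group(1)    # 'u'

print(three_digits, optional_o)  # ['123'] ['go', 'goo']
```

`re.fullmatch` is used where the whole string should match the pattern, and `re.findall` where every occurrence inside a longer string is wanted.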
Special character | Matches | Example |
---|---|---|
`\d` | Decimal digit | `data\d+\.txt` |
`\D` | Not a decimal digit | |
`\w` | Alphanumeric character or underscore | `[A-Za-z]\w+` |
`\W` | Not an alphanumeric character or underscore | |
`\s` | Whitespace | |
`\S` | Not whitespace | |
`\b` | Word boundary | `\bthe` (a word that starts with "the"), `\bthe\b` (only the word "the") |
`\B` | Not a word boundary | `\Bthe` ("the" inside a word, not at its start) |
`\N` | Backreference to subgroup N | `\16` |
`\A` (`\Z`) | Start (end) of string | |
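The special characters can be exercised on a short sentence. This is a sketch with an invented `text` string; the results in the comments follow from that string only.

```python
import re

# Invented sample text for illustration.
text = "the other data42.txt file, then data7.txt; the end"

# \d+ matches one or more digits; \. matches a literal dot.
files = re.findall(r'data\d+\.txt', text)       # ['data42.txt', 'data7.txt']

# \bthe: "the" at the start of a word (\w* collects the rest of the word).
starts_with_the = re.findall(r'\bthe\w*', text)  # ['the', 'then', 'the']

# \bthe\b: only the standalone word "the".
only_the = re.findall(r'\bthe\b', text)          # ['the', 'the']

# \Bthe: "the" that is not at the start of a word.
contains_the = re.findall(r'\w*\Bthe\w*', text)  # ['other']

# \s+ splits on any run of whitespace.
fields = re.split(r'\s+', 'a  b\tc')             # ['a', 'b', 'c']

# \1 refers back to whatever subgroup 1 matched (here a doubled word).
doubled = re.search(r'\b(\w+) \1\b', 'it is is fine').group()  # 'is is'

# \A and \Z anchor to the start and end of the whole string.
at_start = bool(re.search(r'\Athe', text))       # True
at_end = bool(re.search(r'end\Z', text))         # True

print(files, only_the, contains_the)
```

The patterns are written as raw strings (`r'...'`) so that Python passes sequences such as `\b` to the `re` module unchanged instead of interpreting them as string escapes.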
Extensions | Explanation | Example |
---|---|---|
`(?iLmsux)` | Inline flags | `(?x)`, `(?im)` |
`(?:...)` | Non-capturing group | `(?:\w.)*` |
`(?P<name>...)` | Group captured under the given name | `(?P<data>...)` |
`(?P=name)` | Matches the text previously matched by the group with that name | `(?P=data)` |
`(?#...)` | Comment | `(?#comment)` |
`(?=...)` | Positive lookahead assertion | `(?=.com)` |
`(?<=...)` | Positive lookbehind assertion | `(?<=800-)` |
`(?<!...)` | Negative lookbehind assertion | `(?<!192\.168\.)` |
`(?(id/name)Y\|N)` | Conditional match of regex Y if the group with the given id or name exists, else N; `\|N` is optional | `(?(1)y\|x)` |
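The extension notations are easier to read with a concrete input next to each one. The following is a hedged sketch; the sample strings, the group name `word`, and the variable names are assumptions made for the example.

```python
import re

# (?i): inline flag, here case-insensitive matching.
inline_flag = bool(re.search(r'(?i)from:', 'FROM: alice@example.com'))  # True

# (?:...) groups without capturing, so only the last (...) is reported.
non_capturing = re.fullmatch(r'(?:\w+\.)*(\w+)', 'www.example.com').groups()  # ('com',)

# (?P<name>...) names a group; (?P=name) matches the same text again.
repeated = re.search(r'(?P<word>\w+) (?P=word)', 'hello hello world').group('word')

# (?#...) embeds a comment that the regex engine ignores.
with_comment = bool(re.fullmatch(r'\d{3}(?#three digits)', '123'))  # True

# (?=...): match only if followed by ".com", without consuming it.
before_com = re.findall(r'\w+(?=\.com)', 'example.com other.org')   # ['example']

# (?<=...): match only if preceded by "800-".
after_800 = re.findall(r'(?<=800-)\d+', '800-555 900-444')          # ['555']

# (?<!...): match only if NOT preceded by "192.168.".
addresses = ['10.0.1.1', '192.168.1.1']
not_private = [ip for ip in addresses
               if re.search(r'(?<!192\.168\.)1\.1$', ip)]           # ['10.0.1.1']

# (?(1)...): require ">" only if group 1 (the opening "<") matched.
bracket = re.compile(r'^(<)?(\w+)(?(1)>)$')
balanced = [s for s in ['user', '<user>', '<user', 'user>'] if bracket.search(s)]

print(repeated)   # hello
print(balanced)   # ['user', '<user>']
```

Lookahead and lookbehind assertions check the surrounding text without including it in the match, which is why `before_com` holds "example" rather than "example.com".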
The pipe symbol `|`, a vertical bar on your keyboard, indicates an alternation operation. It is used to separate different regular expressions. For example, the following are some patterns that employ alternation, along with the strings they match:
Regex Pattern | Strings Matched |
---|---|
`at\|home` | at, home |
`r2d2\|c3po` | r2d2, c3po |
`bat\|bet\|bit` | bat, bet, bit |
With this one symbol, we have just increased the flexibility of our regular expressions, enabling the matching of more than just one string. Alternation is also sometimes called union or logical OR.
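The alternation patterns from the table can be checked with a short script. This is a small sketch; the input strings and the extra word "but" are invented to show a non-match.

```python
import re

# Each alternative is tried in turn at every position of the input.
at_home = re.findall(r'at|home', 'at home')            # ['at', 'home']
droids = re.findall(r'r2d2|c3po', 'r2d2 and c3po')     # ['r2d2', 'c3po']

# fullmatch requires the whole word to equal one of the alternatives.
words = ['bat', 'bet', 'bit', 'but']
matched = [w for w in words if re.fullmatch(r'bat|bet|bit', w)]

# The same set of words, written with the alternation limited to one letter.
matched_grouped = [w for w in words if re.fullmatch(r'b(?:a|e|i)t', w)]

print(matched)                     # ['bat', 'bet', 'bit']
print(matched == matched_grouped)  # True
```

Alternation applies to everything on either side of the pipe, so parentheses are needed when only part of the pattern should vary.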
Pattern | Explanation | Matched |
---|---|---|
`f.o` | Any character between "f" and "o" | fao, fqo, fxo... |
`..` | Any pair of characters | aa, ab, ac ... xx |
`.end` | Any character before the string "end" | bend, send, tend... |
To match a literal dot character, escape it with a backslash: `\.`
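The difference between the dot as a wildcard and an escaped, literal dot shows up clearly in a short experiment. This is an illustrative sketch with invented inputs; `re.DOTALL` is the standard flag that lets the dot match a newline as well.

```python
import re

# The dot stands for exactly one character, including a space.
any_char = [s for s in ['fao', 'fqo', 'f o', 'fo'] if re.fullmatch(r'f.o', s)]
before_end = re.findall(r'.end', 'bend the end send')   # ['bend', ' end', 'send']

# An escaped dot matches only a literal ".".
literal_dot = re.findall(r'\.end', 'week.end bend')     # ['.end']

# By default the dot does not match a newline; re.DOTALL changes that.
newline_default = re.search(r'a.b', 'a\nb')             # None
newline_dotall = re.search(r'a.b', 'a\nb', re.DOTALL)   # a match object

print(any_char)                 # ['fao', 'fqo', 'f o']
print(newline_default is None)  # True
```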