Regex - haltosan/RA-python-tools GitHub Wiki

How To

Regex Syntax

An explanation from the beginning! For these examples, assume the regex must match the whole string. (Depending on the method you use, this may not be the case -- you can also match parts of strings. This is just to show what the regex itself would match.)

Characters:

| Regex | Will Match... | Won't Match... | Explanation |
|---|---|---|---|
| a | 'a' | 'b', 'ab', 'cat', etc. | Written plainly, characters will match only themselves once (unless they have a special meaning). |
| . | 'a', 'b', etc. | 'aa', '\n', etc. | The '.' symbol matches any single character except for a newline. See cheat sheet for more symbols. |
| cat | 'cat' | 'ca', 'cats', etc. | The regex must match all plainly-written characters (all or nothing). |
| [ab] | 'a', 'b' | 'c', 'ab', etc. | The [] define a character class. Any character inside is an option while matching. |
| [a-z] | 'a', 'b'...'z' | 'A', 'zz', etc. | Character classes can hold ranges. Any single character from a-z will match (case sensitive). |
| [A-Za-z] | 'A'...'Z', 'a'...'z' | 'Aa', '?', etc. | Character classes can hold multiple ranges. Any single character from A-Z or a-z will match. |
| [^a-z] | 'A', '?', etc. | 'a'...'z' | A ^ at the start of a character class negates it, so this class matches any single character that's not from a-z. |

Alternation and Quantifiers:

| Regex | Will Match... | Won't Match... | Explanation |
|---|---|---|---|
| a\|b | 'a', 'b' | 'ab', 'ba' | The '\|' symbol means match a or b (only one or the other). This is alternation rather than a quantifier. |
| a* | '', 'a', 'aa'... | 'b', '?', etc. | The '*' symbol matches zero or more of whatever is right in front of it, as many as possible. |
| a+ | 'a', 'aa'... | '', 'b', etc. | The '+' symbol matches one or more of whatever is right in front of it, as many as possible. |
| a? | '', 'a' | 'aa', 'b', etc. | The '?' means optional, matching zero or one of whatever is right in front of it, as many as possible. |
| a{2} | 'aa' | '', 'a', 'aaa', etc. | The {} match exactly the specified number of whatever is right in front of them. |
| a{2,3} | 'aa', 'aaa' | '', 'a', 'aaaa', etc. | The {} can also hold a range of numbers, matching as many as possible within that range. {2,} represents 2 or more (no upper limit). |
| *?, +?, ??, {}? | | | Adding a '?' symbol onto a quantifier makes it match the lowest number of times it can ('lazy' instead of 'greedy'). |
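
The rows above can be checked directly in Python. The following is a minimal sketch, assuming `re.fullmatch` is used so that each regex must match the whole string, as the table does. The `cases` list and the `check_table` helper are illustrative names made up for this example.

```python
import re

# Each entry: (regex, strings that should match, strings that should not).
# The rows mirror the syntax table; this list is only an illustration.
cases = [
    (r"a", ["a"], ["b", "ab", "cat"]),
    (r".", ["a", "b"], ["aa", "\n"]),
    (r"cat", ["cat"], ["ca", "cats"]),
    (r"[ab]", ["a", "b"], ["c", "ab"]),
    (r"[a-z]", ["a", "z"], ["A", "zz"]),
    (r"[A-Za-z]", ["A", "z"], ["Aa", "?"]),
    (r"[^a-z]", ["A", "?"], ["a", "z"]),
    (r"a|b", ["a", "b"], ["ab", "ba"]),
    (r"a*", ["", "a", "aa"], ["b", "?"]),
    (r"a+", ["a", "aa"], ["", "b"]),
    (r"a?", ["", "a"], ["aa", "b"]),
    (r"a{2}", ["aa"], ["", "a", "aaa"]),
    (r"a{2,3}", ["aa", "aaa"], ["", "a", "aaaa"]),
]


def check_table(rows):
    """Return the (regex, text) pairs whose result disagrees with the row."""
    mismatches = []
    for regex, should_match, should_not_match in rows:
        for text in should_match:
            if re.fullmatch(regex, text) is None:
                mismatches.append((regex, text))
        for text in should_not_match:
            if re.fullmatch(regex, text) is not None:
                mismatches.append((regex, text))
    return mismatches


print(check_table(cases))  # [] when every row behaves as described

# Greedy vs. lazy: the same quantifier with and without a trailing '?'.
greedy = re.search(r"<.+>", "<a><b>").group()   # '<a><b>'
lazy = re.search(r"<.+?>", "<a><b>").group()    # '<a>'
```

The greedy `.+` grabs as much as it can before the final `>`, while the lazy `.+?` stops at the first `>` it reaches.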

The table below uses '...' to represent any regular expression put inside the grouping symbols.

Common Groups

| Grouping Symbol | Name | Explanation |
|---|---|---|
| (...) | Capturing Group | Whatever the regex inside the () matches will be stored in its own string. When you get a match, you can access what each group captured separately, ex. match.group(1), match.group(2), etc. |
| (?P<name>...) | Named Capturing Group | The same as a capturing group, except you can access it by name later, ex. match.group('name') |
| (?:...) | Non-capturing Group | Use the regex inside to identify a match, but don't actually store what it matches in a group. Useful for keywords in the text (ex. 'born on') that aren't important data themselves. |

Complex Groups

| Grouping Symbol | Name | Explanation |
|---|---|---|
| (?P=name) | Backreference | Match only to what a previous named group captured. For example, if (?P<name>...) captured 'Scott' in the text, match to 'Scott'. (?P<word>...) (?P=word) would match to 'apple apple' but not 'apple orange'. |
| (?<=...) | Positive Lookbehind | Only match if the contents of the group appear beforehand. For example, (?<=apple )orange would match to 'orange' if the text was 'apple orange', but match to nothing if the text was just 'orange'. Basically, "Match, but only if this is right before it." Nothing in the lookbehind is included in the actual match. |
| (?<!...) | Negative Lookbehind | The opposite of a Positive Lookbehind. Basically, "Don't match if this is right before it." |
| (?=...) | Positive Lookahead | Match only if the contents of the group come right after the match. For example, apple(?= orange) would match 'apple' when the text was 'apple orange' but nothing if the text was just 'apple'. Basically, "Match, but only if this comes right after." Nothing in the lookahead is included in the match. |
| (?!...) | Negative Lookahead | The opposite of a Positive Lookahead. Basically, "Don't match if this comes right after." |
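
To see the grouping symbols in action, here is a short sketch using Python's `re` module. The sample sentences and the variable names (`born_re`, `repeat_re`, and so on) are made up for illustration.

```python
import re

# Non-capturing vs. named capturing group: 'born on ' helps locate the
# date but is not stored; only the date lands in a group.
born_re = re.compile(r"(?:born on )(?P<date>\d{1,2} [A-Z][a-z]+ \d{4})")
born = born_re.search("Mary was born on 22 July 1965 in Ohio")
whole_match = born.group(0)       # 'born on 22 July 1965'
date_only = born.group("date")    # '22 July 1965' (same as born.group(1))

# Backreference: the second word must repeat whatever 'word' captured.
repeat_re = re.compile(r"(?P<word>[a-z]+) (?P=word)")
repeated = repeat_re.fullmatch("apple apple") is not None      # True
different = repeat_re.fullmatch("apple orange") is not None    # False

# Lookbehind: 'orange' only counts when 'apple ' is right before it.
behind_hit = re.search(r"(?<=apple )orange", "apple orange")   # matches 'orange'
behind_miss = re.search(r"(?<=apple )orange", "orange")        # None
neg_behind = re.search(r"(?<!apple )orange", "apple orange")   # None

# Lookahead: 'apple' only counts when ' orange' comes right after it.
ahead_hit = re.search(r"apple(?= orange)", "apple orange")     # matches 'apple'
ahead_miss = re.search(r"apple(?= orange)", "apple")           # None
neg_ahead = re.search(r"apple(?! orange)", "apple pie")        # matches 'apple'
```

Note that the lookaround matches contain only 'orange' or 'apple'. The text inside the lookbehind or lookahead is checked but never included.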

Building a Regular Expression

Walkthrough for Structured Text (ex. directories, phone books, anything with a strict format)

Let's say we've got document output that looks like this:

Charles, Mary Ann 22 July 1965

de la Cruz, Lucas 5 August 2004

The format may not be pretty to human eyes, but it's perfect for regex! Below are some steps to building a regular expression.

First, figure out the data you want to capture, and make groups for each piece. Looking at the lines, we see each one has the structure 'lastName, firstName day month year'. If we want to store each piece of information separately, we'll need a named capturing group for each one.

Our starting regex looks like this: (?P<lastName>)(?P<firstName>)(?P<day>)(?P<month>)(?P<year>)

Second, look at the format of each group (and pay attention to when they start and stop).

  • lastName: Starts at the beginning of each line. May be one or multiple words, which may be capitalized or non-capitalized. Stops at the comma that separates it from the first name.
  • firstName: Starts after the comma that follows lastName. May be multiple words, and to be safe, we'll assume there are both capitalized and non-capitalized words. Stops right before the day.
  • day: Starts right after firstName. Will consist of a single 1- or 2-digit number (ex. '22' or '5'). Stops right before the month.
  • month: Starts right after day. Will consist of a single capitalized word. Stops right before the year.
  • year: Starts right after month. Will consist of a single 4-digit number. Stops at the end of the line.

Third, fill each group with regex that will match the data.

  • lastName: multiple capitalized/uncapitalized words can be captured with [A-Za-z ]+. We don't need to worry about leading or trailing spaces because it starts at the beginning of the line and ends at a comma.
  • firstName:
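
One way the finished expression might look is sketched below. Only the lastName pattern comes from the walkthrough. The patterns for firstName, day, month, and year are assumptions based on the format notes in the second step, and `record_re` is a made-up name. Treat it as a starting point to adjust, not the definitive regex for this data.

```python
import re

# lastName uses the pattern from the walkthrough; the remaining groups are
# assumptions made for this sketch.
record_re = re.compile(
    r"(?P<lastName>[A-Za-z ]+), "   # one or more words, ends at the comma
    r"(?P<firstName>[A-Za-z ]+) "   # assumed: one or more words before the day
    r"(?P<day>\d{1,2}) "            # assumed: 1 or 2 digits ('22' or '5')
    r"(?P<month>[A-Z][a-z]+) "      # assumed: a single capitalized word
    r"(?P<year>\d{4})"              # assumed: exactly 4 digits
)

lines = [
    "Charles, Mary Ann 22 July 1965",
    "de la Cruz, Lucas 5 August 2004",
]

records = []
for line in lines:
    # fullmatch requires the whole line to fit the format.
    match = record_re.fullmatch(line)
    if match is not None:
        records.append(match.groupdict())

for record in records:
    print(record)
```

The firstName group is greedy and would swallow the trailing space, but the literal space required before the day forces it to give that space back, so 'Mary Ann' is captured without padding.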

Example 1:

Let's say we need to capture all the canned food from a grocery list. Here's an example of the list:

canned corn, cabbage, onions, canned refried beans, bread

We see that all canned food starts with "canned" and ends just before the ",".

  • We can capture the start of our target with "canned ".
  • Next, we need to capture all the words that follow (letters a-z and spaces). This can be done with "[a-z ]+"
  • Lastly, we need to end at ",". This is automatically done with "[a-z ]+" because "," isn't in our range, so it can't be captured.
  • The final regex is "canned [a-z ]+"

We can make this a named capture group by converting it to "(?P<cannedFood>canned [a-z ]+)". If you want to see this in action, paste it into regex101, which will run the expression and explain each part of it.
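
The same expression can be run in Python. This is a small sketch using the grocery list from the example; the variable names are illustrative.

```python
import re

canned_re = re.compile(r"(?P<cannedFood>canned [a-z ]+)")

grocery_list = "canned corn, cabbage, onions, canned refried beans, bread"

# With a single capturing group, findall returns what that group captured.
canned_items = canned_re.findall(grocery_list)
print(canned_items)  # ['canned corn', 'canned refried beans']
```

Each match stops at the comma because ',' is not in the character class `[a-z ]`.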

Example 2:

Testing Regex

Using in Python

Compiling into a Regular Expression Object

name_of_regex = re.compile(r'your regex here')

ex. image_re = re.compile(r'-image:(?P<imageName>[^\n]+)')
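
A short runnable sketch of the compile step, including the `import re` line it needs. The sample text is a made-up input for illustration.

```python
import re

# Compile once, then reuse the Pattern object as often as needed.
image_re = re.compile(r'-image:(?P<imageName>[^\n]+)')

# Hypothetical input: a line naming an image, followed by other text.
sample_text = "-image:page_001.jpg\nCharles, Mary Ann 22 July 1965"

image_match = image_re.search(sample_text)
if image_match is not None:
    # [^\n]+ stops at the newline, so only the file name is captured.
    print(image_match.group('imageName'))  # page_001.jpg
```

The `r'...'` prefix makes the string raw, so backslashes such as `\n` reach the regex engine unchanged.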

Finding Matches with Regular Expression Objects

Here are a few common methods for running your regex (after you've compiled it into a regular expression object as described above, referred to as Pattern below). The information comes from the Python re module docs, which have much more information.

Pattern.match(): If the regex matches at the beginning of the string, return that match as a match object (or return None if no match is present). When you want to search for a match anywhere in the string, use search() instead.

Pattern.search(): Returns the first match in the text as a match object (returns None if there are no matches). Example: using date_match = date_re.search(text) would set date_match to the first match from the regex date_re in the variable text
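
The difference between match() and search() is easiest to see side by side. In this sketch, `date_re` and `text` are hypothetical stand-ins for your own regex and input.

```python
import re

# Hypothetical date regex: day, capitalized month, 4-digit year.
date_re = re.compile(r"\d{1,2} [A-Z][a-z]+ \d{4}")

text = "Born 22 July 1965, died 5 August 2004"

# match() only looks at the very beginning of the string.
at_start = date_re.match(text)      # None, because the text starts with 'Born'

# search() scans forward and returns the first match anywhere.
date_match = date_re.search(text)   # match object for '22 July 1965'

if date_match is not None:
    print(date_match.group())       # 22 July 1965
```

Because both methods can return None, it is worth checking the result before calling `.group()` on it.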

Pattern.findall(): Returns all non-overlapping matches in the string, in a few different formats depending on the situation:

  • Regex does not contain any capturing groups: returns a list of strings, where each string is the entire match to the regex
  • Regex contains one capturing group: returns a list of strings, where each string is only the capturing group part of the corresponding match
  • Regex contains multiple capturing groups: returns a list of tuples, where each tuple contains all the capturing groups of the match as strings
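
The three return formats of findall() can be compared on one piece of text. This is a minimal sketch; the sample text and patterns are made up for illustration.

```python
import re

text = "22 July 1965 and 5 August 2004"

# No capturing groups: a list of whole matches.
whole = re.findall(r"\d{1,2} [A-Z][a-z]+ \d{4}", text)
# ['22 July 1965', '5 August 2004']

# One capturing group: a list of just that group's text.
days = re.findall(r"(\d{1,2}) [A-Z][a-z]+ \d{4}", text)
# ['22', '5']

# Multiple capturing groups: a list of tuples, one tuple per match.
parts = re.findall(r"(\d{1,2}) ([A-Z][a-z]+) (\d{4})", text)
# [('22', 'July', '1965'), ('5', 'August', '2004')]
```

If you want the whole match and the groups together, `finditer()` returns match objects instead of strings or tuples.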

Using Match Objects

match() and search() return match objects. Below are some of the common methods of match objects from the Python re module docs. Here's some example code to use as we go:

name_re = re.compile(r'(?P<firstName>[A-Z][a-z]+) (?P<lastName>[A-Z][a-z]+)')

match = name_re.match('Albert Einstein')

Match.group(): returns capturing groups from the match in a few different ways:

  • match.group() or match.group(0): returns the whole match, ex. 'Albert Einstein'

  • match.group(1), match.group(2), etc: returns the specified capturing group from the match. group(1) refers to the first capturing group from the regex, firstName, so match.group(1) would return 'Albert'

  • match.group(1,2): returns a tuple containing all the specified capturing groups, ex. ('Albert', 'Einstein')

  • match.group('nameOfCapturingGroup'): returns the capturing group with the specified name, ex. match.group('lastName') would return 'Einstein'

Match.groups(): returns a tuple containing all the capturing groups

Match.groupdict(): returns a dictionary containing all the named capturing groups and the text they matched with, ex. {'firstName': 'Albert', 'lastName': 'Einstein'}
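
Putting the match object methods together, using the same `name_re` example as a runnable sketch:

```python
import re

name_re = re.compile(r'(?P<firstName>[A-Z][a-z]+) (?P<lastName>[A-Z][a-z]+)')

match = name_re.match('Albert Einstein')

if match is not None:
    whole = match.group()               # 'Albert Einstein' (same as group(0))
    first = match.group(1)              # 'Albert'
    both = match.group(1, 2)            # ('Albert', 'Einstein')
    last = match.group('lastName')      # 'Einstein'
    all_groups = match.groups()         # ('Albert', 'Einstein')
    as_dict = match.groupdict()         # {'firstName': 'Albert', 'lastName': 'Einstein'}
    print(as_dict)
```

`groupdict()` is often the most convenient of these when every group is named, since the keys carry the meaning of each piece.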

Helpful Regex to Steal

No need to reinvent the wheel...here's some basic regex you can copy and alter to fit your needs.

Date Regex

Matches to a date in '[month] [day], [year]' format with both full and abbreviated month names, ex. "January 22, 1965" or "Aug. 3, 2006". Named capturing groups are called 'month', 'day', and 'year'.

(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)\.? (?P<day>\d{1,2}), (?P<year>\d{4})
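
A quick sketch of the date regex in use. The sample sentence is made up, and the pattern is split across several string literals only to keep the lines short.

```python
import re

date_re = re.compile(
    r"(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec"
    r"|January|February|March|April|May|June|July|August|September"
    r"|October|November|December)"
    r"\.? (?P<day>\d{1,2}), (?P<year>\d{4})"
)

text = "Born January 22, 1965; moved on Aug. 3, 2006."

# finditer yields one match object per date found in the text.
dates = [m.groupdict() for m in date_re.finditer(text)]
for date in dates:
    print(date)
```

The optional period after an abbreviation sits outside the month group, so 'Aug.' is captured as 'Aug'.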

Name Regex

Matches to a person's name (and inadvertently to other proper nouns, but that's regex for you). Can capture names with initials (but there must be a first name), apostrophes and multiple capitalized letters in the surname (ex. O'Brian), and nicknames (ex. 'Slim' Jim Sullivan, Gladys "Glad" MacDonald). Named capturing groups are called 'firstNames' and 'lastName'.

(?P<firstNames>(?:[\'\"]?[A-Z][\'\"A-Za-z]+ )+(?:[A-Z]\. )*)(?P<lastName>[A-Z][\'A-Za-z]+)
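
Here is a sketch of the name regex applied to a few sample names. In this version the dot after an initial is written as `\.` so it matches only a literal period (an unescaped `.` would match any character). The sample names and the `split_name` helper are made up for illustration.

```python
import re

name_re = re.compile(
    r"(?P<firstNames>(?:[\'\"]?[A-Z][\'\"A-Za-z]+ )+(?:[A-Z]\. )*)"
    r"(?P<lastName>[A-Z][\'A-Za-z]+)"
)


def split_name(text):
    """Return (firstNames, lastName) for the first name found, or None."""
    match = name_re.search(text)
    if match is None:
        return None
    # firstNames ends with the space before the surname, so strip it.
    return match.group("firstNames").strip(), match.group("lastName")


print(split_name("John F. Kennedy"))          # ('John F.', 'Kennedy')
print(split_name("Pat O'Brian"))              # ('Pat', "O'Brian")
print(split_name('Gladys "Glad" MacDonald'))  # ('Gladys "Glad"', 'MacDonald')
```

The firstNames group always includes a trailing space, which is why the helper strips it before returning.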

City, State Regex

Matches to a location in the 'City, State' format (ex. "Salt Lake City, Utah" or "Charleston, West Virginia"). Can capture cities and states that are multiple words long. Doesn't capture names with uncapitalized words or most special characters, so add functionality if needed. Named capturing groups are called 'city' and 'state'.

(?P<city>(?:[A-Z][-\'A-Za-z ]+)+), (?P<state>(?:\s?[A-Z][-\'A-Za-z]+)+)
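
A sketch of the 'City, State' regex run against the two examples from the description. Each sample string contains only the location itself, and `location_re` is a made-up name.

```python
import re

location_re = re.compile(
    r"(?P<city>(?:[A-Z][-\'A-Za-z ]+)+), (?P<state>(?:\s?[A-Z][-\'A-Za-z]+)+)"
)

samples = ["Salt Lake City, Utah", "Charleston, West Virginia"]

locations = []
for sample in samples:
    match = location_re.search(sample)
    if match is not None:
        locations.append((match.group("city"), match.group("state")))

print(locations)
# [('Salt Lake City', 'Utah'), ('Charleston', 'West Virginia')]
```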

Proper Nouns (2+ Words)

Matches to proper nouns if they are two words or longer, which prevents matching to the capitalized word at the beginning of every sentence. (Replace '{2,}' with '+' if you would like to match to every group of capitalized words, including groups of 1.) May accidentally pick up the beginning words anyway if a proper noun follows -- ex. would match to "On Tuesday" or "Every Christmas". Halts at all punctuation except an apostrophe at the moment; this can be changed to suit your needs.

(?P<properNouns>(?:[A-Z][A-Za-z']* ??){2,})
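
A brief sketch of the proper noun regex on two made-up sentences, each containing a two-word run of capitalized words. The second sentence shows the sentence-opening pickup mentioned above.

```python
import re

proper_re = re.compile(r"(?P<properNouns>(?:[A-Z][A-Za-z']* ??){2,})")

# A two-word name is matched; the lone capitalized word 'Ulm' is not.
names = proper_re.findall("Albert Einstein was born in Ulm.")
print(names)   # ['Albert Einstein']

# The first word of a sentence is picked up when a proper noun follows it.
opener = proper_re.findall("On Tuesday we left.")
print(opener)  # ['On Tuesday']
```

It is worth testing this regex on your own text before relying on it, especially for longer runs of capitalized words.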
