Regex Notes - edorlando07/datasciencecoursera GitHub Wiki

BEGINNER EXAMPLES

1234
\d+
\d+ finds one or more consecutive digits. The Regex code is a match.

1234 567890
^\d+$
The ^ indicates the start of a line. The $ indicates the end of a line. The Regex code is not a match because the whole text must consist only of digits, and the space breaks the run.

1234567890
^\d+$
The Regex code is a match.

ABCD1234567890
^\d+$
The ^ indicates the start of a line. The $ indicates the end of a line. The Regex code is not a match because the whole text must consist only of digits, and the leading letters ABCD are not digits.
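
The three anchored examples above can be checked with a minimal sketch. It assumes Python's built-in re module (the notes do not name a regex engine); the variable names are illustrative only.

```python
import re

samples = ["1234 567890", "1234567890", "ABCD1234567890"]

# Unanchored: \d+ succeeds wherever a run of digits appears.
unanchored = [bool(re.search(r"\d+", s)) for s in samples]

# Anchored: ^\d+$ succeeds only if the whole text is digits.
anchored = [bool(re.search(r"^\d+$", s)) for s in samples]

print(unanchored)  # [True, True, True]
print(anchored)    # [False, True, False]
```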

1234
\b\d+\b
\b matches at a word boundary, meaning the start or the end of a word. The Regex code is a match.

1234 567890
\b\d+\b
\b matches at a word boundary, meaning the start or the end of a word. The Regex code is a match for 1234 and for 567890, independently.

ABCD1234
\b\d+\b
\b matches at a word boundary, meaning the start or the end of a word. The Regex code is not a match since the numbers and letters are part of the same word, so there is no boundary between D and 1.

NY Postal Codes are 10001, 10002, 10003, 10004
\b\d+\b
\b matches at a word boundary, meaning the start or the end of a word. The Regex code is a match for 10001, 10002, 10003, 10004, independently.
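
A short sketch of the word-boundary examples above, again assuming Python's re module. The name word_digits is made up for illustration.

```python
import re

word_digits = re.compile(r"\b\d+\b")

# Each run of digits that stands alone as a word is returned separately.
two_numbers = word_digits.findall("1234 567890")
glued = word_digits.findall("ABCD1234")
codes = word_digits.findall("NY Postal Codes are 10001, 10002, 10003, 10004")

print(two_numbers)  # ['1234', '567890']
print(glued)        # []
print(codes)        # ['10001', '10002', '10003', '10004']
```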

Timestamp=20160502
\d{8}
{8} repeats the preceding token exactly 8 times, so \d{8} finds exactly 8 consecutive digits. The Regex code is a match for 20160502.

Timestamp=20160502
\d{4}\d{2}\d{2}
{n} repeats the preceding token exactly n times, so this pattern finds 4 + 2 + 2 = 8 consecutive digits. The Regex code is a match for 20160502.

Timestamp=20160502
(\d{4})(\d{2})(\d{2})
{n} repeats the preceding token exactly n times.
() defines independent groups of patterns (capture groups).
The Regex code is a match for 20160502, captured as three groups: 2016, 05, 02.

Timestamp=20160502
(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})
{n} repeats the preceding token exactly n times.
() defines independent groups of patterns (capture groups).
(?P<name>...) names a group, so the three groups are identified as year, month, day.
The Regex code is a match for 20160502, captured as 2016, 05, 02.
2016 = year, 05 = month, 02 = day.
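
To read the named groups back out, a minimal sketch assuming Python's re module, which supports the (?P<name>...) syntax used above:

```python
import re

text = "Timestamp=20160502"
pattern = re.compile(r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})")

match = pattern.search(text)
parts = match.groupdict()  # group name -> captured text

print(match.group(0))  # 20160502
print(parts)           # {'year': '2016', 'month': '05', 'day': '02'}
# Named groups can still be read by position.
print(match.group("year"), match.group(2))  # 2016 05
```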

Widget Unit cost: 12,000.56 dollars
Taxes: 234.00 dollars
Total: 12,234.56 dollars
(?P<value>\d+(,\d{3})*(\.\d{2})?)\s+dollar(s)?
value for 1st line is 12,000.56
value for 2nd line is 234.00
value for 3rd line is 12,234.56
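
The dollar-amount example can be sketched as follows, assuming Python's re module. Converting the captured text to a number is an extra step that is not part of the notes and is shown only as one possible follow-up.

```python
import re

lines = """Widget Unit cost: 12,000.56 dollars
Taxes: 234.00 dollars
Total: 12,234.56 dollars"""

pattern = re.compile(r"(?P<value>\d+(,\d{3})*(\.\d{2})?)\s+dollar(s)?")

# One match per line; the named group holds the amount as text.
values = [m.group("value") for m in pattern.finditer(lines)]
print(values)  # ['12,000.56', '234.00', '12,234.56']

# Optional follow-up: drop the thousands separators and parse as float.
amounts = [float(v.replace(",", "")) for v in values]
```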

SPLIT LINES

Here is the list...1.soccer 2. tennis 3.basketball 4. cricket
\d+\.\s*
\s* matches zero or more white-space characters, so the split works with or without a space after the period.
The split would provide the following:
Here is the list...
soccer
tennis
basketball
cricket
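
A sketch of the split, assuming Python's re.split. Note that the space before each following number stays attached to the preceding item, so a strip step is added here to get the clean list shown above.

```python
import re

text = "Here is the list...1.soccer 2. tennis 3.basketball 4. cricket"

# Split wherever a number, a period, and optional white space appear.
parts = re.split(r"\d+\.\s*", text)
print(parts)
# ['Here is the list...', 'soccer ', 'tennis ', 'basketball ', 'cricket']

# Trim the trailing spaces left over from the original text.
cleaned = [p.strip() for p in parts]
```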

CASE INSENSITIVE OPTION

this is a test
(?i)A
The (?i) flag makes the Regex case insensitive, so it finds every A regardless of case.

SPECIFY AN OR CONDITION (USE PIPE DELIMITERS)

this is the biggest test
b|i|g
[big]
Both patterns find every b, i, and g, independently. The character class [big] is the shorter way to write an OR between single characters.
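
A quick check that the two forms agree, sketched with Python's re module (an assumption, since the notes do not name an engine):

```python
import re

text = "this is the biggest test"

by_pipe = re.findall(r"b|i|g", text)
by_class = re.findall(r"[big]", text)

print(by_pipe)              # ['i', 'i', 'b', 'i', 'g', 'g']
print(by_pipe == by_class)  # True
```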

FIND ALL CHARACTERS THAT ARE NOT VOWELS OR WHITE SPACE

this ^ is a big test
[^aeiou ]
The caret symbol (^) needs to be the first character inside the brackets to act as a negator. Anywhere else in the brackets it is a literal caret.

FIND LETTERS WITHIN A RANGE

this is a definitive test
[a-d]

FIND LETTERS WITHIN MULTIPLE RANGES

x-ray 3 won't for this test
[a-dx-z0-3]

FIND LETTERS THAT ARE NOT WITHIN MULTIPLE RANGES

x-ray 3 won't for this test
[^a-dx-z0-3]
Use ^ as negator
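
The character-class examples above (negation, ranges, negated ranges) can be tried together in a small sketch. It assumes Python's re module; the variable names are illustrative.

```python
import re

vowel_text = "this ^ is a big test"

# A leading ^ negates the class: everything except vowels and spaces.
non_vowels = "".join(re.findall(r"[^aeiou ]", vowel_text))
print(non_vowels)  # ths^sbgtst

# A ^ that is not first is a literal caret inside the class.
with_caret = re.findall(r"[aeiou^]", vowel_text)
print(with_caret)  # ['i', '^', 'i', 'a', 'i', 'e']

range_text = "x-ray 3 won't for this test"
inside = re.findall(r"[a-dx-z0-3]", range_text)    # within the ranges
outside = re.findall(r"[^a-dx-z0-3]", range_text)  # everything else
print(inside)  # ['x', 'a', 'y', '3']
```

Every character lands in exactly one of the two lists, so the lengths of inside and outside add up to the length of the text.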

FIND . WITHIN TEXT (FULL STOPS)

this. is. a. test
\.
[.]

FIND TABS WITHIN TEXT

this is a test
\t

WORD BOUNDARIES

catalog of log
\blog\b
Matches only the standalone word log, not the log inside catalog.
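
A sketch comparing the pattern with and without the boundaries, assuming Python's re module. The start positions show which occurrence is found.

```python
import re

text = "catalog of log"

# Without \b, the "log" inside "catalog" is found too.
plain = [m.start() for m in re.finditer(r"log", text)]

# With \b on both sides, only the standalone word is found.
bounded = [m.start() for m in re.finditer(r"\blog\b", text)]

print(plain)    # [4, 11]
print(bounded)  # [11]
```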

FIND APPLE AT BEGINNING OF THIS SENTENCE

apple grows on apple trees
^apple

FIND EVERY APPLE AT THE BEGINNING OF THESE SENTENCES

apple 1 grows on apple trees
apple 2 grows on apple trees
(?m)^apple
(?m) turns on multi-line mode, so ^ matches at the start of every line instead of only at the start of the whole text.
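
The effect of multi-line mode on ^ can be sketched as follows, assuming Python's re module, where (?m) is equivalent to passing re.MULTILINE:

```python
import re

text = "apple 1 grows on apple trees\napple 2 grows on apple trees"

# Without (?m), ^ matches only at the very start of the text.
single = [m.start() for m in re.finditer(r"^apple", text)]

# With (?m), ^ also matches right after each newline.
multi = [m.start() for m in re.finditer(r"(?m)^apple", text)]

print(single)  # [0]
print(multi)   # [0, 29]
```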

FIND APPLE AT END OF THIS SENTENCE

apple grows on apple
apple$

FIND ALL APPLES AT END OF BOTH LINES

apple grows on apple
apple grows on apple
(?m)apple$

FIND ALL APPLES AT END OF BOTH LINES WITH MIXED CASE (CASE INSENSITIVE)

apple grows on aPple
apple grows on appLE
(?mi)apple$
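
A sketch showing why both flags are needed for the mixed-case sample, assuming Python's re module. The last line uses the flag constants re.MULTILINE and re.IGNORECASE as an alternative spelling of (?mi).

```python
import re

text = "apple grows on aPple\napple grows on appLE"

no_flags = re.findall(r"apple$", text)       # wrong case at the end
only_m = re.findall(r"(?m)apple$", text)     # still case sensitive
only_i = re.findall(r"(?i)apple$", text)     # $ is end of text only
both = re.findall(r"(?mi)apple$", text)      # every line, any case

print(no_flags, only_m)  # [] []
print(only_i)            # ['appLE']
print(both)              # ['aPple', 'appLE']

# Same result with flag constants instead of the inline (?mi).
same = re.findall(r"apple$", text, re.MULTILINE | re.IGNORECASE)
```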

FIND ALL DIGITS INDEPENDENTLY

ABCD 123456789
[0-9] OR
\d
\d is a shortcut for a decimal digit. In Unicode-aware engines it also matches digits from other scripts, while [0-9] matches only the ASCII digits 0 through 9.
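
The difference between \d and [0-9] can be seen in a small sketch. It assumes Python 3's re module, where \d on text strings matches any Unicode decimal digit unless re.ASCII is set. The Arabic-Indic digits in the sample are added here for illustration and are not part of the notes.

```python
import re

# "ABCD 123" followed by the Arabic-Indic digits four, five, six.
text = "ABCD 123 \u0664\u0665\u0666"

unicode_digits = re.findall(r"\d", text)          # all six digits
ascii_only = re.findall(r"[0-9]", text)           # ASCII digits only
ascii_flag = re.findall(r"\d", text, re.ASCII)    # \d limited to ASCII

print(len(unicode_digits))  # 6
print(ascii_only)           # ['1', '2', '3']
print(ascii_flag)           # ['1', '2', '3']
```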

FIND ALL NON-DIGITS

ABCD 123456789
\D

FIND ALL WORD CHARACTERS INCLUDING NUMBERS IN ALL LANGUAGES

ABCD 123456789
\w

FIND ALL NON-WORD CHARACTERS (SPACES, PERIODS)

ABCD 123456789
\W

FIND ALL WHITE SPACE (TAB, SPACE, CARRIAGE RETURN, NEW LINE)

One tab _ )
Two Three
\s

MATCH ONE LETTER THAT IS FOLLOWED BY A DIGIT

a
a2
a345
678
[a-z][0-9]
Finds a2 and a3

MATCH ONE LETTER THAT IS FOLLOWED BY ZERO OR MORE DIGITS

a
a2
a345
678
[a-z][0-9]*
Finds a, a2 and a345

MATCH ONE LETTER THAT IS FOLLOWED BY ONE OR MORE DIGITS

a
a2
a345
678
[a-z][0-9]+
Finds a2 and a345

MATCH ONE LETTER THAT IS FOLLOWED BY AN OPTIONAL DIGIT

a
a2
a345
678
[a-z][0-9]?
Finds a, a2 and a3 (? allows at most one digit, so only the first digit of a345 is included)
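
The four quantifier sections above use the same sample, so they can be compared side by side. This is a minimal sketch assuming Python's re module; the dictionary name results is illustrative.

```python
import re

text = "a\na2\na345\n678"

# Same base pattern with no quantifier, *, +, and ? on the digit.
results = {q: re.findall(r"[a-z][0-9]" + q, text) for q in ["", "*", "+", "?"]}

print(results[""])   # ['a2', 'a3']        exactly one digit
print(results["*"])  # ['a', 'a2', 'a345'] zero or more digits
print(results["+"])  # ['a2', 'a345']      one or more digits
print(results["?"])  # ['a', 'a2', 'a3']   zero or one digit
```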

MATCH TWO LETTERS THAT ARE FOLLOWED BY ONE OR MORE DIGITS

a
a2
a345
678
ab1
ab234
[a-z]{2}[0-9]+
Finds ab1, ab234

MATCH 2-4 LETTERS THAT ARE FOLLOWED BY ONE OR MORE DIGITS

a
a2
a345
678
ab1
ab234
abc56
abcd678
abcde789
[a-z]{2,4}[0-9]+
Finds ab1, ab234, abc56, abcd678, bcde789 (abcde789 has five letters, so the match starts at its second letter)
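
"One or more digits" is written with +, while ? means at most one digit. A sketch comparing the two on the sample above, assuming Python's re module. The \b variant at the end is an extra illustration, not part of the notes, showing one way to skip abcde789 entirely.

```python
import re

text = "\n".join(["a", "a2", "a345", "678", "ab1", "ab234",
                  "abc56", "abcd678", "abcde789"])

# 2-4 letters followed by one or more digits.
one_or_more = re.findall(r"[a-z]{2,4}[0-9]+", text)
print(one_or_more)  # ['ab1', 'ab234', 'abc56', 'abcd678', 'bcde789']

# 2-4 letters followed by at most one digit.
optional = re.findall(r"[a-z]{2,4}[0-9]?", text)
print(optional)     # ['ab1', 'ab2', 'abc5', 'abcd6', 'abcd']

# Word boundaries reject abcde789 instead of matching part of it.
whole_words = re.findall(r"\b[a-z]{2,4}[0-9]+\b", text)
print(whole_words)  # ['ab1', 'ab234', 'abc56', 'abcd678']
```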

CONDITIONAL STATEMENTS

123
abcd
1234
\b(\d{3}|[a-z]{4})\b
Matches 123 and abcd but not 1234: the pattern accepts only a whole word of exactly 3 digits or exactly 4 lowercase letters.
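
A sketch of this last example, assuming Python's re module. With a single capture group, re.findall returns the text of that group for each match.

```python
import re

text = "123\nabcd\n1234"

# Either exactly 3 digits or exactly 4 lowercase letters, as a whole word.
found = re.findall(r"\b(\d{3}|[a-z]{4})\b", text)

print(found)  # ['123', 'abcd']
```

1234 is skipped because the closing \b cannot match between the third and fourth digit.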