Understanding regex expressions in Java - Fish-In-A-Suit/Conquest GitHub Wiki
Tutorial: Regular Expressions (Regex) Tutorial: How to Match Any Pattern of Text
A regular expression defines a search pattern for strings. The abbreviation for regular expression is regex. The search pattern can be anything from a simple character, a fixed string or a complex expression containing special characters describing the pattern. The pattern defined by the regex may match one or several times or not at all for a given string.
Regular expressions can be used to search, edit and manipulate text.
Note that
- regular expressions are case-sensitive by default. Searching `BKANJBWA` for `abc` won't return any matches.
- the pattern `.` matches any character except a new line. Searching `Poo` with `.` matches every character of `Poo`, because the dot is a special character. If you want to search for a literal dot, you have to precede it with the escape character (`\`), like this: `\.`
- all of the characters that need to be escaped like the dot are: `.[{()}]\^$|?*+`
Note that if you want to search for a backslash (`\`), you have to escape it with itself: `\\`
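In Java source code the backslash is also the escape character of string literals, so every regex backslash has to be doubled. The following is a minimal sketch of the escaping rules above; the class `EscapeDemo`, the helper `countMatches` and the sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // Counts the non-overlapping matches of regex in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "a.b.c\\d"; // the seven characters a . b . c \ d

        System.out.println(countMatches(".", input));    // 7: unescaped dot matches any character
        System.out.println(countMatches("\\.", input));  // 2: regex \. matches literal dots only
        System.out.println(countMatches("\\\\", input)); // 1: regex \\ matches one backslash
        // Pattern.quote turns a string into a pattern that matches it literally.
        System.out.println(countMatches(Pattern.quote("."), input)); // 2
    }
}
```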
Types of characters that we can match are:
. - any character except new line
\d - digit (0-9)
\D - not a digit (0-9)
\w - word character (a-z, A-Z, 0-9, _)
\W - not a word character
\s - whitespace (space, tab, newline)
\S - not whitespace
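To see what each of these character types covers, one option is to replace every match with a marker. This is only a sketch; the class `CharClassDemo`, the helper `mask` and the sample strings are hypothetical.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Replaces every match of regex in input with '#'.
    static String mask(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("#");
    }

    public static void main(String[] args) {
        System.out.println(mask("\\d", "a1 b2"));  // a# b#
        System.out.println(mask("\\D", "a1 b2"));  // #1##2
        System.out.println(mask("\\w", "a_1 b!")); // ### #!
        System.out.println(mask("\\W", "a_1 b!")); // a_1#b#
        System.out.println(mask("\\s", "a b\tc")); // a#b#c
        System.out.println(mask("\\S", "a b"));    // # #
    }
}
```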
Anchors (match invisible positions before or after characters):
\b - word boundary
\B - not a word boundary
^ - beginning of a string
$ - end of a string
Character sets, alternation and groups:
[] - (character set) matches characters in brackets
[^ ] - matches characters not in brackets
| - either or
() - group
Quantifiers:
* - 0 or more (of what you are searching for)
+ - 1 or more
? - 0 or one
{3} - exact number
{3,4} - range of numbers (minimum, maximum); write it without a space after the comma
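The quantifiers can be tried out by checking whether a whole string matches a pattern. A small sketch, assuming a hypothetical class `QuantifierDemo` and helper `fullMatch`; `Pattern.matches` only succeeds when the entire input matches.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("a*", ""));            // true: zero occurrences are allowed
        System.out.println(fullMatch("a+", ""));            // false: at least one 'a' is required
        System.out.println(fullMatch("colou?r", "color"));  // true: the 'u' is optional
        System.out.println(fullMatch("colou?r", "colour")); // true
        System.out.println(fullMatch("\\d{3}", "12"));      // false: exactly three digits needed
        System.out.println(fullMatch("\\d{3,4}", "1234"));  // true: three or four digits
        System.out.println(fullMatch("\\d{3,4}", "12345")); // false: five digits are too many
    }
}
```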
Example: searching the string `Ha HaHa` for word boundaries:
- the pattern `\bHa` matches the first `Ha` and the first `Ha` in `HaHa`. There is a word boundary at the start of the line (before the first `Ha`) and another one between the space and the `H` that follows it, so the first `Ha` in `HaHa` gets matched as well. The last `Ha` in `HaHa` doesn't get matched because there is no word boundary between the two "Ha"s.
- the pattern `\BHa` matches only the last `Ha` in `HaHa`.
- using two word boundaries, like `\bHa\b`, only matches the first `Ha` in the string.
- if you want to match a `Ha` that is only at the beginning of a string, use `^Ha`.
- if you want to match a `Ha` that is only at the end of a string, use `Ha$`.
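The boundary and anchor patterns above can be checked by printing where each match starts. This is a sketch with made-up names (`WordBoundaryDemo`, `matchStarts`); in `Ha HaHa` the three "Ha"s start at indices 0, 3 and 5.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Returns the start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "Ha HaHa";
        System.out.println(matchStarts("\\bHa", text));    // [0, 3]
        System.out.println(matchStarts("\\BHa", text));    // [5]
        System.out.println(matchStarts("\\bHa\\b", text)); // [0]
        System.out.println(matchStarts("^Ha", text));      // [0]
        System.out.println(matchStarts("Ha$", text));      // [5]
        // The anchor has to come last: nothing can follow the end of the string.
        System.out.println(matchStarts("$Ha", text));      // []
    }
}
```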
Example: searching through the telephone numbers `321-555-4321` and `123.555.1234`
- `\d\d\d.\d\d\d.\d\d\d\d`: "search for three digits, allow any character, search another three digits, allow any character and search for four digits". Goal: only match a phone number if it uses a dash or a dot as a separator --> use character sets.
- `\d\d\d[-.]\d\d\d[-.]\d\d\d\d`: search for three digits, then allow a dash or a dot, search another three digits, allow a dash or a dot and search another four digits. Note: the dot doesn't need to be escaped inside a character set, and a dash is literal when it comes first or last in the set. A few characters, such as `\`, `]`, `^` in first position and `-` between two characters, still have a special meaning there.
- Improvement with quantifiers: `\d{3}[-.]\d{3}[-.]\d{4}` achieves the same as the example before.
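The difference between the unrestricted dot and the character set shows up once a number with another separator is added. A sketch under that assumption: the third number (`123*555*1234`), the class `PhoneDemo` and the helper `findAll` are not part of the original example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhoneDemo {
    // Collects the text of every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "321-555-4321 123.555.1234 123*555*1234";

        // The dot accepts any separator, so all three numbers match.
        System.out.println(findAll("\\d\\d\\d.\\d\\d\\d.\\d\\d\\d\\d", text));
        // [-.] accepts only a dash or a dot, so the '*' number is skipped.
        System.out.println(findAll("\\d{3}[-.]\\d{3}[-.]\\d{4}", text));
    }
}
```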
Example: match telephone numbers only if they begin with 800 or 900, such as `800-555-4341` and `900-555-3456`:
- `[89]00[-.]\d\d\d[-.]\d\d\d\d`: allow 8 or 9 to be the first character, then search for 0 and 0 following that, then allow a dash or a dot, then search three digits, allow a dash or a dot and search four digits.
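The same pattern can also validate a single number instead of searching through text. A minimal sketch, assuming a hypothetical class `TollFreeDemo`; `matches()` requires the whole input to fit the pattern, and the 700 number is an invented counterexample.

```java
import java.util.regex.Pattern;

class TollFreeDemo {
    // Compiled once and reused for every check.
    static final Pattern NUMBER_800_900 = Pattern.compile("[89]00[-.]\\d\\d\\d[-.]\\d\\d\\d\\d");

    // True when the entire input is an 800 or 900 number.
    static boolean is800Or900(String input) {
        return NUMBER_800_900.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(is800Or900("800-555-4341")); // true
        System.out.println(is800Or900("900.555.3456")); // true
        System.out.println(is800Or900("700-555-1234")); // false: 7 is not in [89]
        System.out.println(is800Or900("800-555-43410")); // false: one digit too many
    }
}
```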
Example: specify a range of characters that you want to search through:
- if you're a noob: `[1234567]`
- if you're a pro: `[1-7]`
- `[a-zA-Z]` matches any lowercase letter a through z and any uppercase letter A through Z
If a caret (^) is the first character inside a character set, it negates the set, which then matches any character that is not listed (including whitespace and new lines). It is written as `[^...]`, where `...` stands for the characters to exclude. To match any character that is not a lowercase letter, use `[^a-z]`.
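Ranges and negated sets are easy to compare by deleting whatever they match. This is only a sketch; the class `CharSetDemo`, the helper `strip` and the sample strings are invented for illustration.

```java
class CharSetDemo {
    // Removes every match of regex from input.
    static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // The spelled-out set and the range are equivalent.
        System.out.println(strip("[1234567]", "0123456789")); // 089
        System.out.println(strip("[1-7]", "0123456789"));     // 089

        System.out.println(strip("[a-zA-Z]", "a1B2c3"));      // 123

        // Negated set: everything that is not a lowercase letter is removed,
        // including spaces, digits and new lines.
        System.out.println(strip("[^a-z]", "Hello World 42")); // elloorld
        System.out.println(strip("[^a-z]", "a\nb"));           // ab
    }
}
```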
Example: processing the following text
Mr.Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
- `Mr\.?\s[A-Z]\w*`: First, search for `Mr`. Then, allow a dot to follow `Mr` or not --> the question mark allows zero or one occurrence of the preceding pattern (in this case, the dot). Then, require one whitespace character and search for a capital letter. Then, with the use of the asterisk, allow zero or more word characters to follow. Do note that if you were to use a plus instead of the asterisk, therefore matching one or more word characters following the capital letter, Mr. T wouldn't be matched, since no characters follow the capital letter.
- Groups allow you to match several different patterns. To create a group, use parentheses: `M(r|s|rs)\.?\s[A-Z]\w*` --> first, look for an `M`, and then allow either `r`, `s` or `rs` to follow. Allow a dot to follow or not (question mark). Continue with a whitespace character, search for a capital letter and allow zero or more word characters to follow.
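Running both patterns over the sample text shows which names each one picks up. A sketch with hypothetical names (`TitleDemo`, `findAll`); the text is taken as written above, where `Mr.Schafer` has no whitespace after the dot, so the `\s` in both patterns rejects it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleDemo {
    // Collects the text of every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Mr.Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T";

        // Only "Mr", with or without a dot: [Mr Smith, Mr. T]
        System.out.println(findAll("Mr\\.?\\s[A-Z]\\w*", text));
        // With + at least one more word character is needed, so "Mr. T" drops out: [Mr Smith]
        System.out.println(findAll("Mr\\.?\\s[A-Z]\\w+", text));
        // The group accepts Mr, Ms and Mrs: [Mr Smith, Ms Davis, Mrs. Robinson, Mr. T]
        System.out.println(findAll("M(r|s|rs)\\.?\\s[A-Z]\\w*", text));
    }
}
```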
Some examples from a Collada XML parser:
private static final Pattern DATA = Pattern.compile(">(.+?)<");
private static final Pattern START_TAG = Pattern.compile("<(.+?)>");
private static final Pattern ATTR_NAME = Pattern.compile("(.+?)=");
private static final Pattern ATTR_VAL = Pattern.compile("\"(.+?)\"");
private static final Pattern CLOSED = Pattern.compile("(</|/>)");
- DATA refers to the data found between the end of the starting tag and the start of the end tag of a node. `>(.+?)<` = match one or more characters, but as few as possible (`+?` is a reluctant quantifier), between `>` and `<`. Example of such data in a .dae file: `<float_array id="Cube-mesh-map-0-array" count="8520">0.7784072 0.465048 0.7570394 0.4704542 0.7588501 0.4620934 0.7588501</float_array>`. If this was matched against `>(.+?)<`, the group would capture all of the numbers between the `>` after the count declaration and the `<` of the ending tag of the float array.
- START_TAG refers to all of the characters that are found between the opening `<` and the closing `>` of a tag. `<(.+?)>` = match one or more characters, as few as possible, between `<` and `>`. If this was matched against `<float_array id="Cube-mesh-map-0-array" count="8520">0.7784072 0.465048 0.7570394 0.4704542 0.7588501 0.4620934 0.7588501</float_array>`, the first match would be `<float_array id="Cube-mesh-map-0-array" count="8520">` and the next one `</float_array>`.
- ATTR_NAME refers to all of the characters that are found prior to an equals sign (i.e., the name of an attribute). `(.+?)=` = match one or more characters, as few as possible, from the input string up to the next `=` sign.
- ATTR_VAL refers to the value of the attribute (specified by ATTR_NAME). In Collada, the values of the attributes are enclosed within "" - note that they have to be escaped in Java using backslashes, so that Java doesn't read such characters as String termination ones. `\"(.+?)\"` = match one or more characters, as few as possible, that are found between `"` and `"`.
- CLOSED refers to the closing tag. It matches either `</` or `/>`: `(</|/>)`
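The five patterns can be exercised on a shortened version of the `float_array` line. This is a sketch, not the parser's actual code: the class `ColladaRegexDemo` and the helper `firstGroup` are hypothetical, `private` is dropped from the constants so they can be used from outside the class, and the number list is cut to three values. ATTR_NAME and ATTR_VAL are applied to an isolated attribute here, because `(.+?)=` captures everything from wherever the search starts up to the next `=`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColladaRegexDemo {
    static final Pattern DATA = Pattern.compile(">(.+?)<");
    static final Pattern START_TAG = Pattern.compile("<(.+?)>");
    static final Pattern ATTR_NAME = Pattern.compile("(.+?)=");
    static final Pattern ATTR_VAL = Pattern.compile("\"(.+?)\"");
    static final Pattern CLOSED = Pattern.compile("(</|/>)");

    // Returns what group 1 captured in the first match, or null if nothing matches.
    static String firstGroup(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "<float_array id=\"Cube-mesh-map-0-array\" count=\"8520\">"
                + "0.7784072 0.465048 0.7570394</float_array>";
        String attribute = "id=\"Cube-mesh-map-0-array\"";

        System.out.println(firstGroup(DATA, line));           // 0.7784072 0.465048 0.7570394
        System.out.println(firstGroup(START_TAG, line));      // float_array id="Cube-mesh-map-0-array" count="8520"
        System.out.println(firstGroup(ATTR_NAME, attribute)); // id
        System.out.println(firstGroup(ATTR_VAL, attribute));  // Cube-mesh-map-0-array
        System.out.println(firstGroup(CLOSED, line));         // </
    }
}
```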
Matching in Java using Pattern and Matcher:
In Java, performing regex operations is done using two classes: Pattern and Matcher. A regular expression, specified as a string, must first be compiled into an instance of the Pattern class (for example: `Pattern dot = Pattern.compile(".");`). The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression. All of the state involved in performing a match resides in the matcher, and not in the pattern.
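A short sketch of that workflow follows, with an assumed pattern (`\d+`) and made-up names (`PatternMatcherDemo`, `describeMatches`). It compiles the pattern once, creates a matcher per input, and shows that two matchers built from the same pattern keep their own positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternMatcherDemo {
    // Compiled once; the same Pattern can create any number of matchers.
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // Lists every match as text@start-end (end index is exclusive).
    static List<String> describeMatches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            found.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("a12b345")); // [12@1-3, 345@4-7]

        // The match position lives in each matcher, not in the shared pattern.
        Matcher first = NUMBER.matcher("1 2 3");
        Matcher second = NUMBER.matcher("1 2 3");
        first.find();
        first.find();
        second.find();
        System.out.println(first.group() + " " + second.group()); // 2 1

        System.out.println(NUMBER.matcher("123").matches());    // true: the whole input matches
        System.out.println(NUMBER.matcher("12ab").matches());   // false: "ab" is left over
        System.out.println(NUMBER.matcher("12ab").lookingAt()); // true: the beginning matches
    }
}
```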