JRegeEX - rFronteddu/general_wiki GitHub Wiki
A regular expression (RegEx) is a way of describing a set of strings based on characteristics that the strings share.
String Literals
The most basic form of pattern matching supported by this API is the match of a string literal. For example, if the regular expression is foo and the input string is foo, the match will succeed because the strings are identical.
Several metacharacters can be used to alter how literals are evaluated. The metacharacters supported by this API are: <([{^-=$!|]})?*+.> and there are two ways to force a metacharacter to be treated as an ordinary character:
- precede the metacharacter with a backslash
- enclose it within \Q (which starts the quote) and \E (which ends it).
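As a minimal sketch of the two escaping options above, the class below compares an unescaped pattern with a backslash-escaped one and a \Q...\E quoted one. The class name `QuotingDemo` and its helper methods are made up for illustration; only `Pattern.matches` comes from the Java API.

```java
import java.util.regex.Pattern;

class QuotingDemo {
    // Unescaped: '.' is a metacharacter and matches any character.
    static boolean unescaped(String input) {
        return Pattern.matches("a.b", input);
    }

    // Backslash escape: "\\." in Java source is the regex \. (a literal period).
    static boolean backslashEscaped(String input) {
        return Pattern.matches("a\\.b", input);
    }

    // \Q...\E quoting: everything between the two markers is taken literally,
    // so '+' and '=' lose any special meaning.
    static boolean quoted(String input) {
        return Pattern.matches("\\Q1+1=2\\E", input);
    }

    public static void main(String[] args) {
        System.out.println(unescaped("axb"));        // true
        System.out.println(backslashEscaped("axb")); // false
        System.out.println(backslashEscaped("a.b")); // true
        System.out.println(quoted("1+1=2"));         // true
        System.out.println(quoted("11=2"));          // false
    }
}
```

When the literal text is only known at run time, `Pattern.quote(String)` returns a quoted pattern string for it, so the markers do not have to be typed by hand.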
Some Metacharacters
Java's Pattern class documentation is a really handy reference for predefined character classes.
- . The period will match any character; it does not have to be a letter.
- + When appended to a character or character class, it means 'one or more instances of the previous character'.
- * When appended to a character or character class, it means 'zero or more instances of the previous character'.
- ^ At the start of a pattern (outside a character class), it anchors the match to the beginning of the input; as the first character inside a bracketed character class, it means negation/not. For example, ^[a].+ or ^a.+ matches any sequence of 2 or more characters starting with the letter a, and ^[^a].+ matches any sequence of 2 or more characters not starting with a.
- $ At the end of a pattern, it anchors the match to the end of the input, so the input must end with whatever precedes the $. For example, .+a$ will match a sequence of 2 or more characters ending in a.
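The metacharacters in the list above can be tried out with a few whole-string matches. This is only a sketch: `MetacharacterDemo` and its `matches` helper are illustrative names, and `Pattern.matches` requires the entire input to match the pattern.

```java
import java.util.regex.Pattern;

class MetacharacterDemo {
    // True when the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("c.t", "c9t"));        // true: '.' matches any character
        System.out.println(matches("ab+", "a"));          // false: '+' needs at least one b
        System.out.println(matches("ab+", "abbb"));       // true
        System.out.println(matches("ab*", "a"));          // true: '*' allows zero b's
        System.out.println(matches("^a.+", "apple"));     // true: starts with a, 2+ characters
        System.out.println(matches("^a.+", "a"));         // false: only 1 character
        System.out.println(matches("^[^a].+", "apple"));  // false: first character is a
        System.out.println(matches("^[^a].+", "banana")); // true
        System.out.println(matches(".+a$", "banana"));    // true: ends in a
        System.out.println(matches(".+a$", "bananas"));   // false: ends in s
    }
}
```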
Character class
The word "class" in the phrase "character class" does not refer to a .class file. In the context of regular expressions, a character class is a set of characters enclosed within square brackets. It specifies the characters that will successfully match a single character from a given input string.
Simple Classes
| Construct | Description | Example |
|---|---|---|
| [abc] | a, b, or c (simple class) | [bcr]at matches bat, cat, and rat |
| [^abc] | Any character except a, b, or c (negation) | [^bcr]at matches hat, but not bat, cat, or rat |
| [a-zA-Z] | a through z, or A through Z, inclusive (range) | foo[1-5] matches foo1 through foo5 |
| [a-d[m-p]] | a through d, or m through p: [a-dm-p] (union) | [0-4[6-8]] matches 0 through 4 and 6 through 8 |
| [a-z&&[def]] | d, e, or f (intersection) | [0-9&&[345]] matches 3, 4, and 5 |
| [a-z&&[^bc]] | a through z, except for b and c: [ad-z] (subtraction) | [0-9&&[^345]] matches any digit except 3, 4, and 5 |
| [a-z&&[^m-p]] | a through z, and not m through p: [a-lq-z] (subtraction) | [a-z&&[^m-p]] matches a and z, but not n |
- Simple Class: The match succeeds only when the character at that position is one of the characters listed in the class. With [bcr]at, the first letter must be b, c, or r.
- Negation: Matches any character except those listed. Insert "^" at the beginning of the class.
- Ranges: "-" can be used to specify ranges such as [1-5] or [a-h].
- Union: Nest one class inside another to match characters from either class.
- Intersection: Use "&&" to match only the characters common to the intersected classes. This works with ranges too.
- Subtraction: Combine "&&" with a negated nested class to remove that class's characters from the outer class.
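To make the class operations above concrete, here is a small sketch that checks one input per construct. The class `CharacterClassDemo` and the helper `matches` are illustrative names, not part of the API.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // True when the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Simple class: first letter must be b, c, or r.
        System.out.println(matches("[bcr]at", "bat"));      // true
        System.out.println(matches("[bcr]at", "hat"));      // false
        // Negation: first letter must NOT be b, c, or r.
        System.out.println(matches("[^bcr]at", "hat"));     // true
        // Range: last character must be 1 through 5.
        System.out.println(matches("foo[1-5]", "foo5"));    // true
        System.out.println(matches("foo[1-5]", "foo6"));    // false
        // Union: 0-4 or 6-8, so 5 is excluded.
        System.out.println(matches("[0-4[6-8]]", "7"));     // true
        System.out.println(matches("[0-4[6-8]]", "5"));     // false
        // Intersection: digits that are also in [345].
        System.out.println(matches("[0-9&&[345]]", "4"));   // true
        System.out.println(matches("[0-9&&[345]]", "6"));   // false
        // Subtraction: digits except 3, 4, and 5.
        System.out.println(matches("[0-9&&[^345]]", "6"));  // true
        System.out.println(matches("[0-9&&[^345]]", "4"));  // false
    }
}
```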
Predefined Classes
The Pattern API contains several useful predefined character classes, which offer convenient shorthands for commonly used regular expressions:
| Construct | Description |
|---|---|
| . | Any character (may or may not match line terminators) |
| \d | A digit: [0-9] |
| \D | A non-digit: [^0-9] |
| \s | A whitespace character: [ \t\n\x0B\f\r] |
| \S | A non-whitespace character: [^\s] |
| \w | A word character: [a-zA-Z_0-9] |
| \W | A non-word character: [^\w] |
Constructs beginning with a backslash are called escaped constructs. If you are using an escaped construct within a string literal, you must precede the backslash with another backslash for the string to compile. For example:
private final String REGEX = "\\d"; // a single digit
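A short sketch of the predefined classes in use, with every backslash doubled inside the Java string literals as described above. `PredefinedClassDemo` and its `matches` helper are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // True when the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\d", "7"));              // true: a digit
        System.out.println(matches("\\D", "7"));              // false: \D is a non-digit
        System.out.println(matches("\\s", "\t"));             // true: tab is whitespace
        System.out.println(matches("\\S", "\t"));             // false
        System.out.println(matches("\\w+", "snake_case1"));   // true: letters, digits, underscore
        System.out.println(matches("\\w+", "two words"));     // false: space is not a word character
        System.out.println(matches("\\W", "!"));              // true: a non-word character
        System.out.println(matches("\\d\\d\\d-\\d\\d\\d\\d", "555-1234")); // true
    }
}
```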
Quantifiers
Quantifiers allow you to specify the number of occurrences to match against. Quantifiers can also attach to character classes and capturing groups, such as [abc]+ (a or b or c, one or more times) or (abc)+ (the group "abc", one or more times).
| Greedy | Reluctant | Possessive | Meaning |
|---|---|---|---|
| X? | X?? | X?+ | X, once or not at all |
| X* | X*? | X*+ | X, zero or more times |
| X+ | X+? | X++ | X, one or more times |
| X{n} | X{n}? | X{n}+ | X, exactly n times |
| X{n,} | X{n,}? | X{n,}+ | X, at least n times |
| X{n,m} | X{n,m}? | X{n,m}+ | X, at least n but not more than m times |
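The counts in the table, and the difference between quantifying a character class and quantifying a group, can be checked with a sketch like the one below. `QuantifierCountDemo` and `matches` are illustrative names; only the greedy forms are shown because all three forms accept the same counts.

```java
import java.util.regex.Pattern;

class QuantifierCountDemo {
    // True when the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("a?", ""));           // true: once or not at all
        System.out.println(matches("a*", ""));           // true: zero or more times
        System.out.println(matches("a+", ""));           // false: needs at least one a
        System.out.println(matches("a{3}", "aaa"));      // true: exactly 3
        System.out.println(matches("a{3}", "aa"));       // false
        System.out.println(matches("a{2,}", "aaaa"));    // true: at least 2
        System.out.println(matches("a{2,4}", "aaaaa"));  // false: more than 4
        // Character class: any mix of a, b, and c, one or more times.
        System.out.println(matches("[abc]+", "cabba"));  // true
        // Capturing group: the exact sequence "abc", one or more times.
        System.out.println(matches("(abc)+", "abcabc")); // true
        System.out.println(matches("(abc)+", "cabba"));  // false
    }
}
```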
Greedy quantifiers are considered "greedy" because they force the matcher to read in, or eat, the entire input string prior to attempting the first match. If the first match attempt (the entire input string) fails, the matcher backs off the input string by one character and tries again, repeating the process until a match is found or there are no more characters left to back off from. Depending on the quantifier used in the expression, the last thing it will try matching against is 1 or 0 characters.
The reluctant quantifiers, however, take the opposite approach: They start at the beginning of the input string, then reluctantly eat one character at a time looking for a match. The last thing they try is the entire input string.
Finally, the possessive quantifiers always eat the entire input string, trying once (and only once) for a match. Unlike the greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.
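The three behaviours described above are easiest to see by running the same base pattern in each mode against one input. The sketch below assumes the sample input "xfooxxxxxxfoo"; `QuantifierModesDemo` and `findAll` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModesDemo {
    // Collects every successive match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* eats everything, then backs off until "foo" fits at the end.
        System.out.println(findAll(".*foo", input));  // [xfooxxxxxxfoo]
        // Reluctant: .*? grows one character at a time, so it stops at the first "foo".
        System.out.println(findAll(".*?foo", input)); // [xfoo, xxxxxxfoo]
        // Possessive: .*+ eats everything and never backs off, so "foo" cannot match.
        System.out.println(findAll(".*+foo", input)); // []
    }
}
```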
Pattern and Matcher
For Java, the important classes to know are Pattern and Matcher. A Pattern object is a compiled representation of a regular expression. To create a Pattern object, you must invoke one of its static compile methods (usually Pattern.compile(String regex)). You then call the Pattern's matcher method to create a Matcher object, which applies the compiled RegEx to the String you want to check.
Examples:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// This will match a sequence of 1 or more uppercase and lowercase English letters as well as whitespace
String myRegExString = "[a-zA-Z\\s]+";
// This is the string we will check to see if our regex matches:
String myString = "The quick brown fox jumped over the lazy dog...";
// Create a Pattern object (compiled RegEx) and save it as 'p'
Pattern p = Pattern.compile(myRegExString);
// We need a Matcher to match our compiled RegEx to a String
Matcher m = p.matcher(myString);
// If our Matcher finds a match
if (m.find()) {
    // Print the match: everything before the trailing "..."
    System.out.println(m.group());
}
The following matches runs of one or more consecutive alphabetic characters:
String s = "Hello, Goodbye, Farewell";
Pattern p = Pattern.compile("\\p{Alpha}+");
Matcher m = p.matcher(s);
// Prints Hello, Goodbye, and Farewell, each on its own line
while (m.find()) {
    System.out.println(m.group());
}