JavaScript The Good Parts - Orange168/NotesOnReading GitHub Wiki
In a JavaScript program, the regular expression must be on a single line. Whitespace is significant:
var parse_url = /^(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;
var url = "http://www.ora.com:80/goodparts?q#fragment";
var result = parse_url.exec(url);
var names = ['url', 'scheme', 'slash', 'host', 'port',
'path', 'query', 'hash'];
var blanks = '       ';
var i;
for (i = 0; i < names.length; i += 1) {
document.writeln(names[i] + ':' +
blanks.substring(names[i].length), result[i]);
}
This produces:
url: http://www.ora.com:80/goodparts?q#fragment
scheme: http
slash: //
host: www.ora.com
port: 80
path: goodparts
query: q
hash: fragment
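As a sketch of how the result array lines up with the names, the loop above can be turned into a small helper that returns an object. The helper name `parse_url_parts` is an assumption of this example and is not from the book, and `console.log` stands in for `document.writeln` so that the sketch also runs outside a browser.

```javascript
var parse_url = /^(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;
var names = ['url', 'scheme', 'slash', 'host', 'port',
    'path', 'query', 'hash'];

// Hypothetical helper: returns an object keyed by the names above,
// or null when the text does not match parse_url at all.
function parse_url_parts(url) {
    var result = parse_url.exec(url);
    var parts = {};
    var i;
    if (result === null) {
        return null;
    }
    for (i = 0; i < names.length; i += 1) {
        parts[names[i]] = result[i];
    }
    return parts;
}

var parts = parse_url_parts('http://www.ora.com:80/goodparts?q#fragment');
console.log(parts.host + ' on port ' + parts.port);
```

Captured values are always strings, so the port comes back as '80', not the number 80.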
(?:([A-Za-z]+):)?
This factor matches a scheme name, but only if it is followed by a : (colon).
- The (?:...) indicates a noncapturing group.
- The suffix ? indicates that the group is optional. It means repeat zero or one time.
- The (...) indicates a capturing group. A capturing group copies the text it matches and places it in the result array. Each capturing group is given a number. This first capturing group is 1, so a copy of the text matched by this capturing group will appear in result[1].
- The [...] indicates a character class. This character class, A-Za-z, contains 26 uppercase letters and 26 lowercase letters. The hyphens indicate ranges, from A to Z and from a to z.
- The suffix + indicates that the character class will be matched one or more times. The group is followed by the : character, which will be matched literally.
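To see the difference between the outer noncapturing group and the inner capturing group, the scheme factor can be tried on its own. This is a minimal sketch; the variable names and sample strings are made up for illustration.

```javascript
// The scheme factor in isolation, anchored at the start of the text.
var scheme_factor = /^(?:([A-Za-z]+):)?/;

var with_scheme = scheme_factor.exec('mailto:someone');
var without_scheme = scheme_factor.exec('www.ora.com');

// The whole match includes the colon, but the capture does not,
// because the colon sits outside the capturing parentheses.
console.log(with_scheme[0]);
console.log(with_scheme[1]);

// The group is optional, so the factor still succeeds by matching
// nothing, and the capture is left undefined.
console.log(without_scheme[0] === '');
console.log(without_scheme[1] === undefined);
```

Only one capture slot appears in the result because the (?:...) group is not numbered.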
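**Capturing group 2 (slash):** (\/{0,3})
- \/ indicates that a / (slash) character should be matched. It is escaped with \ (backslash) so that it is not misinterpreted as the end of the regular expression literal.
- {0,3} indicates that the / will be matched 0, 1, 2, or 3 times.
**Capturing group 3 (host name):** ([0-9.\-A-Za-z]+)
- Made up of one or more digits, letters, or . or -. The - is escaped as \- to prevent it from being confused with a range hyphen.
**Capturing group 4 (port number):** (?::(\d+))?
- (?::...)? is an optional noncapturing group: a sequence of one or more digits preceded by a :.
- \d represents a digit character, so \d+ matches one or more digits.
**Capturing group 5 (path):** (?:\/([^?#]*))?
- (?:...)? is another optional group, this one beginning with a /.
- [^?#] is a character class that includes all characters except ? and #.
- The * indicates that the character class is matched zero or more times.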
Note that I am being sloppy here. The class of all characters except ? and # includes line-ending characters, control characters, and lots of other characters that really shouldn’t be matched here. Most of the time this will do what we want, but there is a risk that some bad text could slip through. Sloppy regular expressions are a popular source of security exploits. It is a lot easier to write sloppy regular expressions than rigorous regular expressions.
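The sloppiness is easy to demonstrate. In the sketch below, `sloppy_path` is the class from the text, while `stricter_path` is an assumption made up for this example (an arbitrary allow-list, not a rule taken from the book or from any URL specification).

```javascript
// The class from the text: anything except ? and #.
var sloppy_path = /^[^?#]*$/;

// Assumption for illustration only: an explicit allow-list of
// characters. A real application would choose its own set.
var stricter_path = /^[0-9A-Za-z\-._~\/%]*$/;

var suspicious = 'goodparts\nSet-Cookie: x';

console.log(sloppy_path.test(suspicious));       // the line ending slips through
console.log(stricter_path.test(suspicious));     // rejected by the allow-list
console.log(stricter_path.test('goodparts/ch7'));
```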
**Capturing group 6 (query):** (?:\?([^#]*))?
- Begins with a ?.
- Contains zero or more characters that are not #.
**Final optional group (hash):** (?:#(.*))?
- Begins with a #.
- The . will match any character except a line-ending character. The $ represents the end of the string.
**Another example:**
A regular expression that matches numbers. Numbers can have an integer part with an optional minus sign, an optional fractional part, and an optional exponent part.
var parse_number = /^-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?$/i;
var test = function (num) {
document.writeln(parse_number.test(num));
};
test('1'); // true
test('number'); // false
test('98.6'); // true
test('132.21.86.100'); // false
test('123.45E-67'); // true
test('123.45D-67'); // false
/^ $/i
We again use ^ and $ to anchor the regular expression. This causes all of the characters in the text to be matched against the regular expression. If we had omitted the anchors, the regular expression would tell us if a string contains a number. With the anchors, it tells us if the string contains only a number. If we included just the ^, it would match strings starting with a number. If we included just the $, it would match strings ending with a number.
We could have written the e factor as [Ee] or (?:E|e), but we didn’t have to because we used the i flag.
- -? indicates that the minus sign is optional.
- (?:\.\d*)? is an optional noncapturing group. The group will match a decimal point followed by zero or more digits.
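The four anchoring cases described above can be compared side by side. This is a sketch only: the variable names and the sample strings are made up for illustration.

```javascript
// The same number pattern with both anchors, none, only ^, and only $.
var anchored = /^-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?$/i;
var unanchored = /-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?/i;
var starts_with = /^-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?/i;
var ends_with = /-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?$/i;

console.log(anchored.test('98.6'));            // contains only a number
console.log(anchored.test('temp 98.6 F'));     // extra text, so no match
console.log(unanchored.test('temp 98.6 F'));   // contains a number
console.log(starts_with.test('98.6 F'));       // starts with a number
console.log(ends_with.test('temp 98.6'));      // ends with a number
```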
Regular expressions
var my_regexp = /"(?:\\.|[^\\\"])*"/g;
Table 7-1. Flags for regular expressions
Flag | Meaning |
---|---|
g | Global (match multiple times; the precise meaning of this varies with the method) |
i | Insensitive (ignore character case) |
m | Multiline (^ and $ can match line-ending characters) |
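A short sketch of what each flag changes, using the string method match and the RegExp method test. The sample text and variable names are made up for illustration.

```javascript
var text = 'Ada\nada\nADA';

// Without g, match stops at the first match; with g it collects all.
var first_only = text.match(/ada/);
var all_lowercase = text.match(/ada/g);

// Adding i ignores character case, so all three lines match.
var all_any_case = text.match(/ada/gi);

// Without m, ^ and $ only match at the ends of the whole text.
var whole_text = /^ada$/.test(text);
var any_line = /^ada$/m.test(text);

console.log(first_only[0], all_lowercase.length, all_any_case.length);
console.log(whole_text, any_line);
```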
RegExp constructor
It is usually necessary to double the backslashes and escape the quotes:
var my_regexp = new RegExp("'(?:\\\\.|[^\\\\\\'])*'", 'g');
The RegExp constructor is useful when a regular expression must be generated at runtime using material that is not available to the programmer.
Table 7-2. Properties of RegExp objects
Property | Use |
---|---|
global | true if the g flag was used. |
ignoreCase | true if the i flag was used. |
lastIndex | The index at which to start the next exec match. Initially it is zero. |
multiline | true if the m flag was used. |
source | The source text of the regular expression. |
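The following sketch builds a regular expression at runtime from text that is only known when the program runs, then inspects the properties from Table 7-2. The helper `escape_regexp` and the variable `user_input` are assumptions made up for this example; they are not part of the language or the book.

```javascript
// Hypothetical helper: prefix every regexp special character with a
// backslash so the text is matched literally.
function escape_regexp(text) {
    return text.replace(/[\\\/\[\](){}?+*|.^$]/g, '\\$&');
}

var user_input = '1+1=2?';

// Without escaping, + and ? act as quantifiers, so the pattern
// does not even match its own text.
var unescaped = new RegExp(user_input);
console.log(unescaped.test(user_input));

var matcher = new RegExp(escape_regexp(user_input), 'gi');
console.log(matcher.source);
console.log(matcher.global, matcher.ignoreCase, matcher.multiline);
console.log(matcher.lastIndex);

// With the g flag, a successful test moves lastIndex past the match.
console.log(matcher.test('Is 1+1=2? Yes'));
console.log(matcher.lastIndex);
```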
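In ES3, RegExp objects made by regular expression literals share a single instance. ES5 changed this: each evaluation of a literal produces a new object, so current engines no longer share state here:
function make_a_matcher() {
    return /a/gi;
}
var x = make_a_matcher();
var y = make_a_matcher();
// Beware: in ES3, x and y are the same object!
x.lastIndex = 10;
document.writeln(y.lastIndex);    // 10 in ES3; 0 in ES5 and later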
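Whether or not two variables end up holding the same object, any RegExp with the g flag carries its lastIndex from one call to the next, which is why accidental sharing is a hazard. A minimal sketch, with made-up variable names:

```javascript
var matcher = /a/g;

// The first test succeeds and leaves lastIndex just past the match.
var first = matcher.test('a');
var index_after_first = matcher.lastIndex;

// The second test starts searching at lastIndex, finds nothing,
// and resets lastIndex to zero.
var second = matcher.test('a');
var index_after_second = matcher.lastIndex;

console.log(first, index_after_first);
console.log(second, index_after_second);
```

Code that receives a g-flagged RegExp from elsewhere can set lastIndex to 0 before use if it needs a fresh search.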
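Regexp Choice
A regexp choice contains one or more sequences separated by |. The sequences are tried in order, and the first one that matches wins:
"into".match(/in|int/)
This matches the in in into. It wouldn’t match int because the match of in was successful.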
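Because the alternatives are tried in order, swapping them changes the result. A small sketch with made-up variable names:

```javascript
// in is tried first and succeeds, so int is never attempted.
var short_first = 'into'.match(/in|int/);

// Putting the longer alternative first lets it win.
var long_first = 'into'.match(/int|in/);

console.log(short_first[0]);
console.log(long_first[0]);
```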
Regexp Factor
\ / [ ] ( ) { } ? + * | . ^ $
- These characters must be escaped with a \ prefix if they are to be matched literally.
- An unescaped . matches any character except a line-ending character.
- An unescaped ^ matches the beginning of the text when the lastIndex property is zero. It can also match line-ending characters when the m flag is specified.
- An unescaped $ matches the end of the text. It can also match line-ending characters when the m flag is specified.
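A brief sketch of what escaping changes for two of these characters, + and the dot. The variable names and sample strings are made up for illustration.

```javascript
// Escaped: the + is matched literally.
var literal_plus = /1\+1/;
// Unescaped: the + is a quantifier meaning one or more 1s.
var quantified = /1+1/;

console.log(literal_plus.test('1+1'));
console.log(quantified.test('1+1'));
console.log(quantified.test('111'));

// An unescaped . matches any character except a line ending.
var any_char = /a.c/;
var literal_dot = /a\.c/;

console.log(any_char.test('abc'));
console.log(any_char.test('a\nc'));
console.log(literal_dot.test('a.c'));
console.log(literal_dot.test('abc'));
```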
Regexp Escape
- In regexp factors, \b is not the backspace character.
- \d is the same as [0-9]. It matches a digit. \D is the opposite: [^0-9].
- \w is the same as [0-9A-Z_a-z]. \W is the opposite: [^0-9A-Z_a-z].
- A simple letter class is [A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]. It includes all of Unicode’s letters, but it also includes thousands of characters that are not letters.
- \b is a word-boundary anchor, but it uses \w to find word boundaries, so it is completely useless for multilingual applications.
- \1 is a reference to the text that was captured by group 1, so that it can be matched again. For example, you could search text for duplicated words with:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1/gi;
doubled_words looks for occurrences of words (strings containing 1 or more letters) followed by whitespace followed by the same word. \2 is a reference to group 2, \3 is a reference to group 3, and so on.
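The sketch below runs doubled_words on a made-up sentence and also shows why a word boundary based on \w misbehaves on non-ASCII letters. The sample strings and the other variable names are assumptions for illustration.

```javascript
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1/gi;

// With the g flag, match returns every doubled word it finds.
var found = 'the the quick brown fox fox'.match(doubled_words);
console.log(found);

// \w does not include the letter e with an acute accent (\u00e9),
// so \b reports a word boundary in the middle of this word.
var boundary_inside_word = /caf\b/.test('caf\u00e9');
console.log(boundary_inside_word);

// \d and \D are opposites.
console.log(/^\d+$/.test('2024'), /^\D+$/.test('2024'));
```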
Regexp Group
There are four kinds of groups:
Capturing
- is wrapped in parentheses; the text it matches is captured and given a number
Noncapturing
- has a (?: prefix
- has slightly faster performance
- does not interfere with the numbering of capturing groups.
Positive lookahead
- has a (?= prefix
- It is like a noncapturing group except that after the group matches, the text is rewound to where the group started, effectively matching nothing. This is not a good part.
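(?=pattern) is a non-capturing positive lookahead: it matches the search string at any point where text matching pattern begins, and the match is not captured for later use. For example, "Windows(?=95|98|NT|2000)" matches the "Windows" in "Windows2000" but not the "Windows" in "Windows3.1". A lookahead does not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.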
Negative lookahead
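- has a (?! prefix
- It is like a positive lookahead group, except that it matches only if it fails to match. This is not a good part.
(?!pattern) is a non-capturing negative lookahead: it matches the search string at any point where text that does not match pattern begins, and the match is not captured for later use. For example, "Windows(?!95|98|NT|2000)" matches the "Windows" in "Windows3.1" but not the "Windows" in "Windows2000".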
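The Windows examples can be checked directly, along with the way a noncapturing group stays out of the numbering. This is a sketch with made-up variable names.

```javascript
var positive = /Windows(?=95|98|NT|2000)/;
var negative = /Windows(?!95|98|NT|2000)/;

// The lookahead text is checked but is not part of the match.
var followed = positive.exec('Windows2000');
var not_followed = negative.exec('Windows3.1');

console.log(followed[0]);
console.log(positive.exec('Windows3.1'));
console.log(not_followed[0]);
console.log(negative.exec('Windows2000'));

// The noncapturing group is skipped when groups are numbered,
// so b is group 1.
var numbering = /(?:a)(b)/.exec('ab');
console.log(numbering[1]);
```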
Regexp Class
A regexp class is a convenient way of specifying one of a set of characters. For example, if we wanted to match a vowel, we could write (?:a|e|i|o|u), but it is more conveniently written as the class [aeiou]. Classes provide two other conveniences.
- Ranges of characters can be specified. So, the set of 32 ASCII special characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
could be written as:
(?:!|"|#|\$|%|&|'|\(|\)|\*|\+|,|-|\.|\/|:|;|<|=|>|\?|@|\[|\\|\]|\^|_|`|\{|\||\}|~)
but is slightly more nicely written as:
[!-\/:-@\[-`{-~]
which includes the characters from ! through / and : through @ and [ through ` and { through ~. It is still pretty nasty looking.
- The complementing of a class. If the first character after the [ is ^, then the class excludes the specified characters. So [^!-\/:-@\[-`{-~] matches any character that is not one of the ASCII special characters.
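As a check on the ranges, the sketch below tests every one of the 32 special characters against the class and tries a few characters that should fall outside it. The variable names are made up for illustration.

```javascript
var special = /^[!-\/:-@\[-`{-~]$/;
var not_special = /^[^!-\/:-@\[-`{-~]$/;

// The 32 ASCII special characters, written as one string.
var specials = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~';

var all_match = true;
var i;
for (i = 0; i < specials.length; i += 1) {
    if (!special.test(specials.charAt(i))) {
        all_match = false;
    }
}

console.log(specials.length, all_match);
console.log(special.test('a'), special.test('5'), special.test(' '));
console.log(not_special.test('a'), not_special.test('!'));
```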
Regexp Class Escape
The rules of escapement within a character class are slightly different than those for a regexp factor. [\b] is the backspace character. Here are the special characters that should be escaped in a character class: - / [ \ ] ^
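A short sketch of the two points above: \b means backspace only inside a class, and the listed characters can be matched literally inside a class when escaped. The variable names are made up for illustration.

```javascript
// Inside a class, \b is the backspace character.
var backspace_class = /[\b]/;
// Outside a class, \b is the word-boundary anchor.
var boundary_anchor = /\b/;

console.log(backspace_class.test('\b'));
console.log(boundary_anchor.test('\b'));

// A class containing the six characters - / [ \ ] ^, each escaped.
var class_specials = /^[\-\/\[\\\]\^]+$/;

console.log(class_specials.test('-/[\\]^'));
console.log(class_specials.test('a'));
```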
Regexp Quantifier
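- A regexp factor may have a quantifier suffix that determines how many times the factor should match. A number wrapped in curly braces means that the factor should match that many times. So, /www/ matches the same as /w{3}/. {3,6} will match 3, 4, 5, or 6 times. {3,} will match 3 or more times.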
- ? is the same as {0,1}. * is the same as {0,}. + is the same as {1,}.
Matching tends to be greedy, matching as many repetitions as possible up to the limit, if there is one. If the quantifier has an extra ? suffix, then matching tends to be lazy, attempting to match as few repetitions as possible. It is usually best to stick with the greedy matching.
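The difference between greedy and lazy matching, and the counted quantifiers, can be seen in a short sketch. The sample strings and variable names are made up for illustration.

```javascript
var html = '<b>bold</b>';

// Greedy: .+ takes as much as it can, up to the last >.
var greedy = html.match(/<.+>/)[0];
// Lazy: .+? takes as little as it can, up to the first >.
var lazy = html.match(/<.+?>/)[0];

console.log(greedy);
console.log(lazy);

// Counted quantifiers.
var exactly_three = /^w{3}$/;
var three_to_six = /^a{3,6}$/;

console.log(exactly_three.test('www'));
console.log(three_to_six.test('aa'), three_to_six.test('aaaa'), three_to_six.test('aaaaaaa'));
```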