JavaScript: The Good Parts

Chapter 7 Regular Expressions

An Example

In a JavaScript program, the regular expression must be on a single line. Whitespace is significant:

var parse_url = /^(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;
var url = "http://www.ora.com:80/goodparts?q#fragment";
var result = parse_url.exec(url);
var names = ['url', 'scheme', 'slash', 'host', 'port',
    'path', 'query', 'hash'];
var blanks = '       ';
var i;
for (i = 0; i < names.length; i += 1) {
    document.writeln(names[i] + ':' +
        blanks.substring(names[i].length), result[i]);
}
This produces:
url:    http://www.ora.com:80/goodparts?q#fragment
scheme: http
slash:  //
host:   www.ora.com
port:   80
path:   goodparts
query:  q
hash:   fragment
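
As a minimal sketch, the result array can also be folded into an object keyed by the same names. The function name parse_url_parts is hypothetical (it is not from the book), and console.log stands in for document.writeln so that the snippet also runs outside a browser.

```javascript
var parse_url = /^(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;
var names = ['url', 'scheme', 'slash', 'host', 'port',
    'path', 'query', 'hash'];

// Hypothetical helper: returns an object of named parts, or null when
// the text does not match parse_url at all.
function parse_url_parts(url) {
    var result = parse_url.exec(url);
    var parts = {};
    var i;
    if (result === null) {
        return null;
    }
    for (i = 0; i < names.length; i += 1) {
        parts[names[i]] = result[i];
    }
    return parts;
}

var parts = parse_url_parts('http://www.ora.com:80/goodparts?q#fragment');
console.log(parts.host);    // www.ora.com
console.log(parts.port);    // 80

// Optional groups that take no part in the match are left undefined.
var missing_scheme = parse_url_parts('www.ora.com');
console.log(missing_scheme.scheme === undefined);    // true
```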

(?:([A-Za-z]+):)? This factor matches a scheme name, but only if it is followed by a : (colon).

The (?:...) indicates a noncapturing group.
The suffix ? indicates that the group is optional. It means repeat zero or one time.
The (...) indicates a capturing group. A capturing group copies the text it matches and places it in the result array. Each capturing group is given a number. This first capturing group is 1, so a copy of the text matched by this capturing group will appear in result[1].
The [...] indicates a character class. This character class, A-Za-z, contains 26 uppercase letters and 26 lowercase letters. The hyphens indicate ranges, from A to Z and from a to z. The suffix + indicates that the character class will be matched one or more times. The group is followed by the : character, which will be matched literally.
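
To see how the noncapturing and capturing groups described above differ, the scheme factor can be tried on its own. This is only a sketch: the variable names are illustrative, and console.log is used in place of document.writeln.

```javascript
// The scheme factor on its own, anchored at the start of the text.
var scheme_factor = /^(?:([A-Za-z]+):)?/;

var with_scheme = scheme_factor.exec('http://www.ora.com');
var without_scheme = scheme_factor.exec('www.ora.com');

// Element 0 is the whole match, including the colon consumed by the
// noncapturing group; element 1 holds only the capturing group's text.
console.log(with_scheme[0]);    // http:
console.log(with_scheme[1]);    // http

// The whole group is optional, so the match still succeeds on the empty
// string and the capturing group is left undefined.
console.log(without_scheme[0] === '');           // true
console.log(without_scheme[1] === undefined);    // true
```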

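**Capturing group 2 (slash):** (\/{0,3})

\/ indicates that a / (slash) character should be matched. It is escaped with a \ (backslash) so that it is not misinterpreted as the end of the regular expression literal.
{0,3} indicates that the / will be matched 0, 1, 2, or 3 times.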

Capturing group 3 (host name): ([0-9.\-A-Za-z]+)

Made up of one or more digits, letters, or . or -. The - is escaped as \- so that it is not confused with a range hyphen.

**Capturing group 4 (port number):** (?::(\d+))?

(?::(\d+))? optionally matches the port number, which is a sequence of one or more digits preceded by a :.
\d represents a digit character, so \d+ matches one or more digits.

Capturing group 5 (path): (?:\/([^?#]*))?

(?:...)? is another optional group, this one beginning with a /.
[^?#] begins with a ^, which indicates that the class includes all characters except ? and #.
* indicates that the character class is matched zero or more times.

Note that I am being sloppy here. The class of all characters except ? and # includes line-ending characters, control characters, and lots of other characters that really shouldn’t be matched here. Most of the time this will do what we want, but there is a risk that some bad text could slip through. Sloppy regular expressions are a popular source of security exploits. It is a lot easier to write sloppy regular expressions than rigorous regular expressions.
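
The risk can be seen in a short sketch. The stricter class below is only an illustrative assumption about which characters a path might allow; it is not from the book and is not a complete URL path grammar.

```javascript
// The path class from parse_url: everything except ? and #.
var sloppy_path = /^[^?#]*$/;

// A hypothetical stricter class, limited to a few visible ASCII characters.
var stricter_path = /^[0-9A-Za-z.\-_~\/%]*$/;

var suspicious = 'goodparts\n<script>';

// The sloppy class accepts the newline and the angle brackets.
var sloppy_accepts = sloppy_path.test(suspicious);
// The stricter class rejects them but still accepts an ordinary path.
var stricter_accepts = stricter_path.test(suspicious);
var stricter_accepts_plain = stricter_path.test('goodparts/ch7');

console.log(sloppy_accepts);            // true
console.log(stricter_accepts);          // false
console.log(stricter_accepts_plain);    // true
```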

Capturing group 6 (query): (?:\?([^#]*))?

Begins with a ?.
Contains zero or more characters that are not #.

Final optional group: (?:#(.*))?

Begins with a #.
The . will match any character except a line-ending character. The final $ represents the end of the string.

**Another example:**

A regular expression that matches numbers. Numbers can have an integer part with an optional minus sign, an optional fractional part, and an optional exponent part.

var parse_number = /^-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?$/i;
var test = function (num) {
   document.writeln(parse_number.test(num));
};
test('1');                // true
test('number');           // false
test('98.6');             // true
test('132.21.86.100');    // false
test('123.45E-67');       // true
test('123.45D-67');       // false

/^ $/i We again use ^ and $ to anchor the regular expression. This causes all of the characters in the text to be matched against the regular expression. If we had omitted the anchors, the regular expression would tell us if a string contains a number. With the anchors, it tells us if the string contains only a number. If we included just the ^, it would match strings starting with a number. If we included just the $, it would match strings ending with a number. We could have written the e factor as [Ee] or (?:E|e), but we didn’t have to because we used the i flag.
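
The four anchoring cases can be compared side by side. This sketch reuses the parse_number pattern with the anchors added or removed; the variable names are illustrative.

```javascript
var unanchored = /-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?/i;
var start_only = /^-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?/i;
var end_only = /-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?$/i;
var anchored = /^-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?$/i;

// No anchors: true if the text contains a number anywhere.
var contains = unanchored.test('about 98.6 degrees');
// Only ^: true if the text starts with a number.
var starts = start_only.test('98.6 degrees');
// Only $: true if the text ends with a number.
var ends = end_only.test('about 98.6');
// Both anchors: true only if the whole text is a number.
var whole = anchored.test('98.6');
var whole_with_extra = anchored.test('about 98.6 degrees');

console.log(contains, starts, ends, whole);    // true true true true
console.log(whole_with_extra);                 // false
```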

-? indicates that the minus sign is optional.
(?:\.\d*)? The group will match a decimal point followed by zero or more digits.

Construction

Regular expression literal: the preferred way to make a RegExp object.

var my_regexp = /"(?:\\.|[^\\\"])*"/g;

Table 7-1. Flags for regular expressions

Flag	Meaning
g	Global (match multiple times; the precise meaning of this varies with the method)
i	Insensitive (ignore character case)
m	Multiline (^ and $ can match line-ending characters)
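
A brief sketch of the three flags in action, using String.prototype.match and RegExp.prototype.test. The sample text and variable names are made up for illustration.

```javascript
var text = 'one\ntwo\nONE';

// Without g, match returns only the first match.
var first_only = text.match(/one/i)[0];
// With g, match returns every match; i makes it ignore case.
var all_any_case = text.match(/one/gi);
// Without i, only the lowercase occurrence matches.
var all_exact_case = text.match(/one/g);

// Without m, ^ and $ anchor only the whole text.
var whole_text = /^two$/.test(text);
// With m, they also match just after and just before a line ending.
var per_line = /^two$/m.test(text);

console.log(first_only);               // one
console.log(all_any_case.length);      // 2
console.log(all_exact_case.length);    // 1
console.log(whole_text);               // false
console.log(per_line);                 // true
```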

RegExp constructor: the constructor takes the pattern as a string, so it is usually necessary to double the backslashes and escape the quotes:

var my_regexp = new RegExp("'(?:\\\\.|[^\\\\\\'])*'", 'g');

The RegExp constructor is useful when a regular expression must be generated at runtime using material that is not available to the programmer.

Table 7-2. Properties of RegExp objects

Property	Use
global	true if the g flag was used.
ignoreCase	true if the i flag was used.
lastIndex	The index at which to start the next exec match. Initially it is zero.
multiline	true if the m flag was used.
source	The source text of the regular expression.
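
The following sketch builds a RegExp at runtime and then reads the properties from Table 7-2. escape_for_regexp is a hypothetical helper written for this example, not a built-in function, and user_text stands in for material that is only known at runtime.

```javascript
// Hypothetical helper: prefix each special character with a backslash so
// that arbitrary text is matched literally inside a pattern.
function escape_for_regexp(text) {
    return text.replace(/[\\\/\[\](){}?+*|.\^$\-]/g, '\\$&');
}

var user_text = 'a.b';
var finder = new RegExp(escape_for_regexp(user_text), 'gi');

console.log(finder.source);        // a\.b
console.log(finder.global);        // true
console.log(finder.ignoreCase);    // true
console.log(finder.multiline);     // false
console.log(finder.lastIndex);     // 0

// With the g flag, exec resumes from lastIndex and advances it.
var sample = 'A.B axb a.b';
finder.exec(sample);
var after_first = finder.lastIndex;
finder.exec(sample);
var after_second = finder.lastIndex;

console.log(after_first);     // 3
console.log(after_second);    // 11
```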

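In ECMAScript 3, the edition the book describes, RegExp objects made by regular expression literals share a single instance. Since ECMAScript 5, each evaluation of a literal creates a new object, so current engines print 0 below:

function make_a_matcher() {
    return /a/gi;
}
var x = make_a_matcher();
var y = make_a_matcher();
// Beware: in an ES3 engine, x and y are the same object!
x.lastIndex = 10;
document.writeln(y.lastIndex);    // 10 in ES3; 0 in ES5 and later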

Elements

Regexp Choice

"into".match(/in|int/)

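This matches the "in" in "into". It wouldn’t match "int" because the match of "in" was successful: a choice tries its sequences in order and stops at the first one that matches.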
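
A small sketch of how the order of the alternatives changes the result; the variable names are illustrative.

```javascript
// The alternatives are tried left to right at each position.
var short_first = 'into'.match(/in|int/);
var long_first = 'into'.match(/int|in/);

console.log(short_first[0]);    // in
console.log(long_first[0]);     // int
```

Putting the longer alternative first is one way to get the longer match when one alternative is a prefix of another.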

Regexp Factor

\ / [ ] ( ) { } ? + * | . ^ $

These special characters must be escaped with a \ prefix if they are to be matched literally.
An unescaped . matches any character except a line-ending character.
An unescaped ^ matches the beginning of the text when the lastIndex property is zero. It can also match line-ending characters when the m flag is specified.
An unescaped $ matches the end of the text. It can also match line-ending characters when the m flag is specified.
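
The difference between an escaped and an unescaped special character can be sketched as follows. The patterns and sample strings are made up for illustration.

```javascript
// Unescaped, . matches any single character except a line ending.
var loose = /^1.5$/;
// Escaped, \. matches only a literal period.
var strict = /^1\.5$/;

var loose_matches_x = loose.test('1x5');           // true
var strict_matches_x = strict.test('1x5');         // false
var strict_matches_dot = strict.test('1.5');       // true
var dot_matches_newline = /^a.b$/.test('a\nb');    // false

// $ and ( ) are special too, so they are escaped to match them literally.
var price = /\$\d+ \(USD\)/;
var price_found = price.test('Total: $25 (USD)');  // true

console.log(loose_matches_x, strict_matches_x, strict_matches_dot);
console.log(dot_matches_newline, price_found);
```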

Regexp Escape

In regexp factors, \b is not the backspace character.
\d is the same as [0-9]. It matches a digit. \D is the opposite: [^0-9].
\w is the same as [0-9A-Z_a-z]. \W is the opposite: [^0-9A-Z_a-z].
A simple letter class is [A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]. It includes all of Unicode’s letters, but it also includes thousands of characters that are not letters.
\b was intended to be a word-boundary anchor. Unfortunately, it uses \w to find word boundaries, so it is completely useless for multilingual applications.
\1 is a reference to the text that was captured by group 1.

For example, you could search text for duplicated words with:

var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1/gi;

doubled_words looks for occurrences of words (strings containing 1 or more letters) followed by whitespace followed by the same word. \2 is a reference to group 2, \3 is a reference to group 3, and so on.
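
A short usage sketch of doubled_words with match and replace. The sample sentence is made up for illustration.

```javascript
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1/gi;
var text = 'Paris in the the spring and and summer.';

// With the g flag, match returns every doubled word that was found.
var found = text.match(doubled_words);
// In a replacement string, $1 refers to the text captured by group 1.
var repaired = text.replace(doubled_words, '$1');

console.log(found);       // [ 'the the', 'and and' ]
console.log(repaired);    // Paris in the spring and summer.
```

Because the pattern has no boundary checks, it can also match across parts of words: in "so on" it matches "o o". Real text usually needs a stricter pattern.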

Regexp Group

There are four kinds of groups:

Capturing

Noncapturing

has a (?: prefix
slightly faster than a capturing group
does not interfere with the numbering of capturing groups

Positive lookahead

has a (?= prefix
It is like a noncapturing group except that after the group matches, the text is rewound to where the group started, effectively matching nothing. This is not a good part.

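(?=pattern) is a non-capturing, positive lookahead: it matches the search string at any point where text matching pattern begins, and the matched text is not captured for later use. For example, "Windows(?=95|98|NT|2000)" matches the "Windows" in "Windows2000" but not the "Windows" in "Windows3.1". A lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.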

Negative lookahead

has a (?! prefix
It is like a positive lookahead group, except that it matches only if it fails to match. This is not a good part.

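(?!pattern) is a non-capturing, negative lookahead: it matches the search string at any point where text that does not match pattern begins, and the matched text is not captured for later use. For example, "Windows(?!95|98|NT|2000)" matches the "Windows" in "Windows3.1" but not the "Windows" in "Windows2000".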
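
The two Windows patterns quoted above can be checked directly. This is a minimal sketch; the variable names are illustrative.

```javascript
var positive = /Windows(?=95|98|NT|2000)/;
var negative = /Windows(?!95|98|NT|2000)/;

// The lookahead must succeed, but its text is not part of the match.
var positive_match = positive.exec('Windows2000');
var positive_on_31 = positive.test('Windows3.1');

// The negative form succeeds only where the lookahead fails.
var negative_match = negative.exec('Windows3.1');
var negative_on_2000 = negative.test('Windows2000');

console.log(positive_match[0]);    // Windows
console.log(positive_on_31);       // false
console.log(negative_match[0]);    // Windows
console.log(negative_on_2000);     // false
```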

Regexp Class

A regexp class is a convenient way of specifying one of a set of characters. For example, if we wanted to match a vowel, we could write (?:a|e|i|o|u), but it is more conveniently written as the class [aeiou]. Classes provide two other conveniences.

The first is that ranges of characters can be specified. So, the set of 32 ASCII special characters:

! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

could be written as:

(?:!|"|#|\$|%|&|'|\(|\)|\*|\+|,|-|\.|\/|:|;|<|=|>|\?|@|\[|\\|]|\^|_|`|\{|\||\}|~)

but is slightly more nicely written as:

[!-\/:-@\[-`{-~]

which includes the characters from ! through / and : through @ and [ through ` and { through ~. It is still pretty nasty looking.

The other convenience is the complementing of a class. If the first character after the [ is ^, then the class excludes the specified characters. So [^!-\/:-@\[-`{-~] matches any character that is not one of the ASCII special characters.
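
As a quick check, the range class and its complement can be run against the 32 special characters. This is a sketch with made-up variable names and sample text.

```javascript
var special = /[!-\/:-@\[-`{-~]/g;
var not_special = /[^!-\/:-@\[-`{-~]/g;

// The 32 ASCII special characters, written as a string literal.
var all_specials = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~';

var special_count = all_specials.match(special).length;
// match returns null when a g pattern finds nothing.
var complement_hits = all_specials.match(not_special);

var mixed = 'a-b_c 1!';
var specials_in_mixed = mixed.match(special);
var others_in_mixed = mixed.match(not_special);

console.log(special_count);        // 32
console.log(complement_hits);      // null
console.log(specials_in_mixed);    // [ '-', '_', '!' ]
console.log(others_in_mixed);      // [ 'a', 'b', 'c', ' ', '1' ]
```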

Regexp Class Escape

The rules of escapement within a character class are slightly different than those for a regexp factor. [\b] is the backspace character. Here are the special characters that should be escaped in a character class: - / [ \ ] ^
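
The difference between \b inside and outside a class, and the escaping of the class specials, can be sketched as follows. The variable names are illustrative.

```javascript
// Inside a class, [\b] is the backspace character (\u0008).
var backspace_in_class = /[\b]/.test('\b');     // true
// Outside a class, \b is a word-boundary anchor, so it does not match
// a backspace character.
var backspace_as_factor = /\b/.test('\b');      // false

// The class specials - / [ \ ] ^ escaped so that each is matched literally.
var class_specials = /^[\-\/\[\\\]\^]+$/;
var all_six = class_specials.test('-/[\\]^');   // true
var letter = class_specials.test('a');          // false

console.log(backspace_in_class, backspace_as_factor);
console.log(all_six, letter);
```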

Regexp Quantifier

A number wrapped in curly braces means that the factor should match that many times. So, /www/ matches the same as /w{3}/. {3,6} will match 3, 4, 5, or 6 times. {3,} will match 3 or more times.
? is the same as {0,1}. * is the same as {0,}. + is the same as {1,}.

Matching tends to be greedy, matching as many repetitions as possible up to the limit, if there is one. If the quantifier has an extra ? suffix, then matching tends to be lazy, attempting to match as few repetitions as possible. It is usually best to stick with the greedy matching.
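
A minimal sketch of the counted quantifiers and of greedy versus lazy matching. The sample strings are made up for illustration.

```javascript
// Counted quantifiers.
var exactly_three = /^w{3}$/.test('www');           // true
var too_few = /^w{3,6}$/.test('ww');                // false
var within_range = /^w{3,6}$/.test('wwwwww');       // true
var open_ended = /^w{3,}$/.test('wwwwwwww');        // true

var html = '<b>bold</b> and <i>italic</i>';

// Greedy: .+ takes as much as it can, so the match runs to the last >.
var greedy = html.match(/<.+>/)[0];
// Lazy: .+? takes as little as it can, so the match stops at the first >.
var lazy = html.match(/<.+?>/)[0];

console.log(greedy);    // <b>bold</b> and <i>italic</i>
console.log(lazy);      // <b>
```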
