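The material in these notes comes from 莫烦PYTHON; sincere thanks to the teacher for the tutorial!

These notes are for personal study only and have no commercial use. If you are interested, please look up 莫烦PYTHON, a professional teaching uploader.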

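Regular Expression or RegEx, a tool for matching characters

Regular expressions are a tool we will use often in web scraping. Suppose we want to scrape the title of every page of a site and happily skim the headlines; we will run into markup like this: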

    <title>
       ... I am handsome ...
    </title>

With a regular expression I can match this pattern and collect thousands of titles in one pass!

Read this first

Simple matching
Flexible matching
Matching by type
Repeated matching
Groups
findall
replace
split
compile

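Simple matching

The purpose of a regular expression is to find specific content in text, for example checking whether "are" or "cute" occurs in the sentence "dogs are cute". Regular expressions can do much more than that, though. To start, we need Python's re module.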
```
  import re
```
re.search(pattern, text) returns None if the pattern does not occur in the text; otherwise it returns a match object. Note the argument order: the pattern comes first, the text to search second.
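
As a minimal sketch of the sentence example above (the variable names are only illustrative), the search for a word that is present and one that is absent could look like this:

```python
import re

text = "dogs are cute"

# re.search(pattern, string) scans the string for the first match.
found = re.search(r"are", text)
missing = re.search(r"cat", text)

print(found)    # a match object, because "are" occurs in the text
print(missing)  # None, because "cat" does not occur

# A match object is truthy, so it can be used directly in a condition.
if found:
    print(found.group(), found.span())
```

The match object also records where the match was found, which is what span() reports.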

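Flexible matching

This is what sets regular expressions apart: special patterns let us match the text we are looking for flexibly.

To match any one of several possible characters, we use [ab], which means the character at that position can be either a or b.

    # The r prefix makes this a raw string, so backslashes in the pattern
    # reach the regex engine unchanged.
    ptn = r"r[au]n"
    re.search(ptn, "dog runs away")  # match object for 'run'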

Patterns can be used in many more ways, which the sections that follow describe.
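
One common extension of the bracket syntax, not covered in the original notes, is a range such as [a-z] or [0-9]. The sketch below uses made-up sample strings to show the idea:

```python
import re

# [au] lists the allowed characters one by one.
print(re.search(r"r[au]n", "dog runs away"))  # matches "run"
print(re.search(r"r[au]n", "dog rins away"))  # None: "i" is not in the set

# Ranges shorten the listing: [a-z] is any lowercase ASCII letter,
# [0-9] is any digit, and both can share one pair of brackets.
lower = re.search(r"r[a-z]n", "dog rins away")
digit = re.search(r"r[0-9]n", "dog r2ns away")
mixed = re.search(r"r[0-9a-z]n", "dog rUns away")  # uppercase U is excluded

print(lower.group(), digit.group(), mixed)
```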

Matching by type

Besides the sets we define ourselves, many predefined patterns are available:

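pattern	rule
`\d`	any digit
`\D`	any non-digit
`\s`	any whitespace character, i.e., the space, `\t`, `\n`, `\r`, `\f`, `\v`
`\S`	any non-whitespace character
`\w`	any upper- or lower-case letter, digit, or underscore; for ASCII text, `[a-zA-Z0-9_]`
`\W`	anything `\w` does not match
`\b`	word boundary: the empty position at the start or end of a word (it consumes no character)
`\B`	an empty position that is not at the start or end of a word
`\\`	matches a literal `\`
`.`	matches any character except `\n`
`^`	matches the start of the string
`$`	matches the end of the string
`?`	the preceding character is optional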
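
A few rows of the table can be tried out directly. The following is a small sketch with invented sample strings, not an exhaustive tour of every pattern:

```python
import re

# \d matches a digit, so \d+ picks out a run of digits.
number = re.search(r"\d+", "Room 42 is free").group()

# \w matches letters, digits and the underscore.
wordish = re.search(r"r\wn", "r_n ran").group()

# \b is a zero-width word boundary: "cat" inside "concat" is skipped.
boundary = re.search(r"\bcat\b", "concat cat").span()

# . matches any character except a newline.
anychar = re.search(r"r.n", "r[n").group()

# ? makes the preceding character optional.
optional = [re.search(r"colou?r", w).group() for w in ("color", "colour")]

# ^ and $ anchor the pattern to the start and end of the string.
starts = re.search(r"^dog", "dog runs to cat") is not None
ends = re.search(r"cat$", "dog runs to cat") is not None
middle = re.search(r"^runs", "dog runs to cat")

print(number, wordish, boundary, anychar, optional, starts, ends, middle)
```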

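Note: if a string spans several lines and we want ^ to match at the start of each line, the usual form does not work. For example, "and" begins the second line of the string below, yet r"^and" fails to match it. In that case we need to pass an extra flag to re.search.
```
import re

string = """happy day,
and beautiful
life."""
re.search(r"^and", string)              # None
re.search(r"^and", string, flags=re.M)  # match; flags=re.MULTILINE also works
```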

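Repeated matching

To repeat part of a pattern, we can follow it with:
- *, repeat zero or more times;
- +, repeat one or more times;
- {n,m}, repeat between n and m times;
- {n}, repeat exactly n times.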
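
The four quantifiers in the list above can be compared side by side. This is a small sketch on made-up strings in which the repeated item is the single character b:

```python
import re

# * : zero or more b's, so a lone "a" still matches.
star_zero = re.search(r"ab*", "a").group()
star_many = re.search(r"ab*", "abbbb").group()

# + : at least one b is required.
plus_none = re.search(r"ab+", "a")
plus_many = re.search(r"ab+", "abbbb").group()

# {n,m} : between n and m b's; matching is greedy, so it takes up to m.
range_hit = re.search(r"ab{2,3}", "abbbbb").group()
range_miss = re.search(r"ab{2,3}", "ab")

# {n} : exactly n b's.
exact = re.search(r"ab{2}", "abbbbb").group()

print(star_zero, star_many, plus_none, plus_many, range_hit, range_miss, exact)
```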

Groups

We can use () to group parts of a pattern and then pick out each part of what was found with match.group(index).

    match = re.search(r"(\d+), Date:(.+)", "ID:021523, Date: Feb/12/2017")
    match.group()   # '021523, Date: Feb/12/2017'
    match.group(1)  # '021523'
    match.group(2)  # ' Feb/12/2017' (includes the space after 'Date:')

Sometimes a numeric index alone makes it hard to find the group you want. In that case a name can stand in for the index, by using ?P<name>.

    match = re.search(r"(?P<id>\d+), Date:(?P<date>.+)", "ID:021523, Date: Feb/12/2017")
    match.group()        # '021523, Date: Feb/12/2017'
    match.group("id")    # '021523'
    match.group("date")  # ' Feb/12/2017' (includes the space after 'Date:')
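
Because (.+) directly follows `Date:` in the pattern above, the captured date begins with a space. One possible way around that, sketched here as a variation rather than as the tutorial's own approach, is to consume the whitespace with `\s*` before the group. The match object's groups() and groupdict() methods then return all groups at once:

```python
import re

line = "ID:021523, Date: Feb/12/2017"

# \s* swallows the optional whitespace, so the date group starts at "Feb".
match = re.search(r"(?P<id>\d+), Date:\s*(?P<date>.+)", line)

all_groups = match.groups()    # every group as a tuple, in order
by_name = match.groupdict()    # named groups as a dict

print(all_groups)
print(by_name)
```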

findall

Every method so far returns only the first item that matches. To get all matching items, we use findall. The symbol | means "or".

    # findall
    re.findall(r"r[ua]n", "run ran ren")     # ['run', 'ran']

    # or, "|"
    re.findall(r"(run|ran)", "run ran ren")  # ['run', 'ran']
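
Returning to the page-title scenario from the introduction, findall combined with one group could collect every title in a document. The sketch below rests on assumptions: the HTML string is made-up sample data, and the non-greedy `.*?` and the re.S flag (which lets `.` match newlines) go beyond what these notes cover. Regular expressions are also only a rough tool for real-world HTML.

```python
import re

# Hypothetical sample input standing in for downloaded pages.
html = """
<title>First page</title>
<p>body</p>
<title>
   ... I am handsome ...
</title>
"""

# With one group in the pattern, findall returns the group contents only.
# .*? stops at the nearest </title>; re.S allows the match to span lines.
titles = re.findall(r"<title>(.*?)</title>", html, flags=re.S)
titles = [t.strip() for t in titles]

print(titles)
```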

replace

Regular expressions can replace characters as well as match them, with re.sub(). Python's built-in str.replace() does something similar, but only for a fixed substring.

    re.sub(r"r[au]ns", "catches", "dog runs to cat")  # 'dog catches to cat'
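
To make the difference from str.replace() concrete, here is a short sketch on a made-up sentence that contains both spellings the pattern accepts. The optional count argument of re.sub limits how many replacements are made:

```python
import re

text = "dog runs to cat, dog rans to cat"

# str.replace only knows the one literal substring it is given.
plain = text.replace("runs", "catches")

# re.sub replaces everything the pattern matches.
regex = re.sub(r"r[au]ns", "catches", text)

# count=1 stops after the first replacement.
once = re.sub(r"r[au]ns", "catches", text, count=1)

print(plain)
print(regex)
print(once)
```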

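split

Like Python's built-in str.split(" "), re.split() divides a string into pieces, here at every place the pattern matches. The separators go inside [] so that any one of them splits the string.

    re.split(r"[,;\.]", "a;b,c.d;e")  # ['a', 'b', 'c', 'd', 'e']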
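
Splitting on a pattern also helps when separators are irregular. The following sketch, with invented input strings, contrasts str.split(" ") with re.split on runs of whitespace and on separators surrounded by optional spaces:

```python
import re

messy = "dog  runs\tto\ncat"

# Splitting on a single space only handles that one character,
# so tabs and newlines stay inside the pieces.
by_space = "dog  runs to cat".split(" ")

# \s+ treats any run of whitespace as one separator.
by_whitespace = re.split(r"\s+", messy)

# Optional whitespace around , or ; is consumed with the separator.
fields = re.split(r"\s*[,;]\s*", "a , b;c ;  d")

print(by_space)
print(by_whitespace)
print(fields)
```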

compile

We can also compile a regular expression once with re.compile() and reuse it.

    comp = re.compile(r"r[ua]n")   # change the pattern here once; every use below picks it up
    comp.search("dog ran to cat")  # match object for 'ran'
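
A compiled pattern offers the same operations as the module-level functions, minus the pattern argument. A brief sketch with made-up input strings:

```python
import re

comp = re.compile(r"r[ua]n")

# The compiled object exposes search, findall, sub and split as methods.
first = comp.search("dog ran to cat").group()
every = comp.findall("run ran ren")
swapped = comp.sub("walk", "dogs run, cats ran")

print(first, every, swapped)
```

Compiling once keeps the pattern in a single place, which is convenient when the same expression is applied to many strings.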

...END...

Website: RegEx_正则表达式.doc - ZhaochengLi/Zhaocheng-s GitHub Wiki
