Website: Web_Crawling.doc - ZhaochengLi/Zhaocheng-s GitHub Wiki

The material in these notes comes from 莫烦PYTHON. Sincere thanks to the instructor for the tutorial!

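These notes are for personal study only, with no commercial use. If you are interested, please check out 莫烦PYTHON, a professional teaching content creator!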

Outline:


Why Web-crawling?

  • Information gathering
  • Information processing, e.g., visualization and finding latent relationships

Understanding web page structure

  • HTML will be used heavily. CSS and JavaScript will also be mentioned.

  • An HTML document is built from many elements, such as <head> and <body>. All content is enclosed between an element's opening and closing tags, like

      <head> 
          ... contents ... 
      </head>
    
  • What web crawling mainly does is locate those elements and extract the information inside them.

  • We will use Python to accomplish this in two steps.

    • Use Python to fetch the source code of a website;
    • Match the elements in the source code using Python's regular expressions. This method is suitable for entry-level matching only. For more advanced needs, we will use BeautifulSoup.
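
The two steps above can be sketched with the standard library alone. This is a minimal illustration, not the tutorial's own code: the helper names `fetch_html`, `extract_title`, and `extract_links`, the `SAMPLE_HTML` string, and the assumption that the page is UTF-8 encoded are all made up for this example. The regular expressions are deliberately simple and will miss many real-world cases (single-quoted attributes, tags with attributes, and so on), which is part of why BeautifulSoup is preferred for anything beyond entry-level matching.

```python
import re
from urllib.request import urlopen


def fetch_html(url):
    """Step 1: download the page source (assumes the page is UTF-8 encoded)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_title(html):
    """Step 2a: return the text inside <title>...</title>, or None if absent."""
    match = re.search(r"<title>(.*?)</title>", html, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def extract_links(html):
    """Step 2b: return every double-quoted href value, in document order."""
    return re.findall(r'href="(.*?)"', html)


# A small hypothetical page, so the matching step can be tried without a network.
SAMPLE_HTML = """
<html>
  <head>
    <title>Sample page</title>
  </head>
  <body>
    <a href="https://example.com/a">First</a>
    <a href="https://example.com/b">Second</a>
  </body>
</html>
"""

title = extract_title(SAMPLE_HTML)
links = extract_links(SAMPLE_HTML)
print(title)
print(links)
```

To run it against a live page, pass the result of `fetch_html(some_url)` to the two extract functions in place of `SAMPLE_HTML`.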

Parsing web pages: basics

Parsing web pages: CSS

Parsing web pages: regular expressions

  • For better understanding and hands-on practice, the remaining topics are run in a Jupyter Notebook project.