Web Crawling - LZerApp/crawlerenv GitHub Wiki


Introduction

How Crawlers Work

Simple Crawlers

Requests

GET Request with Params

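import requests

# Placeholders: substitute your own API key and video ID.
access_key = 'YOUR_API_KEY'
video_id = 'YOUR_VIDEO_ID'

params = {
    'key': access_key,
    'part': 'snippet,replies',
    'videoId': video_id,
    'maxResults': 100
}
# Pass only the base URL; requests URL-encodes params and appends them as the
# query string, so encoding them into the URL as well would duplicate them.
url = 'https://www.googleapis.com/youtube/v3/commentThreads/'
response = requests.get(url, params=params)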
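As a minimal sketch of what happens to `params`, the query string can be reproduced with the standard library's `urllib.parse.urlencode`. The key and video ID below are placeholder values, not real credentials, and `build_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_url(base_url: str, params: dict) -> str:
    """Append URL-encoded query parameters to base_url (hypothetical helper)."""
    return f"{base_url}?{urlencode(params)}"


# Placeholder values for illustration only.
params = {
    "key": "YOUR_API_KEY",
    "part": "snippet,replies",
    "videoId": "YOUR_VIDEO_ID",
    "maxResults": 100,
}
url = build_url("https://www.googleapis.com/youtube/v3/commentThreads", params)

# The comma in 'snippet,replies' is percent-encoded as %2C.
print(url)

# Decoding the query string recovers the original values (as strings).
decoded = parse_qs(urlsplit(url).query)
```

Passing the same dictionary to `requests.get(base_url, params=params)` yields an equivalent query string, so the manual step is only needed when a library does not encode parameters for you.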

GraphQL

Note that when calling a GraphQL API, you must use the POST method and send the request payload as JSON:

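import requests

# NOTE: the mutation below is illustrative; it must match the schema of the
# endpoint you call. GitHub's GraphQL endpoint also requires an
# Authorization header carrying a token.
url = "https://api.github.com/graphql"
query = """
    mutation CreateCustomer($input: CustomerInput) {
        customerCreate(customerData: $input) {
            customer {
                name
            }
        }
    }
"""
# Hypothetical input object; replace with the fields your schema expects.
customer = {'name': 'Example Customer'}
variables = {'input': customer}

# json= serializes the payload and sets the Content-Type header to application/json.
response = requests.post(url, json={'query': query, 'variables': variables})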
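A GraphQL server typically answers with HTTP 200 even when the operation fails, and reports problems in an `errors` list next to `data`, so the decoded body is worth checking. The sketch below assumes the body has already been decoded into a dictionary (for example with `response.json()`); `unwrap_graphql` and `GraphQLError` are hypothetical names, and the sample payloads are made up.

```python
class GraphQLError(Exception):
    """Raised when a GraphQL response carries an 'errors' list (hypothetical)."""


def unwrap_graphql(payload: dict) -> dict:
    """Return payload['data'], or raise if the server reported errors."""
    errors = payload.get("errors")
    if errors:
        # Each error object carries a human-readable 'message' field.
        messages = "; ".join(e.get("message", "unknown error") for e in errors)
        raise GraphQLError(messages)
    return payload.get("data") or {}


# Example payloads shaped like GraphQL responses (made up for illustration).
ok = {"data": {"customerCreate": {"customer": {"name": "Example Customer"}}}}
failed = {"data": None, "errors": [{"message": "Field 'customerCreate' not found"}]}

name = unwrap_graphql(ok)["customerCreate"]["customer"]["name"]

try:
    unwrap_graphql(failed)
    error_text = None
except GraphQLError as exc:
    error_text = str(exc)
```

Checking `errors` separately from the HTTP status keeps transport failures and query failures apart, which tends to make crawler logs easier to read.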

BeautifulSoup

Splash

GET Request with Params

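import urllib.parse

import scrapy

# This fragment belongs inside a scrapy.Spider method such as start_requests(),
# where self.access_key and self.video_id are attributes set on the spider.
params = {
    'key': self.access_key,
    'part': 'snippet,replies',
    'videoId': self.video_id,
    'maxResults': 100
}
# scrapy.Request has no params argument, so the query string is encoded by hand.
url = f'https://www.googleapis.com/youtube/v3/commentThreads/?{urllib.parse.urlencode(params)}'
request = scrapy.Request(url, callback=self.parse)
yield request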
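The `self.parse` callback then receives the JSON body. Here is a rough sketch of the extraction step, kept free of Scrapy so it can run on its own. It assumes the response follows the `commentThreads` list shape (`items[].snippet.topLevelComment.snippet.textDisplay`, plus an optional `nextPageToken`); `extract_comments` is a hypothetical helper and the sample body is made up.

```python
import json


def extract_comments(body: str):
    """Return (comment_texts, next_page_token) from a commentThreads JSON body."""
    data = json.loads(body)
    texts = []
    for item in data.get("items", []):
        # Walk the nested structure defensively; missing keys yield no text.
        top = item.get("snippet", {}).get("topLevelComment", {})
        text = top.get("snippet", {}).get("textDisplay")
        if text is not None:
            texts.append(text)
    # nextPageToken is absent on the last page, so this may be None.
    return texts, data.get("nextPageToken")


# Made-up sample body for illustration.
sample = json.dumps({
    "nextPageToken": "TOKEN_2",
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "First!"}}}},
        {"snippet": {}},
    ],
})
texts, token = extract_comments(sample)
```

Inside a spider, the callback could call `extract_comments(response.text)` and, when a token is returned, yield another `scrapy.Request` with `pageToken` added to the parameters.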

Selenium

Crawler Frameworks

Advanced Crawlers

References
