scrapyNote

Preliminary notes on Scrapy

Create a project

$ scrapy startproject my_crawler

This command creates a project named "my_crawler" in the current directory, with the following layout:

$ tree my_crawler
my_crawler
├── my_crawler
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2 directories, 7 files

What the files and directories are for

scrapy.cfg              # deploy configuration file

my_crawler/             # project's Python module, you'll import your code from here
	__init__.py
	items.py            # project items definition file
	middlewares.py      # project middlewares file
	pipelines.py        # project pipelines file
	settings.py         # project settings file
	spiders/            # a directory where you'll later put your spiders
		__init__.py
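For example, settings.py holds project-wide configuration. A minimal sketch of the kind of options that typically go there (these are standard Scrapy settings, not taken from the original note):

# my_crawler/settings.py (excerpt)
BOT_NAME = 'my_crawler'

SPIDER_MODULES = ['my_crawler.spiders']
NEWSPIDER_MODULE = 'my_crawler.spiders'

ROBOTSTXT_OBEY = True   # respect robots.txt on the target site
DOWNLOAD_DELAY = 1      # wait 1 second between requests to be polite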

Define the fields to scrape

Edit items.py and add the following fields to the MyCrawlerItem class:

import scrapy

class MyCrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()    # article title
    url = scrapy.Field()      # article URL
    summary = scrapy.Field()  # article summary
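An Item behaves like a dict with a fixed set of keys. A quick sketch (my own example, not from the original note) of how a MyCrawlerItem is used:

from my_crawler.items import MyCrawlerItem

item = MyCrawlerItem(title='Hello Flask')    # fields can be set at construction...
item['url'] = 'http://www.bjhee.com/hello'   # ...or assigned dict-style
print(item['title'])                         # and read dict-style
# item['author'] = 'x'  would raise KeyError: only declared fields are allowed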

Write the page-parsing code

Under the "my_crawler/spiders" directory, create a file named "crawl_spider.py" (the file name is arbitrary). The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from my_crawler.items import MyCrawlerItem

class MyCrawlSpider(CrawlSpider):
    name = 'my_crawler'               # spider name; must be unique, used when running the crawl command
    allowed_domains = ['bjhee.com']   # restrict crawling to these domains; more than one may be listed
    start_urls = [
        "http://www.bjhee.com",       # seed URLs; more than one may be listed
    ]

    rules = (    # map URL patterns to parsing callbacks; more than one rule may be listed
        Rule(LinkExtractor(allow=r'/page/[0-9]+'),  # regex for URLs the crawl is allowed to follow
             callback='parse_item',                 # name of the callback that parses matching pages
             follow=True),
    )

    def parse_item(self, response):
        # Locate the article entries via XPath
        articles = response.xpath('//*[@id="main"]/ul/li')

        for article in articles:
            item = MyCrawlerItem()
            item['title'] = article.xpath('h3[@class="entry-title"]/a/text()').extract()[0]
            item['url'] = article.xpath('h3[@class="entry-title"]/a/@href').extract()[0]
            item['summary'] = article.xpath('div[2]/p/text()').extract()[0]
            yield item
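One caveat worth noting (my addition, not from the original note): extract()[0] raises IndexError whenever an XPath expression matches nothing. Scrapy's extract_first() returns None instead, or a default you supply, which makes parse_item more robust:

# Safer variants of the same extractions:
item['title'] = article.xpath('h3[@class="entry-title"]/a/text()').extract_first()
item['summary'] = article.xpath('div[2]/p/text()').extract_first(default='')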

The official documentation explains:

name identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

Note to self: I'm not yet familiar with XPath.
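Since that's the weak spot, here is a tiny self-contained sketch (my own example, not from the original note) of the XPath patterns the spider above relies on, using scrapy.Selector on hand-written HTML:

from scrapy import Selector

html = '''
<ul>
  <li><h3 class="entry-title"><a href="/hello">Hello Flask</a></h3>
      <div></div><div><p>A short summary.</p></div></li>
</ul>
'''
sel = Selector(text=html)
for li in sel.xpath('//ul/li'):
    print(li.xpath('h3[@class="entry-title"]/a/text()').extract_first())  # Hello Flask
    print(li.xpath('h3[@class="entry-title"]/a/@href').extract_first())   # /hello
    print(li.xpath('div[2]/p/text()').extract_first())                    # A short summary.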

Test the spider

$ scrapy crawl my_crawler
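Before running a full crawl, the XPath expressions can be tried interactively with scrapy shell, a standard Scrapy command (my suggestion, not from the original note); it fetches the page and exposes a ready-made response object:

$ scrapy shell "http://www.bjhee.com"
>>> response.xpath('//*[@id="main"]/ul/li')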

Save the scraped content as a JSON file

$ scrapy crawl my_crawler -o my_crawler.json -t json

In the current directory you will find the file my_crawler.json, which holds the scraped fields. (With recent Scrapy versions the -t json flag is unnecessary; the output format is inferred from the .json extension.)
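To sanity-check the output, the file can be loaded back with the standard json module (a small sketch of my own, assuming the crawl produced my_crawler.json):

import json

with open('my_crawler.json') as f:
    articles = json.load(f)   # Scrapy's JSON exporter writes a single list of objects

for a in articles[:3]:
    print(a['title'], '->', a['url'])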
