Scrapy: preliminary notes
Create a project
$ scrapy startproject my_crawler
This command creates a project named "my_crawler" under the current directory, with the following directory structure:
$ tree my_crawler
my_crawler
├── my_crawler
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2 directories, 7 files
Directory layout:
scrapy.cfg            # deploy configuration file
my_crawler/           # project's Python module, you'll import your code from here
    __init__.py
    items.py          # project items definition file
    middlewares.py    # project middlewares file
    pipelines.py      # project pipelines file
    settings.py       # project settings file
    spiders/          # a directory where you'll later put your spiders
        __init__.py
Define the fields to scrape
Edit items.py and add the following fields to the MyCrawlerItem class:
import scrapy

class MyCrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()    # article title
    url = scrapy.Field()      # article URL
    summary = scrapy.Field()  # article summary
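For reference, an instantiated Item behaves much like a Python dictionary; a quick sketch (the values below are made-up placeholders, not scraped data):

item = MyCrawlerItem(title=u'Example post', url='http://www.bjhee.com/example')
item['summary'] = u'A short summary.'   # fields can also be set after construction
print(item['title'])                    # read a field like a dict key
print(dict(item))                       # convert to a plain dict, e.g. for serialization
# item['author'] = 'me'                 # would raise KeyError: only declared fields are allowed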
编写网页解析代码
在”my_crawler/spiders”目录下,创建一个名为”crawl_spider.py”文件(文件名可以任意取)。代码如下
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from my_crawler.items import MyCrawlerItem

class MyCrawlSpider(CrawlSpider):
    name = 'my_crawler'                 # spider name; must be unique, used when running the crawl command
    allowed_domains = ['bjhee.com']     # restrict crawling to these domains; more than one may be listed
    start_urls = [
        "http://www.bjhee.com",         # seed URL(s); more than one may be listed
    ]
    rules = (                           # map URL patterns to parse callbacks; more than one Rule may be listed
        Rule(LinkExtractor(allow=r'/page/[0-9]+'),  # regex for the URLs that may be followed
             callback='parse_item',                 # name of the callback used to parse matching pages
             follow=True),
    )

    def parse_item(self, response):
        # locate the article nodes via XPath
        articles = response.xpath('//*[@id="main"]/ul/li')
        for article in articles:
            item = MyCrawlerItem()
            item['title'] = article.xpath('h3[@class="entry-title"]/a/text()').extract()[0]
            item['url'] = article.xpath('h3[@class="entry-title"]/a/@href').extract()[0]
            item['summary'] = article.xpath('div[2]/p/text()').extract()[0]
            yield item
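One note on the callback above: extract()[0] raises an IndexError whenever an expected element is missing. Scrapy's selectors also offer extract_first(), which returns None (or a supplied default) instead; a slightly more defensive variant of parse_item, assuming the same page structure:

    def parse_item(self, response):
        for article in response.xpath('//*[@id="main"]/ul/li'):
            item = MyCrawlerItem()
            # extract_first() returns None (or the given default) instead of raising when nothing matches
            item['title'] = article.xpath('h3[@class="entry-title"]/a/text()').extract_first()
            item['url'] = article.xpath('h3[@class="entry-title"]/a/@href').extract_first()
            item['summary'] = article.xpath('div[2]/p/text()').extract_first(default='')
            yield item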
Explanation from the official documentation:
name identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.
Still not very familiar with XPath; the scrapy shell session sketched below is handy for trying out selectors.
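The scrapy shell lets you run XPath expressions interactively against a page before putting them into the spider. A quick session against the seed URL might look like this (output omitted; the expressions are the same ones used in parse_item):

$ scrapy shell "http://www.bjhee.com"
>>> response.xpath('//*[@id="main"]/ul/li')                                               # the article <li> nodes
>>> response.xpath('//*[@id="main"]/ul/li/h3[@class="entry-title"]/a/text()').extract()   # all titles on the page
>>> response.xpath('//*[@id="main"]/ul/li/h3[@class="entry-title"]/a/@href').extract()    # all article URLs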
Test the crawler
$ scrapy crawl my_crawler
Save the scraped content to a JSON file
$ scrapy crawl my_crawler -o my_crawler.json -t json
In the current directory you will find my_crawler.json, which contains the scraped field data.
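If more control over storage is needed, the pipelines.py module from the project layout can be used instead of the -o flag. A minimal sketch of a JSON-lines writing pipeline (the class name and output file are illustrative, not part of the original note):

# my_crawler/pipelines.py
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')     # one JSON object per line

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

To enable it, register the class in settings.py, e.g. ITEM_PIPELINES = {'my_crawler.pipelines.JsonWriterPipeline': 300}.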
Sources
- 使用Scrapy构建一个网络爬虫 (Building a Web Crawler with Scrapy)
- Scrapy 中文指南 (Scrapy Chinese Guide)
- scrapy.pdf, official documentation (archived)