
#Testing the Official Documentation#

##Creating a Project##

scrapy startproject tutorial
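
startproject creates a tutorial directory containing the project skeleton, roughly like this (the exact set of files varies with the Scrapy version):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spiders go
            __init__.py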

##Example Code##

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # Initial requests; each response is handed to self.parse
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save the raw HTML of each page to a local file
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
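
As an aside, the official tutorial also notes a shortcut: instead of implementing start_requests, you can define a start_urls class attribute and Scrapy will generate the initial requests itself, sending the responses to the parse callback by default. A minimal sketch:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # With start_urls defined, start_requests can be omitted;
    # responses are routed to parse() by default.
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)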

##Running the Spider##

To put our spider to work, go to the project’s top level directory and run:

$ scrapy crawl quotes

##Extracting Data##

Use the Scrapy shell to try out data extraction interactively:

scrapy shell 'http://quotes.toscrape.com/page/1/'
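
Inside the shell, response is already populated for the URL you passed. To load a different page without restarting, use the shell's fetch helper:

fetch('http://quotes.toscrape.com/page/2/')   # rebinds `response` to the new page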

Selecting elements with CSS:

In [1]: response.css('title')
Out[1]: [<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]

Extracting the text from the title above:

In [2]: response.css('title::text').extract()
Out[2]: [u'Quotes to Scrape']

Adding ::text to the CSS query selects only the text inside the <title> element. Without ::text, you get the full element:

In [3]: response.css('title').extract()
Out[3]: [u'<title>Quotes to Scrape</title>']

extract() returns a list because we are dealing with a SelectorList instance. When you only want the first result, use extract_first():

In [4]: response.css('title::text').extract_first()
Out[4]: u'Quotes to Scrape'

Another way to get the text directly:

In [5]: response.css('title::text')[0].extract()
Out[5]: u'Quotes to Scrape'

extract_first() avoids an IndexError: when nothing matches the selection, it returns None.
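
A minimal sketch of the difference, where 'noelement' stands for any hypothetical selector that matches nothing on the page:

response.css('noelement::text').extract_first()   # returns None, no exception
response.css('noelement::text')[0].extract()      # raises IndexError: list index out of range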

The lesson: scraping code should be resilient to errors, so that even when parts of a page fail to be scraped, the spider can still return at least some data.

Extracting with regular expressions:

In [6]: response.css('title::text').re(r'Quotes.*')
Out[6]: [u'Quotes to Scrape']

In [7]: response.css('title::text').re(r'Q\w+')
Out[7]: [u'Quotes']

In [8]: response.css('title::text').re(r'(\w+) to (\w+)')
Out[8]: [u'Quotes', u'Scrape']
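
Selectors also provide re_first(), the regex counterpart of extract_first(); a quick sketch based on the queries above:

response.css('title::text').re_first(r'Q\w+')   # u'Quotes'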

##XPath##

Besides CSS, Scrapy selectors also support XPath expressions:

In [9]: response.xpath('//title')
Out[9]: [<Selector xpath='//title' data=u'<title>Quotes to Scrape</title>'>]

In [10]: response.xpath('//title/text()').extract_first()
Out[10]: u'Quotes to Scrape'
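
As Out[1] earlier showed (descendant-or-self::title), Scrapy translates CSS queries to XPath under the hood. A rough XPath equivalent of the response.css('div.quote') query used below; exact class matching is a simplification that happens to work on this page:

response.xpath('//div[@class="quote"]')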

##Extracting Quotes and Authors##

$ scrapy shell 'http://quotes.toscrape.com'

In [1]: response.css("div.quote")

In [2]: quote = response.css("div.quote")[0]

In [3]: quote
Out[3]: <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>

Now extract the title, author, and tags from the quote object created above.

Extracting the title:

In [4]: title = quote.css("span.text::text").extract_first()

In [5]: title
Out[5]: u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'

Extracting the author:

In [6]: author = quote.css("small.author::text").extract_first()

In [7]: author
Out[7]: u'Albert Einstein'

Extracting the tags:

In [8]: tags = quote.css("div.tags a.tag::text").extract()

In [9]: tags
Out[9]: [u'change', u'deep-thoughts', u'thinking', u'world']

Iterating over all the quote elements and collecting them into Python dicts:

In [10]: for quote in response.css("div.quote"):
    ...:     text = quote.css("span.text::text").extract_first()
    ...:     author = quote.css("small.author::text").extract_first()
    ...:     tags = quote.css("div.tags a.tag::text").extract()
    ...:     print(dict(text=text, author=author, tags=tags))
    ...:
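
Combining the values extracted above, the first dict printed should look roughly like this (key order may differ):

{'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'author': u'Albert Einstein', 'tags': [u'change', u'deep-thoughts', u'thinking', u'world']}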

##Extracting Data in the Spider##

Back to the earlier test code: all it did was save each page's entire HTML, rather than extract specific data. Here is the spider rewritten to yield the extracted fields instead:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Yield one dict per quote instead of saving the raw HTML
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Re-run the spider:

$ scrapy crawl quotes

##Storing the Scraped Data##

Store the data with Feed exports:

scrapy crawl quotes -o quotes.json
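
The output format is inferred from the file extension, so other formats work the same way. Note that with plain JSON, -o appends to an existing file and can leave it malformed across runs, which makes JSON Lines the safer append-friendly choice:

scrapy crawl quotes -o quotes.jl    # JSON Lines: one JSON object per line
scrapy crawl quotes -o quotes.csv   # CSV output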

##Source##

* scrapy.pdf (official documentation, archived)