#Official Documentation Test#
##Creating a Project##
```shell
scrapy startproject tutorial
```
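This creates a `tutorial` directory with a project skeleton. Depending on the Scrapy version, the layout looks roughly like this:
```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module; you import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file (in newer Scrapy versions)
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
```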
##Code Example##
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save the raw HTML of each page to a local file.
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
```
##Running the Test##
Save this code as quotes_spider.py under the project's tutorial/spiders directory. To put the spider to work, go to the project's top-level directory and run:
```shell
$ scrapy crawl quotes
```
##Extracting Data##
The Scrapy shell is the best way to try out selectors:
```shell
scrapy shell 'http://quotes.toscrape.com/page/1/'
```
Selecting with CSS:
```python
In [1]: response.css('title')
Out[1]: [<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
```
Extract the text from this title:
```python
In [2]: response.css('title::text').extract()
Out[2]: [u'Quotes to Scrape']
```
Adding `::text` to the CSS query selects just the text inside the `<title>` element. Without `::text`, you get the full element including its tags:
```python
In [3]: response.css('title').extract()
Out[3]: [u'<title>Quotes to Scrape</title>']
```
extract() returns a list because we are dealing with a SelectorList instance. When you want just the first result:
```python
In [4]: response.css('title::text').extract_first()
Out[4]: u'Quotes to Scrape'
```
Another way to get the first text directly:
```python
In [5]: response.css('title::text')[0].extract()
Out[5]: u'Quotes to Scrape'
```
extract_first() avoids an IndexError: when nothing matches the selection, it returns None.
Lesson: for most scraping code you want it to be resilient to errors, so that even when parts of a page fail to be scraped, the spider can still return some data.
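A minimal, self-contained sketch of that resilience (the inline HTML is made up for illustration; no network needed): extract_first() returns None when nothing matches, while indexing the empty SelectorList raises IndexError.
```python
from scrapy.selector import Selector

# Hypothetical HTML snippet, not fetched from quotes.toscrape.com.
sel = Selector(text='<html><head><title>Quotes to Scrape</title></head></html>')

# No <h1> on the page: extract_first() returns None instead of raising,
# and it also accepts a default value.
print(sel.css('h1::text').extract_first())        # None
print(sel.css('h1::text').extract_first('n/a'))   # 'n/a'

# Indexing the empty SelectorList, by contrast, raises IndexError.
try:
    sel.css('h1::text')[0].extract()
except IndexError:
    print('no matching element')
```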
Extracting with regular expressions:
```python
In [6]: response.css('title::text').re(r'Quotes.*')
Out[6]: [u'Quotes to Scrape']

In [7]: response.css('title::text').re(r'Q\w+')
Out[7]: [u'Quotes']

In [8]: response.css('title::text').re(r'(\w+) to (\w+)')
Out[8]: [u'Quotes', u'Scrape']
```
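When only the first regex match is needed, SelectorList also provides re_first(). A quick sketch against an inline snippet:
```python
from scrapy.selector import Selector

sel = Selector(text='<title>Quotes to Scrape</title>')
# re() returns a list of all matches; re_first() returns just the first one.
print(sel.css('title::text').re(r'(\w+) to (\w+)'))   # ['Quotes', 'Scrape']
print(sel.css('title::text').re_first(r'Q\w+'))       # 'Quotes'
```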
##XPath##
Besides CSS, Scrapy selectors also support XPath expressions:
```python
In [9]: response.xpath('//title')
Out[9]: [<Selector xpath='//title' data=u'<title>Quotes to Scrape</title>'>]

In [10]: response.xpath('//title/text()').extract_first()
Out[10]: u'Quotes to Scrape'
```
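In fact, CSS selectors are translated to XPath under the hood, as the `xpath=...` repr in the earlier outputs shows. A small sketch of an equivalent pair, with the HTML made up for illustration:
```python
from scrapy.selector import Selector

sel = Selector(text='<div class="quote"><span class="text">Hello</span></div>')

# These two selections are equivalent; Scrapy compiles the CSS to XPath.
print(sel.css('span.text::text').extract_first())                 # 'Hello'
print(sel.xpath('//span[@class="text"]/text()').extract_first())  # 'Hello'
```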
##Extracting Quotes and Authors##
```shell
$ scrapy shell 'http://quotes.toscrape.com'
```
```python
In [1]: response.css("div.quote")

In [2]: quote = response.css("div.quote")[0]

In [3]: quote
Out[3]: <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>
```
Now extract the title, author, and tags from the quote object created above.
Extract the title:
```python
In [4]: title = quote.css("span.text::text").extract_first()

In [5]: title
Out[5]: u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'
```
Extract the author:
```python
In [6]: author = quote.css("small.author::text").extract_first()

In [7]: author
Out[7]: u'Albert Einstein'
```
Extract the tags:
```python
In [8]: tags = quote.css("div.tags a.tag::text").extract()

In [9]: tags
Out[9]: [u'change', u'deep-thoughts', u'thinking', u'world']
```
Iterate over all the quote elements and put them together into Python dicts:
```python
In [10]: for quote in response.css("div.quote"):
    ...:     text = quote.css("span.text::text").extract_first()
    ...:     author = quote.css("small.author::text").extract_first()
    ...:     tags = quote.css("div.tags a.tag::text").extract()
    ...:     print(dict(text=text, author=author, tags=tags))
    ...:
```
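Given the values extracted above, the first dict printed by that loop should look roughly like this (wrapped here for readability):
```python
{'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d',
 'author': u'Albert Einstein',
 'tags': [u'change', u'deep-thoughts', u'thinking', u'world']}
```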
##Extracting Data in the Spider##
Going back to the earlier test code: it saved the entire HTML of each page rather than extracting specific data. Let's change the parse callback to yield structured items instead:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
```
Re-run the spider:
```shell
$ scrapy crawl quotes
```
##Storing the Scraped Data##
Use Feed exports to store the scraped data:
```shell
scrapy crawl quotes -o quotes.json
```
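As the official tutorial notes, for historical reasons Scrapy appends to a given output file rather than overwriting it, so running this command twice against the same quotes.json leaves a broken JSON file. The JSON Lines format avoids that problem, since each record is a standalone JSON object on its own line:
```shell
scrapy crawl quotes -o quotes.jl
```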
##source##
- scrapy.pdf, the official documentation (archived)