xpath advanced - Shuang0420/Shuang0420.github.io GitHub Wiki
<div class="page-ctrl ctrl-app" id="recommendListPage"><a href="http://appstore.huawei.com:80/more/all/1">首页</a> <a href="http://appstore.huawei.com:80/more/all/1"><em class="arrow-grey-lt"> </em>上一页</a> <a href="http://appstore.huawei.com:80/more/all/1">1</a><span>2</span> <a href="http://appstore.huawei.com:80/more/all/3">3</a> <a href="http://appstore.huawei.com:80/more/all/4">4</a> <a href="http://appstore.huawei.com:80/more/all/5">5</a> <a href="http://appstore.huawei.com:80/more/all/3">下一页<em class="arrow-grey-rt"> </em></a> <a href="http://appstore.huawei.com:80/more/all/41">尾页</a>
'//div[@class="page-ctrl ctrl-app"]/a/em[@class="arrow-grey-rt"]/../@href'
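The `..` step in the expression above climbs from the matched `<em>` back to its parent `<a>`, which is how the "next page" link is picked out. A minimal sketch of that lookup, assuming `lxml` is installed; the markup is a trimmed copy of the pagination block with a closing `</div>` added, and the variable names are illustrative only:

```python
from lxml import html

# Trimmed copy of the pagination markup (closing </div> added for well-formedness).
snippet = """
<div class="page-ctrl ctrl-app" id="recommendListPage">
  <a href="http://appstore.huawei.com:80/more/all/1"><em class="arrow-grey-lt"> </em>上一页</a>
  <a href="http://appstore.huawei.com:80/more/all/1">1</a><span>2</span>
  <a href="http://appstore.huawei.com:80/more/all/3">3</a>
  <a href="http://appstore.huawei.com:80/more/all/3">下一页<em class="arrow-grey-rt"> </em></a>
  <a href="http://appstore.huawei.com:80/more/all/41">尾页</a>
</div>
"""

tree = html.fromstring(snippet)

# Find the right-arrow <em>, step up to its parent <a> with "..", then read @href.
next_page = tree.xpath(
    '//div[@class="page-ctrl ctrl-app"]/a/em[@class="arrow-grey-rt"]/../@href'
)
print(next_page)  # ['http://appstore.huawei.com:80/more/all/3']
```

Only the "next page" anchor contains the `arrow-grey-rt` element, so the result is a one-item list even though the same URL also appears on the plain "3" link.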
/AAA/XXX/preceding-sibling::* selects all preceding sibling elements of the /AAA/XXX node
root.xpath('//price[text()>30]/preceding-sibling::title | //price[text()>30]/following-sibling::title')
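Each branch of a `|` union is evaluated on its own, so a sibling lookup needs the full path on both sides of the bar. A minimal sketch, assuming `lxml` is installed; the `bookstore` document is made-up sample data, not taken from the original page:

```python
from lxml import etree

# Hypothetical sample data: <title> appears before <price> in two books, after it in one.
xml = """
<bookstore>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
  <book><price>45.00</price><title>XQuery Kick Start</title></book>
</bookstore>
"""
root = etree.fromstring(xml)

# Both branches start from //price[...], so titles on either side of the price are found.
expensive = root.xpath(
    '//price[text()>30]/preceding-sibling::title'
    ' | //price[text()>30]/following-sibling::title'
)
titles = [t.text for t in expensive]
print(titles)  # ['Learning XML', 'XQuery Kick Start']

# A bare second branch is evaluated relative to the context node (here the root
# element), which has no siblings, so it contributes nothing.
print(root.xpath('following-sibling::title'))  # []
```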
sel.xpath('//a[contains(@href, "image")]/img/@src').extract()
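`contains(@href, "image")` keeps only the links whose `href` has the substring `image` somewhere in it. A minimal sketch of the same expression run with `lxml` (assumed to be installed) against made-up markup; the file names are hypothetical:

```python
from lxml import html

# Hypothetical markup: two links point at image pages, one does not.
page = html.fromstring("""
<div>
  <a href="image1.html"><img src="image1_thumb.jpg"></a>
  <a href="about.html"><img src="logo.png"></a>
  <a href="image2.html"><img src="image2_thumb.jpg"></a>
</div>
""")

# Filter <a> elements by a substring of @href, then read the nested <img>'s @src.
srcs = page.xpath('//a[contains(@href, "image")]/img/@src')
print(srcs)  # ['image1_thumb.jpg', 'image2_thumb.jpg']
```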
| Prefix | Namespace | Usage |
| --- | --- | --- |
| re | http://exslt.org/regular-expressions | regular expressions |
>>> from scrapy import Selector
>>> doc = """
... <div>
... <ul>
... <li class="item-0"><a href="link1.html">first item</a></li>
... <li class="item-1"><a href="link2.html">second item</a></li>
... <li class="item-inactive"><a href="link3.html">third item</a></li>
... <li class="item-1"><a href="link4.html">fourth item</a></li>
... <li class="item-0"><a href="link5.html">fifth item</a></li>
... </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']
>>>
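Scrapy selectors accept the `re` prefix without any setup. Outside Scrapy the prefix has to be mapped to the EXSLT namespace by hand. A minimal sketch of the same query in plain `lxml` (assumed to be installed), passing the mapping through the `namespaces` argument of `xpath()`:

```python
from lxml import html

doc = """
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
"""
tree = html.fromstring(doc)

# Map the "re" prefix to the EXSLT regular-expressions namespace.
ns = {"re": "http://exslt.org/regular-expressions"}

# A raw string keeps the backslash in \d intact for the regex engine.
hrefs = tree.xpath(r'//li[re:test(@class, "item-\d$")]//@href', namespaces=ns)
print(hrefs)  # ['link1.html', 'link2.html', 'link4.html', 'link5.html']
```

The `item-inactive` entry is skipped because its class does not end in a digit.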