# Basics of scrapy
The following command creates template files for a new project:

```
scrapy startproject <project-name> [project-dir]
```
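For reference, `scrapy startproject project` generates roughly the following layout (file names are the Scrapy defaults; minor details vary by version):

```
project/
    scrapy.cfg            # deploy configuration
    project/
        __init__.py
        items.py          # item definitions (edited below)
        middlewares.py
        pipelines.py
        settings.py       # project settings (edited below)
        spiders/
            __init__.py
```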
Then, the following command creates template files for a spider. The `-t crawl` option adds a default rule for crawling.

```
scrapy genspider [-t <template(crawl)>] <spider-name> <target-domain>
```
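For example, the spider used later on this page could be generated with:

```
scrapy genspider -t crawl CB-PDF central-bank.org.tt
```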
- Define item names to retrieve information in the `<project-name>Item` class.
- In the following example (items.py), two items (`file_urls` and `file_names`) were added.
```python
import scrapy


class ProjectItem(scrapy.Item):
    # Fields are declared with scrapy.Field(); items behave like dicts.
    file_urls = scrapy.Field()
    file_names = scrapy.Field()
```
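Since items behave like dictionaries, they can be filled in and read with dict-style access; a minimal sketch (the URL is a placeholder, not from an actual crawl):

```python
item = ProjectItem()
item['file_urls'] = ['http://example.com/report.pdf']  # hypothetical URL
item['file_names'] = 'report.pdf'
print(dict(item))  # {'file_urls': [...], 'file_names': 'report.pdf'}
```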
- Define the rules of crawling and scraping in the `<spider-name>` class.
- In the following example (CB_PDF.py), `rules` and `parse_item` were modified. In this example, CSS selectors are used, but XPath is also available in Scrapy (see the sketch after the code).
```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from project.items import ProjectItem


class CbPdfSpider(CrawlSpider):
    name = 'CB-PDF'
    allowed_domains = ['central-bank.org.tt']
    start_urls = ['http://central-bank.org.tt/publications/latest-reports']

    # Follow links matched by the CSS selector and pass each page to parse_item.
    rules = (
        Rule(LinkExtractor(restrict_css='article>div>div>p a'), callback='parse_item'),
    )

    def parse_item(self, response):
        i = 0
        for quote in response.css('article'):
            if i == 0:  # only the first <article> on the page is used
                item = ProjectItem()
                file_url = quote.css('a[data-entity-type="file"]::attr(href)').get()
                file_url = response.urljoin(file_url)  # make relative URLs absolute
                item['file_urls'] = [file_url]
                item['file_names'] = file_url.split("/")[-1]  # file name from the URL
            i += 1
        yield item
```
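Since XPath is also supported, the CSS selector above could equivalently be written as an XPath expression; a sketch:

```python
# CSS:   quote.css('a[data-entity-type="file"]::attr(href)').get()
# XPath equivalent of the same selection:
file_url = quote.xpath('.//a[@data-entity-type="file"]/@href').get()
```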
- Modify some settings (`DEPTH_LIMIT` is important).
- In the following example (settings.py), `DOWNLOAD_DELAY` and `DEPTH_LIMIT` were modified. (Other modifications will be discussed later; they are not necessary at this stage.)
```python
DOWNLOAD_DELAY = 1  # wait 1 second between requests (politeness)
DEPTH_LIMIT = 3     # do not follow links more than 3 levels deep from the start URLs
```
There are two ways to run the crawler: (i) from the command line, or (ii) from a script file.
The following command runs the crawler. With the `-o <file-name>` option, the scraped result is saved in the specified format. With `-o stdout:`, the result is displayed in the command prompt.

```
scrapy crawl <spider-name> [-o <file-name or stdout:>] [-t <format(csv)>] [--nolog]
```
Note that `<spider-name>` is not the name of the spider's .py file, but the name defined in the spider class (e.g. `name = '<spider-name>'`).
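For instance, for the CB-PDF spider defined above (the output file name is illustrative):

```
scrapy crawl CB-PDF -o result.csv -t csv --nolog
```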
Alternatively, create a script file like this (crawl_test.py):
```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py and override the feed export options.
settings = get_project_settings()
settings.set('FEED_FORMAT', 'csv')
settings.set('FEED_URI', '<output-filename>.csv')

# Run the spider by the name defined in its class (name = '<spider-name>').
process = CrawlerProcess(settings)
process.crawl('<spider-name>')
process.start()
```
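Note that in newer Scrapy versions (2.1+), `FEED_FORMAT`/`FEED_URI` are deprecated in favour of the `FEEDS` setting; a sketch of the equivalent override:

```python
# Scrapy 2.1+ replacement for FEED_FORMAT / FEED_URI:
settings.set('FEEDS', {'<output-filename>.csv': {'format': 'csv'}})
```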
Then, run this file with the following command:

```
python <crawl-filename>.py
```
Finally, the Scrapy shell is useful for testing selectors interactively against a page:

```
scrapy shell <url>
response.css('<CSS selector>').extract_first()
exit()
```
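For example, against the start URL used above (the returned value is not reproduced here; `extract_first()` prints the first matching href, or `None`):

```
scrapy shell http://central-bank.org.tt/publications/latest-reports
>>> response.css('a[data-entity-type="file"]::attr(href)').extract_first()
>>> exit()
```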