Basics of scrapy - HidenobuTokuda/scrapy-tabula-tt GitHub Wiki

I. Minimum basics of Scrapy

1. Create project

The following command creates template files for a new project.

scrapy startproject <project-name> [project-dir]
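
For example (the project name project is assumed here so that it matches the from project.items import used later), running

scrapy startproject project

creates roughly the following layout:

project/
    scrapy.cfg
    project/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py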

2. Create spider

Then, the following command creates template files for a spider. The -t crawl option uses the crawl template, which includes a default Rule for following links.

scrapy genspider [-t <template(crawl)>] <spider-name> <target-domain>
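
For example, the CB_PDF.py spider shown later on this page could have been generated with a command along these lines (the crawl template, the CB_PDF name, and the central bank domain are assumptions based on the files that appear below):

scrapy genspider -t crawl CB_PDF central-bank.org.tt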

3. Modify three Python files

i. Modify items.py

  • Define the item fields used to store the scraped information in the <project-name>Item class.

  • In the following example (items.py), two fields (file_urls and file_names) are added.

import scrapy

class ProjectItem(scrapy.Item):
    file_urls = scrapy.Field()
    file_names = scrapy.Field()
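
As a side note, file_urls is also the default input field of Scrapy's built-in FilesPipeline. If you later want Scrapy to download the collected URLs automatically (not covered in this section), a minimal sketch of the extra settings.py entries would be:

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloads'  # assumed local folder for the downloaded files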

ii. Modify <spider-name>.py

  • Define the crawling rules and the scraping logic in the <spider-name> class.
  • In the following example (CB_PDF.py), rules and parse_item are modified. CSS selectors are used here, but Scrapy also supports XPath (a short sketch follows the code below).

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from project.items import ProjectItem

class CbPdfSpider(CrawlSpider):
    name = 'CB-PDF'
    allowed_domains = ['central-bank.org.tt']
    start_urls = ['http://central-bank.org.tt/publications/latest-reports']

    rules = (
        Rule(LinkExtractor(restrict_css='article>div>div>p a'), callback='parse_item'),
    )

    def parse_item(self, response):
        # Only the first <article> element on the page is expected to hold the PDF link,
        # so build and yield a single item for it.
        i = 0
        for quote in response.css('article'):
            if i == 0:
                item = ProjectItem()
                # Extract the PDF link, make it absolute, and keep the file name.
                file_url = quote.css('a[data-entity-type="file"]::attr(href)').get()
                file_url = response.urljoin(file_url)
                item['file_urls'] = [file_url]
                item['file_names'] = file_url.split("/")[-1]
                i += 1
                yield item
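
As mentioned above, XPath can be used instead of CSS selectors. A rough XPath equivalent of the extraction inside parse_item, assuming the same page markup, would be:

file_url = quote.xpath('.//a[@data-entity-type="file"]/@href').get()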

iii. Modify settings.py

  • Modify some settings (DEPTH_LIMIT is important to keep the crawl bounded).
  • In the following example (settings.py), DOWNLOAD_DELAY (seconds to wait between consecutive requests to the same site) and DEPTH_LIMIT (maximum link depth to follow from the start URLs) are modified. Other modifications are discussed later and are not necessary at this stage.

DOWNLOAD_DELAY = 1
DEPTH_LIMIT = 3

4. Run crawler

There are two ways to run the crawler: i. from the command line, or ii. from a script file.

i. Run from command line

The following command runs the crawler. With the -o <file-name> option, the scraped results are saved in the specified format. With -o stdout:, the results are printed to the command prompt.

scrapy crawl <spider-name> [-o <file-name or stdout:>] [-t <format(csv)>] [--nolog]

<spider-name> is not the name of the spider .py file, but the name defined in the spider class (e.g. name = '<spider-name>').
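
For the example spider above, a concrete invocation (result.csv is an arbitrary output name; the csv format is inferred from the extension) would be:

scrapy crawl CB-PDF -o result.csv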

ii. Run from script file

Create a script file like this (crawl_test.py):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

settings.set('FEED_FORMAT', 'csv')
settings.set('FEED_URI', '<output-filename>.csv')

process = CrawlerProcess(settings)
process.crawl('<spider-name>')
process.start()
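
Note that FEED_FORMAT and FEED_URI are the older feed-export settings; recent Scrapy versions (2.1 and later) deprecate them in favour of the single FEEDS setting. A rough equivalent, keeping the rest of the script unchanged, would be:

settings.set('FEEDS', {'<output-filename>.csv': {'format': 'csv'}})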

Then, run this file with the following command:

python <crawl-filename>.py

5. Debug

The Scrapy shell opens an interactive session for a given URL, which is useful for testing selectors before putting them in a spider:

scrapy shell <url>
response.css('<CSS selector>').extract_first()
exit()
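
For example, the selector used in parse_item can be tested against the start page like this (assuming the page structure has not changed since this page was written):

scrapy shell http://central-bank.org.tt/publications/latest-reports
response.css('article a[data-entity-type="file"]::attr(href)').extract_first()
exit()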