Big File Spider

class kingfisher_scrapy.base_spiders.big_file_spider.BigFileSpider(*args, **kwargs)[source]

This class makes it easy to collect data from a source that publishes very large packages. Each package is split into smaller packages of 100 releases or records each, so that users can process the resulting files without an iterative parser and without running out of memory.
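
The splitting itself is not implemented by subclasses (see the resize_package attribute below). As a rough illustration only, and not this class's implementation, the idea is to stream releases out of one huge package and regroup them into packages of 100, for example with the third-party ijson library. The file name, package metadata and split_package helper here are hypothetical:

import itertools
import json

import ijson  # streaming (iterative) JSON parser

def split_package(path, size=100):
    """Yield smaller release packages of at most `size` releases each."""
    with open(path, 'rb') as f:
        # Stream one release at a time from the top-level "releases" array,
        # so the whole file never has to be loaded into memory at once.
        releases = ijson.items(f, 'releases.item')
        while True:
            chunk = list(itertools.islice(releases, size))
            if not chunk:
                break
            # Each smaller package reuses the original package's metadata
            # (hypothetical values shown here).
            yield {'uri': 'placeholder', 'version': '1.1', 'releases': chunk}

if __name__ == '__main__':
    for i, package in enumerate(split_package('large_package.json')):
        with open(f'package-{i}.json', 'w') as f:
            json.dump(package, f)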

  1. Inherit from BigFileSpider

  2. Write a start_requests() method to request the large files

from kingfisher_scrapy.base_spiders import BigFileSpider
from kingfisher_scrapy.util import components

class MySpider(BigFileSpider):
    name = 'my_spider'

    def start_requests(self):
        yield self.build_request('https://example.com/api/package.json', formatter=components(-1))

Note

The concatenated_json = True, line_delimited = True and root_path class attributes are not supported, because this spider yields items whose data field has package and data keys.
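
As a rough, hypothetical illustration of that item shape (the keys follow the note above, but the example values and exact types are assumptions, not the pipeline's precise representation):

# Assumed rough shape of an item's data field: the original package metadata
# under 'package' and a chunk of up to 100 releases under 'data'.
item_data = {
    'package': {'uri': 'https://example.com/api/package.json', 'version': '1.1', 'releases': []},
    'data': [
        {'ocid': 'ocds-213czf-000-00001', 'id': '1', 'date': '2001-02-03T04:05:06Z'},
    ],
}

# A downstream consumer could recombine the two keys into a complete package.
package = dict(item_data['package'])
package['releases'] = item_data['data']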

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:
  Self

resize_package = True

classmethod from_crawler(crawler, *args, **kwargs)[source]

parse(response)[source]