Compressed File Spider

class kingfisher_scrapy.base_spiders.compressed_file_spider.CompressedFileSpider(*args, **kwargs)

This class makes it easy to collect data from ZIP or RAR files. It assumes all files in an archive have the same data type. Each file inside the archive is saved to disk; the archive file itself is not saved to disk.

  1. Inherit from CompressedFileSpider

  2. Set a data_type class attribute to the data type of the compressed files

  3. Optionally, add a resize_package = True class attribute to split large packages (e.g. greater than 100MB)

  4. Optionally, add a yield_non_archive_file = True class attribute if the spider requests both archive files and JSON files. Otherwise, the spider raises an UnknownArchiveFormatError exception.

  5. Write a start_requests() method to request the archive files

from kingfisher_scrapy.base_spiders import CompressedFileSpider
from kingfisher_scrapy.util import components

class MySpider(CompressedFileSpider):
    name = 'my_spider'

    # CompressedFileSpider
    data_type = 'release_package'

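    def start_requests(self):
        # The first argument is the URL of the archive file to request (empty here).
        # components(-1) is the formatter used to build the file name.
        yield self.build_request('', formatter=components(-1))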


The following are not supported if resize_package = True: concatenated_json = True, line_delimited = True, root_path, data_type = 'release' and data_type = 'record'.
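As a rough illustration of what splitting a large package can mean, the sketch below breaks one release package into smaller packages that each repeat the package metadata and carry a slice of the releases. It is an assumption-laden sketch, not the library's method: the function name resize_package_sketch, the max_items parameter and the chunk size are hypothetical, and the whole package is loaded in memory here, which a real implementation for very large files would presumably avoid.

```python
def resize_package_sketch(package, max_items=100, key='releases'):
    """Yield smaller packages with the same metadata and at most max_items entries each.

    Hypothetical helper for illustration only; not part of kingfisher_scrapy.
    """
    # Everything except the array being split is treated as package metadata.
    metadata = {name: value for name, value in package.items() if name != key}
    items = package.get(key, [])
    for start in range(0, len(items), max_items):
        smaller = dict(metadata)
        smaller[key] = items[start:start + max_items]
        yield smaller


package = {
    'uri': 'https://example.com/package.json',  # placeholder value
    'publisher': {'name': 'Example'},
    'releases': [{'id': str(i)} for i in range(5)],
}
parts = list(resize_package_sketch(package, max_items=2))
print([len(part['releases']) for part in parts])
```

A splitter of this kind needs a package with an array of releases or records to slice. That is one plausible reason a bare release or record, which has no such array, would not fit resize_package = True, though this page does not state the reason.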

dont_truncate = True
yield_non_archive_file = False
resize_package = False
file_name_must_contain = ''