Compressed File Spider

class kingfisher_scrapy.base_spiders.compressed_file_spider.CompressedFileSpider(*args, **kwargs)

This class makes it easy to collect data from ZIP or RAR files. It assumes all files in an archive have the same data type. Each file extracted from the archive is saved to disk; the archive file itself is not.

  1. Inherit from CompressedFileSpider

  2. Set a data_type class attribute to the data type of the compressed files

  3. Optionally, add a resize_package = True class attribute to split large packages (e.g. greater than 100MB)

  4. Optionally, add a yield_non_archive_file = True class attribute if the spider requests both archive files and JSON files. Without this attribute, the spider raises an UnknownArchiveFormatError exception when a response is not a ZIP or RAR file.

  5. Optionally, add a file_name_must_contain = 'text' class attribute to only decompress the files whose names contain the given text.

  6. Optionally, add a file_name_must_not_contain = 'text' class attribute to only decompress the files whose names do not contain the given text.

  7. Write a start_requests() method to request the archive files

from kingfisher_scrapy.base_spiders import CompressedFileSpider
from kingfisher_scrapy.util import components

class MySpider(CompressedFileSpider):
    name = 'my_spider'

    # CompressedFileSpider
    data_type = 'release_package'

    def start_requests(self):
        yield self.build_request('https://example.com/api/packages.zip', formatter=components(-1))
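
The effect of file_name_must_contain and file_name_must_not_contain can be pictured with a small standalone sketch. It uses only the standard library's zipfile module and is not Kingfisher Collect's implementation: select_member_names, build_sample_archive and the sample file names are hypothetical, and treating an empty string as "no filter" is an assumption based on the attributes' default values.

```python
import io
import zipfile


def select_member_names(archive_bytes, must_contain="", must_not_contain=""):
    """Return the names of the ZIP members that pass both name filters.

    Hypothetical helper: an empty filter string is assumed to mean "no filter".
    """
    selected = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories hold no data
            name = info.filename
            if must_contain and must_contain not in name:
                continue
            if must_not_contain and must_not_contain in name:
                continue
            selected.append(name)
    return selected


def build_sample_archive():
    """Build a small in-memory ZIP archive with made-up file names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("releases_2021.json", "{}")
        archive.writestr("releases_2022.json", "{}")
        archive.writestr("metadata.json", "{}")
    return buffer.getvalue()


sample = build_sample_archive()
print(select_member_names(sample, must_contain="releases"))
print(select_member_names(sample, must_not_contain="2021"))
```

In this sketch the two filters are independent, so setting both keeps only the names that contain the first text and do not contain the second.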

Note

If resize_package = True, the following are not supported: concatenated_json = True, line_delimited = True, root_path, data_type = 'release' and data_type = 'record'.
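
As a rough illustration of what splitting a large package involves, the sketch below divides one release package into smaller packages that share the same metadata. It is a minimal sketch under stated assumptions, not how Kingfisher Collect resizes packages: split_package and max_releases are hypothetical names, the batch size is arbitrary, and the whole package is loaded into memory, which a real implementation handling very large files would likely avoid.

```python
import json


def split_package(package, max_releases=100):
    """Yield smaller packages, each with at most max_releases releases.

    Hypothetical helper: every field other than "releases" is copied as-is.
    """
    metadata = {key: value for key, value in package.items() if key != "releases"}
    releases = package.get("releases", [])
    for start in range(0, len(releases), max_releases):
        smaller = dict(metadata)
        smaller["releases"] = releases[start:start + max_releases]
        yield smaller


package = json.loads(
    '{"uri": "https://example.com/package.json",'
    ' "releases": [{"id": "1"}, {"id": "2"}, {"id": "3"}]}'
)
parts = list(split_package(package, max_releases=2))
print([len(part["releases"]) for part in parts])  # prints [2, 1]
```

A sketch like this only makes sense for input that is a single package containing an array of items, which is one plausible reading of why the options listed in the note are incompatible with resizing.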

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:

Self

dont_truncate = True
yield_non_archive_file = False
resize_package = False
file_name_must_contain = ''
file_name_must_not_contain = ''
parse(response)