Compressed File Spider
- class kingfisher_scrapy.base_spiders.compressed_file_spider.CompressedFileSpider(*args, **kwargs)
This class makes it easy to collect data from ZIP or RAR files. It assumes all files have the same data type. Each compressed file is saved to disk. The archive file is not saved to disk.
- Inherit from CompressedFileSpider
- Set a data_type class attribute to the data type of the compressed files
- Optionally, add a resize_package = True class attribute to split large packages (e.g. greater than 100MB)
- Optionally, add a yield_non_archive_file = True class attribute if the spider requests both archive files and JSON files. Otherwise, the spider raises an UnknownArchiveFormatError exception.
- Write a start_requests() method to request the archive files
from kingfisher_scrapy.base_spiders import CompressedFileSpider
from kingfisher_scrapy.util import components


class MySpider(CompressedFileSpider):
    name = 'my_spider'

    # CompressedFileSpider
    data_type = 'release_package'

    def start_requests(self):
        yield self.build_request('https://example.com/api/packages.zip', formatter=components(-1))
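The behaviour described above (reading each compressed file out of an archive that is never itself written to disk, and either passing through or rejecting a non-archive response) can be illustrated with the standard library alone. This is a minimal sketch, not the spider's actual implementation: iter_archive_members and NotAnArchiveError are hypothetical names, it handles only ZIP (not RAR), and its yield_non_archive_file parameter merely mirrors the class attribute of the same name.

```python
import io
import zipfile


class NotAnArchiveError(Exception):
    """Hypothetical local stand-in for the spider's unknown-archive error."""


def iter_archive_members(body, yield_non_archive_file=False):
    """Yield (file_name, data) pairs from the raw bytes of a response body.

    The archive is only read from memory; nothing here writes it to disk.
    """
    buffer = io.BytesIO(body)
    if not zipfile.is_zipfile(buffer):
        if yield_non_archive_file:
            # Assumption: a non-archive body (e.g. plain JSON) is passed through as-is.
            yield None, body
            return
        raise NotAnArchiveError('response body is not a ZIP archive')
    with zipfile.ZipFile(buffer) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only files
            yield info.filename, archive.read(info)
```

Because the function is a generator, the error for a non-archive body is raised when iteration starts, not when the function is called.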
Note
concatenated_json = True, line_delimited = True, root_path, data_type = 'release' and data_type = 'record' are not supported if resize_package = True.
- dont_truncate = True
- yield_non_archive_file = False
- resize_package = False
- file_name_must_contain = ''
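The file_name_must_contain attribute is listed here without a description. Assuming, from its name alone, that it acts as a substring filter on the base names of the files inside an archive, its effect could be sketched as below. select_members is a hypothetical helper, not part of kingfisher_scrapy, and the choice to match against the base name instead of the full path is also an assumption.

```python
import posixpath


def select_members(file_names, file_name_must_contain=''):
    """Keep the archive member names whose base name contains the given substring.

    Assumption: matching is a plain substring test on the base name.
    """
    return [
        name
        for name in file_names
        # ZIP member names use forward slashes, hence posixpath.
        if file_name_must_contain in posixpath.basename(name)
    ]
```

Under this reading, the default value of '' keeps every file, since the empty string is a substring of any name.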