Compressed File Spider
- class kingfisher_scrapy.base_spiders.compressed_file_spider.CompressedFileSpider(*args, **kwargs)
Collect data from ZIP or RAR files.
It assumes all files have the same data type. Each file extracted from an archive is saved to disk; the archive file itself is not saved to disk.
1. Inherit from CompressedFileSpider
2. Set a data_type class attribute to the data type of the compressed files
3. Optionally, add a resize_package = True class attribute to split large packages (e.g. greater than 100 MB)
4. Optionally, add a yield_non_archive_file = True class attribute if the spider requests both archive files and JSON files. Otherwise, the spider raises an UnknownArchiveFormatError exception.
5. Optionally, add a file_name_must_contain = 'text' class attribute to only decompress the files whose paths contain the given text.
6. Optionally, add a file_name_must_not_contain = 'text' class attribute to only decompress the files whose paths do not contain the given text.
7. Optionally, add a skip_empty_releases = True class attribute to skip files with empty releases arrays.
8. Write a start() method to request the archive files
    from kingfisher_scrapy.base_spiders import CompressedFileSpider
    from kingfisher_scrapy.util import components

    class MySpider(CompressedFileSpider):
        name = 'my_spider'

        # CompressedFileSpider
        data_type = 'release_package'

        async def start(self):
            yield self.build_request('https://example.com/api/packages.zip', formatter=components(-1))
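The optional file_name_must_contain, file_name_must_not_contain and skip_empty_releases attributes can be pictured with a standard-library sketch. This is not the spider's actual implementation: select_members and build_demo_archive are hypothetical helpers, the archive contents are made up, and only ZIP files are handled. It assumes the filters are plain substring checks on each member's path, as the descriptions above suggest.

```python
import io
import json
import zipfile


def select_members(archive_bytes, must_contain='', must_not_contain='', skip_empty_releases=False):
    """Yield (path, data) for each ZIP member that passes the filters (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = info.filename
            # Assumption: an empty string means "no filter", mirroring the '' defaults.
            if must_contain and must_contain not in path:
                continue
            if must_not_contain and must_not_contain in path:
                continue
            data = archive.read(path)
            if skip_empty_releases:
                document = json.loads(data)
                # Skip only files whose "releases" key is present and is an empty array.
                if isinstance(document, dict) and document.get('releases') == []:
                    continue
            yield path, data


def build_demo_archive():
    """Build a small in-memory ZIP file with made-up contents."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('2020/releases_a.json', json.dumps({'releases': [{'id': '1'}]}))
        archive.writestr('2020/releases_b.json', json.dumps({'releases': []}))
        archive.writestr('2020/records_a.json', json.dumps({'records': [{'ocid': 'x'}]}))
    return buffer.getvalue()


demo = build_demo_archive()
with_releases = [path for path, _ in select_members(demo, must_contain='releases', skip_empty_releases=True)]
without_records = [path for path, _ in select_members(demo, must_not_contain='records')]
print(with_releases)
print(without_records)
```

In this sketch, the first call keeps only the non-empty release file, and the second keeps both release files while dropping the path that contains 'records'.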
Note
concatenated_json = True, line_delimited = True, root_path, data_type = 'release' and data_type = 'record' are not supported if resize_package = True.
- dont_truncate = True
- yield_non_archive_file = False
- file_name_must_contain = ''
- file_name_must_not_contain = ''
- skip_empty_releases = False
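The restriction in the Note can be restated as a small check. The helper below is hypothetical and not part of kingfisher_scrapy; it takes a plain dict of class attribute values and reports which settings the Note lists as unsupported together with resize_package = True. Treating any non-empty root_path as a conflict is an assumption.

```python
def resize_package_conflicts(attrs):
    """Return the names of settings that the Note says are unsupported with resize_package = True.

    Hypothetical helper: `attrs` is a plain dict of class attribute values.
    """
    if not attrs.get('resize_package'):
        return []
    conflicts = []
    for flag in ('concatenated_json', 'line_delimited'):
        if attrs.get(flag):
            conflicts.append(flag)
    # Assumption: any non-empty root_path counts as "root_path is set".
    if attrs.get('root_path'):
        conflicts.append('root_path')
    if attrs.get('data_type') in ('release', 'record'):
        conflicts.append('data_type')
    return conflicts


supported = resize_package_conflicts({'resize_package': True, 'data_type': 'release_package'})
unsupported = resize_package_conflicts({'resize_package': True, 'line_delimited': True, 'data_type': 'record'})
print(supported)
print(unsupported)
```

A check like this could run in a test suite over spider classes, so that an unsupported combination is caught before a crawl starts.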