Spider Middlewares

kingfisher_scrapy.spidermiddlewares.sample_filled(spider, number)
kingfisher_scrapy.spidermiddlewares.group_size(spider)
kingfisher_scrapy.spidermiddlewares.read_data_from_file_if_any(item)

class kingfisher_scrapy.spidermiddlewares.BaseSpiderMiddleware(crawler)

Base class for spider middlewares that need access to the spider instance.

classmethod from_crawler(crawler)

class kingfisher_scrapy.spidermiddlewares.ConcatenatedJSONMiddleware(crawler)

If the spider’s concatenated_json class attribute is True, yield each object of the File as a FileItem. Otherwise, yield the original item.

async process_spider_output(response, result)

Return a generator of FileItem objects, in which the data field is parsed JSON.
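
To illustrate what "concatenated JSON" means here, the following is a minimal standard-library sketch that yields each top-level value from a string in which JSON values follow one another with optional whitespace between them. The function name iter_concatenated_json is hypothetical, and this is not the middleware's actual implementation, which may use a streaming parser instead.

```python
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value in a string of concatenated JSON.

    Hypothetical helper for illustration only.
    """
    decoder = json.JSONDecoder()
    index = 0
    length = len(text)
    while index < length:
        # raw_decode() does not skip leading whitespace, so skip it here.
        while index < length and text[index].isspace():
            index += 1
        if index >= length:
            break
        # raw_decode() returns the parsed value and the index where it ended.
        value, index = decoder.raw_decode(text, index)
        yield value


objects = list(iter_concatenated_json('{"a": 1}{"b": 2}\n{"c": 3}'))
```

Each yielded value corresponds to what the description above calls the parsed JSON in a FileItem's data field.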

class kingfisher_scrapy.spidermiddlewares.LineDelimitedMiddleware(crawler)

If the spider’s line_delimited class attribute is True, yield each line of the File as a FileItem. Otherwise, yield the original item.

async process_spider_output(response, result)

Return a generator of FileItem objects, in which the data field is bytes.
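
As a rough sketch of line-delimited handling, the snippet below splits a bytes payload into one bytes value per line. The helper name iter_lines is hypothetical, and skipping blank lines is an assumption made for this example, not documented behavior of the middleware.

```python
import io


def iter_lines(data):
    """Yield each non-empty line of a bytes payload, without its line ending.

    Hypothetical helper; skipping blank lines is an assumption.
    """
    for line in io.BytesIO(data):
        line = line.rstrip(b"\r\n")
        if line:
            yield line


lines = list(iter_lines(b'{"a": 1}\n{"b": 2}\r\n\n{"c": 3}'))
```

Note that each line stays as bytes and is not parsed, which matches the description of the data field above.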

class kingfisher_scrapy.spidermiddlewares.ValidateJSONMiddleware(crawler)

If the spider’s validate_json class attribute is True, check if the item’s data field is valid JSON. If not, yield nothing. Otherwise, yield the original item.

async process_spider_output(response, result)

Return a generator of File or FileItem objects, in which the data field is valid JSON.
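
The validity check can be pictured with a small standard-library sketch. The function name is_valid_json is hypothetical, and the real middleware may validate differently (for example, with a faster parser); this only shows the "parse or drop" idea for str or bytes input.

```python
import json


def is_valid_json(data):
    """Return True if data (str or bytes) parses as JSON, False otherwise.

    Hypothetical helper for illustration only.
    """
    try:
        json.loads(data)
    except (ValueError, TypeError):
        # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
        return False
    return True


candidates = [b'{"a": 1}', b'{"a": ', b"[1, 2, 3]"]
kept = [data for data in candidates if is_valid_json(data)]
```

Here the malformed payload is dropped and the other two pass through unchanged, mirroring "yield nothing" versus "yield the original item".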

class kingfisher_scrapy.spidermiddlewares.RootPathMiddleware(crawler)

If the spider’s root_path class attribute is non-empty, replace the item’s data with the objects at that prefix; if there are multiple releases, records or packages at that prefix, combine them into packages in groups of 100, and update the item’s data_type if needed. Otherwise, yield the original item.

async process_spider_output(response, result)

Return a generator of File or FileItem objects, in which the data field is parsed JSON.

class kingfisher_scrapy.spidermiddlewares.AddPackageMiddleware(crawler)

If the spider’s data_type class attribute is “release” or “record”, wrap the item’s data in an appropriate package, and update the item’s data_type. Otherwise, yield the original item.

async process_spider_output(response, result)

Return a generator of File or FileItem objects, in which the data field is parsed JSON.
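
The wrapping step might look roughly like the sketch below. The function name add_package is hypothetical, the resulting data_type strings "release_package" and "record_package" are assumptions, and the package contains only the releases or records array, whereas a real package may carry additional metadata.

```python
def add_package(data, data_type):
    """Wrap a single release or record in a minimal package.

    Hypothetical helper; returns the (possibly wrapped) data and the
    (possibly updated) data_type. The "*_package" names are assumptions.
    """
    if data_type == "release":
        return {"releases": [data]}, "release_package"
    if data_type == "record":
        return {"records": [data]}, "record_package"
    # Any other data type passes through unchanged.
    return data, data_type


wrapped, wrapped_type = add_package({"ocid": "ocds-213czf-1"}, "release")
untouched, untouched_type = add_package({"releases": []}, "release_package")
```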

class kingfisher_scrapy.spidermiddlewares.ResizePackageMiddleware(crawler)

If the spider’s resize_package class attribute is True, split the package into packages of 100 releases or records each. Otherwise, yield the original item.

Optionally, implement an ocid_fallback method on the spider, which accepts a release (or record) and returns an ocid value, to be used if the ocid field is not set.

async process_spider_output(response, result)

Return a generator of FileItem objects, in which the data field is a string.

The spider must yield items whose data field has package and data keys.
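
The splitting behavior can be sketched as follows, assuming the package metadata is a dict and the releases (or records) are available as an iterable. The function name resize_package and its key and size parameters are hypothetical; the sketch ignores the ocid_fallback handling described above.

```python
import json
from itertools import islice


def resize_package(package, items, key="releases", size=100):
    """Yield JSON strings, each a copy of package holding up to size items.

    Hypothetical helper for illustration only.
    """
    iterator = iter(items)
    # islice() takes the next `size` items; an empty list ends the loop.
    while chunk := list(islice(iterator, size)):
        resized = dict(package)  # shallow copy of the package metadata
        resized[key] = chunk
        yield json.dumps(resized)


releases = ({"ocid": str(number)} for number in range(250))
packages = list(resize_package({"version": "1.1"}, releases))
sizes = [len(json.loads(text)["releases"]) for text in packages]
```

With 250 releases, this produces three packages of 100, 100 and 50 releases, each serialized to a string and each repeating the package metadata.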

class kingfisher_scrapy.spidermiddlewares.ReadDataMiddleware(crawler)

If the item’s data is a file descriptor, replace the item’s data with the file’s contents and close the file descriptor. Otherwise, yield the original item.

async process_spider_output(response, result)

Return a generator of File objects, in which the data field is bytes.
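
A minimal sketch of the read-and-close step, assuming that "file descriptor" refers to an open file-like object with read() and close() methods. The function name read_data is hypothetical, and detecting a file by the presence of a read attribute is an assumption of this example.

```python
import io


def read_data(data):
    """Return the contents of a file-like object and close it.

    Hypothetical helper; values that are not file-like are returned as-is.
    """
    if hasattr(data, "read"):
        try:
            return data.read()
        finally:
            # Close the file even if reading fails.
            data.close()
    return data


handle = io.BytesIO(b'{"a": 1}')
contents = read_data(handle)
passthrough = read_data(b'{"b": 2}')
```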

class kingfisher_scrapy.spidermiddlewares.HttpErrorMiddleware(crawler)

Handle HTTP errors raised by Scrapy’s HttpErrorMiddleware.

If is_http_retryable() returns True and the number of attempts is less than the spider’s max_attempts class attribute, retries the request, after waiting the number of seconds returned by get_retry_wait_time().

Otherwise, logs an error message.
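
The decision described above can be summarized in a small sketch. The function name decide and its parameters are hypothetical stand-ins: retryable represents the result of is_http_retryable(), wait_time represents the result of get_retry_wait_time(), and the returned tuples are only a way to make the two outcomes visible.

```python
def decide(attempts, max_attempts, retryable, wait_time):
    """Return ("retry", wait_time) or ("log_error", None).

    Hypothetical helper that mirrors the documented conditions only.
    """
    if retryable and attempts < max_attempts:
        return ("retry", wait_time)
    return ("log_error", None)


first_try = decide(attempts=1, max_attempts=5, retryable=True, wait_time=30)
exhausted = decide(attempts=5, max_attempts=5, retryable=True, wait_time=30)
not_retryable = decide(attempts=1, max_attempts=5, retryable=False, wait_time=30)
```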

process_spider_exception(response, exception)

class kingfisher_scrapy.spidermiddlewares.RetryDataErrorMiddleware

Retry a request up to 3 times, either when the spider raises a BadZipFile exception (on the assumption that the response was truncated) or when the spider raises a RetryableError exception.

process_spider_exception(response, exception)
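
To show why a truncated response tends to surface as a BadZipFile exception, the sketch below builds a small ZIP archive in memory, cuts it in half, and tries to open both versions with the standard zipfile module. The helper name is_bad_zip is hypothetical, and the retry logic itself is not reproduced here.

```python
import io
import zipfile


def is_bad_zip(payload):
    """Return True if opening the payload as a ZIP raises BadZipFile.

    Hypothetical helper for illustration only.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(payload)):
            return False
    except zipfile.BadZipFile:
        return True


# Build a complete archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.json", '{"a": 1}')
complete = buffer.getvalue()

# Simulate a truncated download: the end-of-archive directory is lost.
truncated = complete[: len(complete) // 2]

complete_is_bad = is_bad_zip(complete)
truncated_is_bad = is_bad_zip(truncated)
```

Because the ZIP directory is stored at the end of the file, a download that stops early usually cannot be opened at all, which is why retrying the request is a reasonable response.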