Base Spider

class kingfisher_scrapy.base_spiders.base_spider.BaseSpider(*args, **kwargs)[source]

With respect to the data’s source:

  • If the source can support from_date and until_date spider arguments:

    • Set a date_format class attribute to “date”, “datetime”, “year” or “year-month” (default “date”).

    • Set a default_from_date class attribute to a date (“YYYY-MM-DD”), datetime (“YYYY-MM-DDTHH:MM:SS”), year (“YYYY”) or year-month (“YYYY-MM”).

    • If the source stopped publishing, set a default_until_date class attribute to a date or datetime.

  • If the spider requires date parameters to be set, add a date_required = True class attribute, and set the date_format and default_from_date class attributes as above.

  • If the spider needs to parse the JSON response in its parse method, set dont_truncate = True.

Tip

If date_required is True, or if either the from_date or the until_date spider argument is set, then from_date defaults to the default_from_date class attribute, and until_date defaults to the get_default_until_date() return value (which is the current time, by default).
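
For example, a minimal sketch of a spider that supports and requires date arguments (the spider name and attribute values are hypothetical):

from kingfisher_scrapy.base_spiders import BaseSpider

class ExampleDateSpider(BaseSpider):
    # Hypothetical spider: the source accepts year-month values like 2015-01.
    name = 'example_date'

    date_format = 'year-month'
    default_from_date = '2015-01'
    # The source requires date parameters to be set.
    date_required = True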

With respect to the data’s format:

  • If the data is not encoded using UTF-8, set an encoding class attribute to its encoding.

  • If the data is concatenated JSON, add a concatenated_json = True class attribute.

  • If the data is line-delimited JSON, add a line_delimited = True class attribute.

  • If the data can be invalid JSON, add a validate_json = True class attribute.

  • If the data embeds OCDS data within other objects or arrays, set a root_path class attribute to the path to the OCDS data, e.g. 'releasePackage' or 'results.item'.

  • If the data is in CSV or XLSX format, add an unflatten = True class attribute to convert it to JSON using Flatten Tool’s unflatten function. To pass arguments to unflatten, set an unflatten_args dict.

  • If the data source uses OCDS 1.0, add an ocds_version = '1.0' class attribute. This is used for the Kingfisher Process extension.
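
For example, a minimal sketch combining some of these attributes (the spider and its attribute values are hypothetical):

from kingfisher_scrapy.base_spiders import BaseSpider

class ExampleFormatSpider(BaseSpider):
    # Hypothetical spider: the source responds with Latin-1-encoded OCDS 1.0
    # release packages nested in a "results" array.
    name = 'example_format'

    encoding = 'iso-8859-1'
    root_path = 'results.item'
    ocds_version = '1.0'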

With respect to support for Kingfisher Collect’s features:

  • If the spider doesn’t work with the pluck command, set a skip_pluck class attribute to the reason.
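
For example (the spider name and reason are hypothetical):

from kingfisher_scrapy.base_spiders import BaseSpider

class ExamplePluckSpider(BaseSpider):
    name = 'example_pluck'

    # A human-readable reason for skipping this spider when plucking.
    skip_pluck = 'Responses are too large to download quickly'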

VALID_DATE_FORMATS = {'date': '%Y-%m-%d', 'datetime': '%Y-%m-%dT%H:%M:%S', 'year': '%Y', 'year-month': '%Y-%m'}
date_required = False
dont_truncate = False
encoding = 'utf-8'
concatenated_json = False
line_delimited = False
validate_json = False
root_path = ''
unflatten = False
unflatten_args = {}
ocds_version = '1.1'
max_attempts = 1
retry_http_codes = []
available_steps = {'check', 'compile'}
date_format = 'date'
classmethod from_crawler(crawler, *args, **kwargs)[source]
is_http_success(response)[source]

Returns whether the response’s status is a 2xx code.

is_http_retryable(response)[source]

Returns whether the response’s status is retryable.

Set the retry_http_codes class attribute to a list of status codes to retry.
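
For example, a hypothetical spider that retries rate-limited and unavailable responses (assuming max_attempts sets the total number of attempts per request):

from kingfisher_scrapy.base_spiders import BaseSpider

class ExampleRetrySpider(BaseSpider):
    name = 'example_retry'

    # Retry requests whose responses have these HTTP status codes.
    retry_http_codes = [429, 503]
    # Hypothetical: allow up to 3 attempts per request.
    max_attempts = 3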

get_start_time(format)[source]

Returns the formatted start time of the crawl.
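
For example, to use the crawl’s start time within a spider method (assuming format is a strftime() format string; the format shown is illustrative):

start = self.get_start_time('%Y%m%d_%H%M%S')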

get_retry_wait_time(response)[source]

Returns the number of seconds to wait before retrying a URL.

build_request(url, formatter, **kwargs)[source]

Returns a Scrapy request, with a file name added to the request’s meta attribute. If the file name doesn’t have a .json, .csv, .xlsx, .rar or .zip extension, a .json extension is added.

If the last component of a URL’s path is unique, use it as the file name. For example:

>>> from kingfisher_scrapy.base_spiders import BaseSpider
>>> from kingfisher_scrapy.util import components
>>> url = 'https://example.com/package.json'
>>> formatter = components(-1)
>>> BaseSpider(name='my_spider').build_request(url, formatter=formatter).meta
{'file_name': 'package.json'}

To use a query string parameter as the file name:

>>> from kingfisher_scrapy.util import parameters
>>> url = 'https://example.com/packages?page=1&per_page=100'
>>> formatter = parameters('page')
>>> BaseSpider(name='my_spider').build_request(url, formatter=formatter).meta
{'file_name': 'page-1.json'}

To use a URL path component and a query string parameter as the file name:

>>> from kingfisher_scrapy.util import join
>>> url = 'https://example.com/packages?page=1&per_page=100'
>>> formatter = join(components(-1), parameters('page'))
>>> BaseSpider(name='my_spider').build_request(url, formatter=formatter).meta
{'file_name': 'packages-page-1.json'}

Parameters:
  • url (str) – the URL to request
  • formatter – a function that accepts a URL and returns a file name

Returns: a Scrapy request
Return type: scrapy.Request

build_file_from_response(response, /, *, data_type, **kwargs)[source]

Returns a File item to yield, based on the response to a request.

If the response body starts with a byte-order mark, it is removed.
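
For example, a minimal sketch of a parse method (the data_type value is illustrative; this pattern assumes one File item per response):

def parse(self, response):
    if self.is_http_success(response):
        yield self.build_file_from_response(response, data_type='release_package')
    else:
        yield self.build_file_error_from_response(response)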

build_file(*, file_name=None, url=None, data_type=None, data=None)[source]

Returns a File item to yield.

build_file_item(number, data, item)[source]

Returns a FileItem item to yield.

build_file_error_from_response(response, errors=None)[source]

Returns a FileError item to yield, based on the response to a request.

The errors keyword argument, if provided, must be a dict, and should set an http_code key.
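
For example, to attach the HTTP status code and extra context (the 'detail' key and its message are hypothetical):

yield self.build_file_error_from_response(
    response, errors={'http_code': response.status, 'detail': 'unexpected empty body'}
)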

classmethod get_default_until_date(spider)[source]

Returns the default_until_date class attribute if truthy. Otherwise, returns the current time.
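
For example, a hypothetical override for a source whose data is only available through the end of the previous year:

from datetime import datetime

from kingfisher_scrapy.base_spiders import BaseSpider

class ExampleDelayedSpider(BaseSpider):
    name = 'example_delayed'

    @classmethod
    def get_default_until_date(cls, spider):
        # Hypothetical rule: data is published through December 31 of last year.
        return datetime(datetime.now().year - 1, 12, 31)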