Base Spider

- class kingfisher_scrapy.base_spiders.base_spider.BaseSpider(*args, **kwargs)
Base class for all spiders.
With respect to the data’s source:

- If the source can support from_date and until_date spider arguments (see the sketch after the tip below):
  - Set a date_format class attribute to “date”, “datetime”, “year” or “year-month” (default “date”).
  - Set a default_from_date class attribute to a date (“YYYY-MM-DD”), datetime (“YYYY-MM-DDTHH:MM:SS”), year (“YYYY”) or year-month (“YYYY-MM”).
  - If the source stopped publishing, set a default_until_date class attribute to a date or datetime.
- If the spider requires date parameters to be set, add a date_required = True class attribute, and set the date_format and default_from_date class attributes as above.
- If the spider needs to parse the JSON response in its parse method, set dont_truncate = True (see the sketch after this list).
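For example, a minimal sketch of a spider whose parse method needs to read the JSON body (the spider name and pagination comment are assumptions, not from a real spider):

from kingfisher_scrapy.base_spiders import BaseSpider

class MySpider(BaseSpider):
    name = 'my_spider'
    dont_truncate = True  # parse() reads the JSON response itself

    def parse(self, response):
        data = response.json()
        # ... use the parsed data, e.g. to follow pagination ...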
Tip

If date_required is True, or if either the from_date or until_date spider arguments are set, then from_date defaults to the default_from_date class attribute, and until_date defaults to the get_default_until_date() return value (which is the current time, by default).
With respect to the data’s format:

- If the data is not encoded using UTF-8, set an encoding class attribute to its encoding.
- If the data is concatenated JSON, add a concatenated_json = True class attribute.
- If the data is line-delimited JSON, add a line_delimited = True class attribute.
- If the data can be invalid JSON, add a validate_json = True class attribute.
- If the data embeds OCDS data within other objects or arrays, set a root_path class attribute to the path to the OCDS data, e.g. 'releasePackage' or 'results.item' (see the sketch after this list).
- If the data is in CSV or XLSX format, add an unflatten = True class attribute to convert it to JSON using Flatten Tool’s unflatten function. To pass arguments to unflatten, set an unflatten_args dict.
- If the data source uses OCDS 1.0, add an ocds_version = '1.0' class attribute. This is used for the Kingfisher Process extension.
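For example, a hedged sketch of a spider for a source that serves line-delimited JSON in Latin-1, with each line embedding a release package (all values are illustrative):

from kingfisher_scrapy.base_spiders import BaseSpider

class MySpider(BaseSpider):
    name = 'my_spider'
    encoding = 'iso-8859-1'       # the source is not UTF-8
    line_delimited = True         # one JSON document per line
    root_path = 'releasePackage'  # the OCDS package is nested in each document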
With respect to support for Kingfisher Collect’s features:

- If the spider doesn’t work with the pluck command, set a skip_pluck class attribute to the reason, as in the sketch below.
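For example (the reason given is an assumption):

from kingfisher_scrapy.base_spiders import BaseSpider

class MySpider(BaseSpider):
    name = 'my_spider'
    # A human-readable reason why the pluck command is unsupported.
    skip_pluck = 'The API does not support sorting or filtering by date.'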
- VALID_DATE_FORMATS = {'date': '%Y-%m-%d', 'datetime': '%Y-%m-%dT%H:%M:%S', 'year': '%Y', 'year-month': '%Y-%m'}
- date_required = False
- dont_truncate = False
- encoding = 'utf-8'
- concatenated_json = False
- line_delimited = False
- validate_json = False
- root_path = ''
- unflatten = False
- unflatten_args = {}
- ocds_version = '1.1'
- max_attempts = 1
- retry_http_codes = []
- available_steps = {'check', 'compile'}
- date_format = 'date'
- is_http_retryable(response)

Return whether the response’s status is retryable.

Set the retry_http_codes class attribute to a list of status codes to retry.
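For example, a minimal sketch of a spider that retries throttled and transient server errors, assuming max_attempts bounds the number of tries per request (the status codes are illustrative):

from kingfisher_scrapy.base_spiders import BaseSpider

class MySpider(BaseSpider):
    name = 'my_spider'
    max_attempts = 3               # up to 3 tries per request (assumed semantics)
    retry_http_codes = [429, 503]  # retry rate-limited and unavailable responses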
- build_request(url, formatter, **kwargs)

Return a Scrapy request, with a file name added to the request’s meta attribute. If the file name doesn’t have a .json, .csv, .xlsx, .rar or .zip extension, it adds a .json extension.

If the last component of a URL’s path is unique, use it as the file name. For example:

>>> from kingfisher_scrapy.base_spiders import BaseSpider
>>> from kingfisher_scrapy.util import components
>>> url = 'https://example.com/package.json'
>>> formatter = components(-1)
>>> BaseSpider(name='my_spider').build_request(url, formatter=formatter).meta
{'file_name': 'package.json'}

To use a query string parameter as the file name:

>>> from kingfisher_scrapy.util import parameters
>>> url = 'https://example.com/packages?page=1&per_page=100'
>>> formatter = parameters('page')
>>> BaseSpider(name='my_spider').build_request(url, formatter=formatter).meta
{'file_name': 'page-1.json'}

To use a URL path component and a query string parameter as the file name:

>>> from kingfisher_scrapy.util import join
>>> url = 'https://example.com/packages?page=1&per_page=100'
>>> formatter = join(components(-1), parameters('page'))
>>> BaseSpider(name='my_spider').build_request(url, formatter=formatter).meta
{'file_name': 'packages-page-1.json'}
- Parameters:
url (str) – the URL to request
formatter – a function that accepts a URL and returns a file name
- Returns:
a Scrapy request
- Return type:
scrapy.Request
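Putting it together, a hedged sketch of yielding such a request from a spider’s start_requests method (the URL is illustrative):

from kingfisher_scrapy.base_spiders import BaseSpider
from kingfisher_scrapy.util import components

class MySpider(BaseSpider):
    name = 'my_spider'

    def start_requests(self):
        # The request's meta will carry file_name='package.json', per the first example above.
        yield self.build_request('https://example.com/package.json', formatter=components(-1))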
- build_file_from_response(response, /, *, data_type, **kwargs)
Return a File item to yield, based on the response to a request.
If the response body starts with a byte-order mark, it is removed.
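For example, a minimal sketch of a parse method that yields the whole response; the 'release_package' data type is an assumption about the source:

from kingfisher_scrapy.base_spiders import BaseSpider

class MySpider(BaseSpider):
    name = 'my_spider'

    def parse(self, response):
        # Yield the full response body as a File item.
        yield self.build_file_from_response(response, data_type='release_package')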
- build_file(*, file_name=None, url=None, data_type=None, data=None)
Return a File item to yield.
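For example, a hedged sketch of building File items from objects embedded in a response (the 'results' structure and the data type are assumptions):

import json

from kingfisher_scrapy.base_spiders import BaseSpider

class MySpider(BaseSpider):
    name = 'my_spider'

    def parse(self, response):
        for i, package in enumerate(response.json()['results']):
            yield self.build_file(
                file_name=f'result-{i}.json',
                url=response.request.url,
                data_type='release_package',        # assumed data type
                data=json.dumps(package).encode(),  # conservatively pass bytes
            )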