Utilities
- kingfisher_scrapy.util.components(start, stop=None)[source]
Returns a function that returns the selected non-empty path components, excluding the .json extension.
>>> components(-1)('http://example.com/api/planning.json')
'planning'
>>> components(-2, -1)('http://example.com/api/planning/package.json')
'planning'
- kingfisher_scrapy.util.parameters(*keys)[source]
Returns a function that returns the selected query string parameters.
>>> parameters('page')('http://example.com/api/packages.json?page=1')
'page-1'
>>> parameters('year', 'page')('http://example.com/api/packages.json?year=2000&page=1')
'year-2000-page-1'
- kingfisher_scrapy.util.join(*functions, extension=None)[source]
Returns a function that joins the given functions’ outputs.
>>> join(components(-1), parameters('page'))('http://example.com/api/planning.json?page=1')
'planning-page-1'
- kingfisher_scrapy.util.handle_http_error(decorated)[source]
A decorator for spider parse methods.
If is_http_success() returns True, yields from the decorated method.
If is_http_retryable() returns True and the number of attempts is less than the spider’s max_attempts class attribute, retries the request, after waiting the number of seconds returned by get_retry_wait_time().
Note
Scrapy always retries a connection error, like a DNS issue. Scrapy also retries an error code if it is one of RETRY_HTTP_CODES. To limit or disable this behavior, set or update the spider’s custom_settings class attribute. For example:
    custom_settings = {
        # Don't let Scrapy handle error codes.
        'RETRY_HTTP_CODES': [],
    }
Otherwise, yields a FileError using build_file_error_from_response().
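A minimal sketch of a spider method using the decorator. The spider class, name, and data_type value are illustrative (not taken from this page), and the import path for SimpleSpider may vary by version:
    from kingfisher_scrapy.base_spiders import SimpleSpider  # assumption: import path varies by version
    from kingfisher_scrapy.util import handle_http_error

    class ExampleSpider(SimpleSpider):
        name = 'example'
        data_type = 'release_package'

        @handle_http_error
        def parse(self, response):
            # Runs only for successful responses; the decorator handles
            # retries and FileError items for everything else.
            yield self.build_file_from_response(response, data_type=self.data_type)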
- kingfisher_scrapy.util.date_range_by_interval(start, stop, step)[source]
Yields date ranges from the start date to the stop date, in intervals of step days, in reverse chronological order.
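An illustrative sketch; the exact shape of each yielded range is an assumption (check the source), but the iteration order follows the description above:
    from datetime import date

    # Most recent interval first, per the reverse chronological order above.
    for interval in date_range_by_interval(date(2024, 1, 1), date(2024, 3, 1), 30):
        print(interval)  # assumption: each range is a (start, stop) pair of dates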
- kingfisher_scrapy.util.date_range_by_month(start, stop)[source]
Yields the first day of the month as a date, from the start to the stop dates, in reverse chronological order.
- kingfisher_scrapy.util.date_range_by_year(start, stop)[source]
Returns the year as an int, from the start to the stop years, in reverse chronological order.
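An indicative doctest (output not taken from the source):
>>> list(date_range_by_year(2018, 2020))
[2020, 2019, 2018]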
- kingfisher_scrapy.util.get_parameter_value(url, key)[source]
Returns the first value of the query string parameter.
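An indicative doctest (output not taken from the source):
>>> get_parameter_value('http://example.com/api/packages.json?page=1&page=2', 'page')
'1'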
- kingfisher_scrapy.util.replace_parameters(url, **kwargs)[source]
Returns a URL after updating the query string parameters’ values.
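An indicative doctest (output not taken from the source; parameter ordering in the result may vary):
>>> replace_parameters('http://example.com/api/packages.json?page=1', page=2)
'http://example.com/api/packages.json?page=2'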
- kingfisher_scrapy.util.append_path_components(url, path)[source]
Returns a URL after appending path components to its path.
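An indicative doctest (output not taken from the source):
>>> append_path_components('http://example.com/api', 'packages.json')
'http://example.com/api/packages.json'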
- kingfisher_scrapy.util.add_query_string(method, params)[source]
Returns a function that yields the requests yielded by the wrapped method, after updating the query string parameter values in each request’s URL.
- kingfisher_scrapy.util.add_path_components(method, path)[source]
Returns a function that yields the requests yielded by the wrapped method, after appending path components to each request’s URL.
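A hedged sketch of wrapping a request-yielding method with add_query_string. It assumes params is a mapping of parameter names to values and that the wrapper is called with the same arguments as the wrapped method; the actual calling convention may differ, so consult the source. add_path_components is used the same way, with a string of path components in place of params:
    import scrapy

    from kingfisher_scrapy.util import add_query_string

    def list_requests():
        yield scrapy.Request('http://example.com/api/packages')

    # Assumption: params is a dict-like mapping of query string parameters.
    wrapped = add_query_string(list_requests, {'page': '1'})
    for request in wrapped():
        print(request.url)  # expected: http://example.com/api/packages?page=1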
- kingfisher_scrapy.util.items_basecoro(target, prefix, map_type=None, skip_key=None)[source]
This is copied from ijson/common.py. A skip_key argument is added. If the skip_key is in the current path, the current event is skipped. Otherwise, the method is identical.
- kingfisher_scrapy.util.items(events, prefix, map_type=None, skip_key=None)[source]
This is copied from ijson/common.py. A skip_key argument is added, which is passed as a keyword argument to items_basecoro(). Otherwise, the method is identical.
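A hedged sketch of streaming parsing with the added skip_key. The JSON content and key names are illustrative, and the exact effect of skipping (dropping the "documents" events from each built release) is an assumption based on the description above:
    import io

    import ijson

    from kingfisher_scrapy.util import items

    data = io.BytesIO(b'{"releases": [{"ocid": "ocds-1", "documents": [{"id": "1"}]}]}')
    # Pass ijson's event stream, as with ijson.common.items, plus skip_key.
    for release in items(ijson.parse(data), 'releases.item', skip_key='documents'):
        print(release)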
- kingfisher_scrapy.util.default(obj)[source]
A default callback for a JSON encoder: converts decimals and iterables to JSON-serializable types, and returns the result.
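For example, used as the default callback of json.dumps (output indicative):
>>> import json
>>> from decimal import Decimal
>>> json.dumps({'amount': Decimal('1.5')}, default=default)
'{"amount": 1.5}'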
- kingfisher_scrapy.util.json_dumps(obj, **kwargs)[source]
Dumps JSON to a string, using an extended JSON encoder.
Use this method for JSON data read by ijson, which uses decimals for JSON numbers.
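An indicative doctest (exact whitespace in the output depends on the kwargs passed through to the encoder):
>>> from decimal import Decimal
>>> json_dumps({'amount': Decimal('1.5')})
'{"amount": 1.5}'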
- kingfisher_scrapy.util.json_dump(obj, f, **kwargs)[source]
Dumps JSON to a file, using an extended JSON encoder.
Use this method for JSON data read by ijson, which uses decimals for JSON numbers.
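A minimal sketch (the output file name is illustrative):
    from decimal import Decimal

    with open('package.json', 'w') as f:  # hypothetical output file
        json_dump({'amount': Decimal('1.5')}, f)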