
kingfisher_scrapy.util.components(start, stop=None)[source]

Returns a function that returns the selected non-empty path components, excluding the .json extension.

>>> components(-1)('http://example.com/api/planning.json')
>>> components(-2, -1)('http://example.com/api/planning/package.json')
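
As a rough illustration of how such a function could be built with the standard library, here is a minimal sketch. The name components_sketch and the use of '-' to join multiple components are assumptions made for this example, not the library's implementation.

```python
from urllib.parse import urlsplit


def components_sketch(start, stop=None):
    """Hypothetical stand-in for components(); not the library's code."""
    def wrapper(url):
        # Split the URL path and drop empty components (e.g. from a leading slash).
        parts = [part for part in urlsplit(url).path.split('/') if part]
        # Assumption: selected components are joined with a hyphen.
        value = '-'.join(parts[start:stop])
        # Exclude the .json extension, as the description states.
        if value.endswith('.json'):
            value = value[:-len('.json')]
        return value
    return wrapper
```

With this sketch, components_sketch(-1) applied to the first URL above selects the last path component without its extension.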

Returns a function that returns the selected query string parameters.

>>> parameters('page')('http://example.com/api/packages.json?page=1')
>>> parameters('year', 'page')('http://example.com/api/packages.json?year=2000&page=1')
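
The same idea can be sketched for query string parameters. The name parameters_sketch and the 'key-value' output format are assumptions for illustration only; the library's actual output format is not shown in this section.

```python
from urllib.parse import parse_qs, urlsplit


def parameters_sketch(*keys):
    """Hypothetical stand-in for parameters(); not the library's code."""
    def wrapper(url):
        query = parse_qs(urlsplit(url).query)
        # Assumption: each selected parameter is rendered as "key-value",
        # using the first value if the parameter is repeated.
        return '-'.join(f'{key}-{query[key][0]}' for key in keys)
    return wrapper
```
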
kingfisher_scrapy.util.join(*functions, extension=None)[source]

Returns a function that joins the given functions’ outputs and sets the file extension, if provided.

>>> join(components(-1), parameters('page'))('http://example.com/api/planning.json?page=1')
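
One way to picture the composition is the sketch below. The name join_sketch, the '-' separator and the way the extension is appended are assumptions, and the two lambdas are placeholders for functions such as those returned by components() and parameters().

```python
def join_sketch(*functions, extension=None):
    """Hypothetical stand-in for join(); not the library's code."""
    def wrapper(url):
        # Call each function on the URL and join the outputs.
        value = '-'.join(function(url) for function in functions)
        # Set the file extension only if one was provided.
        if extension:
            return f'{value}.{extension}'
        return value
    return wrapper


# Placeholder functions, standing in for components(-1) and parameters('page').
name = lambda url: 'planning'
page = lambda url: 'page-1'
```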

A decorator for spider parse methods.

If is_http_success() returns True, yields from the decorated method.

If is_http_retryable() returns True and the number of attempts is less than the spider’s max_attempts class attribute, retries the request, after waiting the number of seconds returned by get_retry_wait_time().


Scrapy always retries a connection error, like a DNS issue. Scrapy also retries an error code if it is one of RETRY_HTTP_CODES. To limit or disable this behavior, set or update the spider’s custom_settings class attribute. For example:

custom_settings = {
    # Don't let Scrapy handle error codes.
    'RETRY_HTTP_CODES': [],
}

Otherwise, yields a FileError using build_file_error_from_response().
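
The three branches described above can be sketched as a generator-wrapping decorator. Everything here is hypothetical: the name handle_http_error_sketch, the FakeSpider and FakeResponse classes, the attempts attribute, the method signatures, and the tuples that stand in for a retried request and a FileError. It shows only the control flow, not the library's code.

```python
from functools import wraps


def handle_http_error_sketch(decorated):
    """Hypothetical sketch of the branching logic; not the library's code."""
    @wraps(decorated)
    def wrapper(self, response):
        if self.is_http_success(response):
            yield from decorated(self, response)
        elif self.is_http_retryable(response) and response.attempts < self.max_attempts:
            # A real spider would re-issue the request after waiting this long.
            yield ('retry', self.get_retry_wait_time(response))
        else:
            # Stands in for build_file_error_from_response().
            yield ('file_error', response.status)
    return wrapper


class FakeResponse:
    """Minimal response object; the attempts attribute is an assumption."""
    def __init__(self, status, attempts=1):
        self.status = status
        self.attempts = attempts


class FakeSpider:
    max_attempts = 3

    def is_http_success(self, response):
        return 200 <= response.status < 300

    def is_http_retryable(self, response):
        return response.status == 429

    def get_retry_wait_time(self, response):
        return 5

    @handle_http_error_sketch
    def parse(self, response):
        yield ('item', response.status)
```

Because the wrapper is itself a generator, nothing runs until the caller iterates over the result, which matches how Scrapy consumes parse methods.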

kingfisher_scrapy.util.date_range_by_interval(start, stop, step)[source]

Yields date ranges from the start date to the stop date, in intervals of step days, in reverse chronological order.
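
A minimal sketch of this behavior, assuming each range is a (range start, range end) tuple of dates and that the last range is shortened to end at the start date. The name date_range_by_interval_sketch is hypothetical.

```python
from datetime import date, timedelta


def date_range_by_interval_sketch(start, stop, step):
    """Hypothetical stand-in; yields (range_start, range_end) tuples, newest first."""
    delta = timedelta(days=step)
    range_end = stop
    while range_end > start:
        # Never go earlier than the start date.
        range_start = max(start, range_end - delta)
        yield range_start, range_end
        range_end = range_start
```

For example, list(date_range_by_interval_sketch(date(2020, 1, 1), date(2020, 1, 10), 4)) produces three ranges, the last of which covers only one day.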

kingfisher_scrapy.util.date_range_by_month(start, stop)[source]

Yields the first day of each month as a date, from the start date to the stop date, in reverse chronological order.
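
A minimal sketch, assuming both the start month and the stop month are included. The name date_range_by_month_sketch is hypothetical.

```python
from datetime import date


def date_range_by_month_sketch(start, stop):
    """Hypothetical stand-in; yields the first day of each month, newest first."""
    year, month = stop.year, stop.month
    while (year, month) >= (start.year, start.month):
        yield date(year, month, 1)
        # Step back one month, wrapping from January to the previous December.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
```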

kingfisher_scrapy.util.date_range_by_year(start, stop)[source]

Returns the year as an int from the start to the stop years, in reverse chronological order.
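
A minimal sketch, assuming start and stop are ints and both are included. The name date_range_by_year_sketch is hypothetical.

```python
def date_range_by_year_sketch(start, stop):
    """Hypothetical stand-in; returns years from stop down to start, inclusive."""
    return range(stop, start - 1, -1)
```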

kingfisher_scrapy.util.get_parameter_value(url, key)[source]

Returns the first value of the query string parameter.
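
A minimal sketch using urllib.parse. The name get_parameter_value_sketch is hypothetical, and returning None for a missing parameter is an assumption.

```python
from urllib.parse import parse_qs, urlsplit


def get_parameter_value_sketch(url, key):
    """Hypothetical stand-in; returns the first value of the parameter, or None."""
    query = parse_qs(urlsplit(url).query)
    if key in query:
        return query[key][0]
    return None
```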

kingfisher_scrapy.util.replace_parameters(url, **kwargs)[source]

Returns a URL after updating the query string parameters’ values.
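
A minimal sketch using urllib.parse. The name replace_parameters_sketch is hypothetical, and this version only sets or overwrites values; any other behavior of the real function is not covered.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def replace_parameters_sketch(url, **kwargs):
    """Hypothetical stand-in; sets or overwrites query string parameters."""
    parsed = urlsplit(url)
    query = parse_qs(parsed.query)
    for key, value in kwargs.items():
        # parse_qs maps each key to a list of values, so store a one-item list.
        query[key] = [value]
    return parsed._replace(query=urlencode(query, doseq=True)).geturl()
```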

kingfisher_scrapy.util.append_path_components(url, path)[source]

Returns a URL after appending path components to its path.
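
A minimal sketch using urllib.parse. The name append_path_components_sketch is hypothetical, and the slash handling is an assumption.

```python
from urllib.parse import urlsplit


def append_path_components_sketch(url, path):
    """Hypothetical stand-in; appends path components, keeping the query string."""
    parsed = urlsplit(url)
    # Assumption: avoid a doubled slash where the two parts meet.
    new_path = parsed.path.rstrip('/') + '/' + path.lstrip('/')
    return parsed._replace(path=new_path).geturl()
```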

kingfisher_scrapy.util.add_query_string(method, params)[source]

Returns a function that yields the requests yielded by the wrapped method, after updating the query string parameter values in each request’s URL.

kingfisher_scrapy.util.add_path_components(method, path)[source]

Returns a function that yields the requests yielded by the wrapped method, after appending path components to each request’s URL.

kingfisher_scrapy.util.items_basecoro(target, prefix, map_type=None, skip_key=None)[source]

This is copied from ijson/common.py. A skip_key argument is added. If the skip_key is in the current path, the current event is skipped. Otherwise, the method is identical.

kingfisher_scrapy.util.items(events, prefix, map_type=None, skip_key=None)[source]

This is copied from ijson/common.py. A skip_key argument is added, which is passed as a keyword argument to items_basecoro(). Otherwise, the method is identical.


Dumps JSON to a string, converting decimals and iterables, and returns it.

kingfisher_scrapy.util.json_dumps(obj, **kwargs)[source]

Dumps JSON to a string, using an extended JSON encoder.

Use this method for JSON data read by ijson, which uses decimals for JSON numbers.
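
To show why an extended encoder is needed, here is a minimal sketch built on the standard json module, which raises TypeError for Decimal values by default. The names default_sketch and json_dumps_sketch are hypothetical, and converting decimals to float is an assumption that can lose precision.

```python
import json
from decimal import Decimal


def default_sketch(obj):
    """Hypothetical fallback for objects the json module cannot serialize."""
    if isinstance(obj, Decimal):
        # Assumption: floats are acceptable; this can lose precision.
        return float(obj)
    try:
        # Convert other iterables (e.g. generators) to lists.
        return list(obj)
    except TypeError:
        raise TypeError(f'{type(obj).__name__} is not JSON serializable') from None


def json_dumps_sketch(obj, **kwargs):
    """Hypothetical stand-in; dumps JSON using the fallback above."""
    return json.dumps(obj, default=default_sketch, **kwargs)
```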

kingfisher_scrapy.util.json_dump(obj, f, **kwargs)[source]

Dumps JSON to a file, using an extended JSON encoder.

Use this method for JSON data read by ijson, which uses decimals for JSON numbers.

class kingfisher_scrapy.util.TranscodeFile(file, encoding)[source]

Re-encodes bytes read from the file to UTF-8.

kingfisher_scrapy.util.transcode_bytes(data, encoding)[source]

Re-encodes bytes to UTF-8.
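
Re-encoding amounts to decoding with the source encoding and encoding again as UTF-8. The sketch below uses the hypothetical name transcode_bytes_sketch and does no error handling.

```python
def transcode_bytes_sketch(data, encoding):
    """Hypothetical stand-in; re-encodes bytes from the given encoding to UTF-8."""
    return data.decode(encoding).encode('utf-8')
```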

kingfisher_scrapy.util.transcode(spider, function, data, *args, **kwargs)[source]
kingfisher_scrapy.util.grouper(iterable, n, fillvalue=None)[source]

Given a filename, returns its name and extension as two separate strings.

>>> get_file_name_and_extension('test.json')
('test', 'json')