Index Spider

class kingfisher_scrapy.base_spiders.index_spider.IndexSpider(*args, **kwargs)

This class can be used to collect data from an API that includes the total number of results or pages in its response, and that accepts pagination parameters like page, or limit and offset. To create a spider that inherits from IndexSpider:

  1. Set class attributes. Either:

    1. Set page_count_pointer to the JSON Pointer for the total number of pages in the first response. The spider then yields a request for each page, incrementing a page query string parameter in each request.

    2. Set result_count_pointer to the JSON Pointer for the total number of results, and set limit to the number of results to return per page, or to the JSON Pointer for it. Optionally, set use_page = True to configure the spider to send a page query string parameter instead of a pair of limit and offset query string parameters. The spider then yields a request for each offset/page.

  2. If the page query string parameter is zero-indexed, set start_page = 0.

  3. Set formatter to set the file name, as in build_request(). If page_count_pointer is set or use_page = True, it defaults to parameters(<param_page>). Otherwise, if result_count_pointer is set and use_page = False, it defaults to parameters(<param_offset>). If formatter = None, the url_builder() method must return url, {'meta': {'file_name': ...}, ...}.

  4. Write a start_requests() method to yield the initial URL. The request’s callback parameter should be set to self.parse_list.
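The arithmetic behind the two configurations in step 1 can be sketched in plain Python. This is a standalone illustration, not the library's code: resolve_pointer(), page_numbers(), offsets() and result_pages() are hypothetical helpers, the first_response structure is invented, and the sketch enumerates every page or offset without modelling how the spider treats the response it has already received.

```python
import math


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer against already-parsed JSON data."""
    node = document
    if pointer == "":
        return node
    for token in pointer[1:].split("/"):
        # Unescape "~1" before "~0", as RFC 6901 requires.
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def page_numbers(data, page_count_pointer, start_page=1):
    """Page numbers implied by a total page count."""
    page_count = resolve_pointer(data, page_count_pointer)
    return list(range(start_page, start_page + page_count))


def _limit_value(data, limit):
    # limit is either a number or a JSON Pointer to the number.
    return resolve_pointer(data, limit) if isinstance(limit, str) else limit


def offsets(data, result_count_pointer, limit):
    """Offsets implied by a total result count and a page size."""
    total = resolve_pointer(data, result_count_pointer)
    return list(range(0, total, _limit_value(data, limit)))


def result_pages(data, result_count_pointer, limit, start_page=1):
    """Page numbers implied by a total result count (the use_page = True case)."""
    total = resolve_pointer(data, result_count_pointer)
    page_count = math.ceil(total / _limit_value(data, limit))
    return list(range(start_page, start_page + page_count))


# Invented first response, for illustration only.
first_response = {"meta": {"pages": 3, "total": 25, "per_page": 10}}
```

With this invented response, page_numbers(first_response, "/meta/pages") gives [1, 2, 3], and passing start_page=0 gives [0, 1, 2]. offsets(first_response, "/meta/total", "/meta/per_page") gives [0, 10, 20], and result_pages() with the same arguments gives [1, 2, 3].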

If neither page_count_pointer nor result_count_pointer can be used to create the URLs (e.g. if you need to query a separate URL that does not return JSON), you need to define range_generator() and url_builder() methods. range_generator() should return page numbers or offset numbers. url_builder() receives a page or offset from range_generator(), and returns either a request URL, or a tuple of a request URL and keyword arguments (to pass to build_request()).
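As a rough sketch of that pair of methods, the class below is a standalone stand-in rather than a real IndexSpider subclass. The method signatures follow the (data, response) and (value, data, response) forms listed for the built-in generators and builders on this page; the URL, the last_page key and the file name pattern are assumptions made up for the example.

```python
class ExampleBuilders:
    """Hypothetical stand-in showing the shape of the two methods."""

    # Assumed URL, for illustration only.
    list_url = "https://example.com/api/releases"

    def range_generator(self, data, response):
        # Assumption: a custom loader has put the page count in data["last_page"].
        return range(1, data["last_page"] + 1)

    def url_builder(self, value, data, response):
        url = f"{self.list_url}?page={value}"
        # Return the URL together with keyword arguments for build_request().
        # With formatter = None, the file name is supplied here via meta.
        return url, {"meta": {"file_name": f"page-{value}.json"}}


builders = ExampleBuilders()
requests_to_make = [
    builders.url_builder(page, {"last_page": 2}, None)
    for page in builders.range_generator({"last_page": 2}, None)
]
```

Here requests_to_make holds one (url, kwargs) tuple per page. A url_builder() that relies on a formatter instead could return the URL string alone.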

If the results are in ascending chronological order, set chronological_order = 'asc'.

The parse_list() method parses responses as JSON data. To change how these responses are parsed (for example, to check for an error response or to extract the page count from an HTML page), override the parse_list_loader() method. If this method returns a FileError, then parse_list() yields it and returns.

Otherwise, results are yielded from all responses by parse(). To use a different method, set the parse_list_callback class attribute to that method’s name as a string.

The names of the query string parameters ‘page’, ‘limit’ and ‘offset’ are customizable. Define the param_page, param_limit and param_offset class attributes to set the custom names.
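To illustrate what custom parameter names mean for the request URLs, here is a minimal sketch using only the standard library. replace_parameters() is a hypothetical helper, and the names size and from and the URL are invented; how the library itself rewrites the query string is not described here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def replace_parameters(url, **parameters):
    """Return url with the given query string parameters set or replaced."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update(parameters)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical custom names, as a spider might set them as class attributes.
param_limit = "size"
param_offset = "from"

first_url = "https://example.com/api?format=json&from=0"
# "from" is a Python keyword, so the names are passed through a dict.
next_url = replace_parameters(first_url, **{param_offset: 20, param_limit: 10})
```

In this sketch, next_url is https://example.com/api?format=json&from=20&size=10: the existing from value is replaced and size is appended.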

If the initial request uses a different URL than later requests, set the base_url class attribute to the base URL of later requests. In this case, results aren’t yielded from the response passed to parse_list().

use_page = False
start_page = 1
chronological_order = 'desc'
param_page = 'page'
param_limit = 'limit'
param_offset = 'offset'
base_url = ''
parse_list_callback = 'parse'
parse_list(response)
parse_list_loader(response)
page_count_range_generator(data, response)
pages_url_builder(value, data, response)
limit_offset_range_generator(data, response)
limit_offset_url_builder(value, data, response)
result_count_range_generator(data, response)