Index Spider
- class kingfisher_scrapy.base_spiders.index_spider.IndexSpider(*args, **kwargs)
Collect data from an API that includes the total number of results or pages in its response, and accepts pagination parameters like page or limit and offset.

1. Set class attributes. Either:
   - Set page_count_pointer to the JSON Pointer for the total number of pages in the first response. The spider then yields a request for each page, incrementing a page query string parameter in each request.
   - Set result_count_pointer to the JSON Pointer for the total number of results, and set limit to the number of results to return per page, or to the JSON Pointer for it. Optionally, set use_page = True to configure the spider to send a page query string parameter instead of a pair of limit and offset query string parameters. The spider then yields a request for each offset/page.
   If the page query string parameter is zero-indexed, set start_page = 0.
2. Set formatter to set the file name like in build_request(). If page_count_pointer is set or use_page = True, it defaults to parameters(<param_page>). Otherwise, if result_count_pointer is set and use_page = False, it defaults to parameters(<param_offset>). If formatter = None, the url_builder() method must return url, {'meta': {'file_name': ...}, ...}.
3. Write a start() method to yield the initial URL. The request’s callback parameter should be set to self.parse_list.
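
The pagination arithmetic behind the two options above can be sketched in plain Python. This is a minimal illustration, not IndexSpider's implementation: the helper names (resolve_pointer, page_numbers, offsets, build_url), the example URL and the response shape are all assumptions, and the sketch lists every page including the first.

```python
from urllib.parse import urlencode


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer such as '/meta/total_pages'."""
    current = document
    if pointer == "":
        return current
    for token in pointer.split("/")[1:]:
        # Unescape '~1' before '~0', as RFC 6901 requires.
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


def page_numbers(first_response, page_count_pointer, start_page=1):
    """List the page numbers, given the total page count in the first response."""
    page_count = resolve_pointer(first_response, page_count_pointer)
    return list(range(start_page, start_page + page_count))


def offsets(first_response, result_count_pointer, limit):
    """List the offset of the first result on each page."""
    result_count = resolve_pointer(first_response, result_count_pointer)
    return list(range(0, result_count, limit))


def build_url(base_url, params):
    """Append query string parameters to a URL that has no query string yet."""
    return f"{base_url}?{urlencode(params)}"


# Hypothetical first response and endpoint, for illustration only.
first_response = {"meta": {"total_pages": 3, "total": 25}}
base_url = "https://example.com/api/releases"

page_urls = [
    build_url(base_url, {"page": page})
    for page in page_numbers(first_response, "/meta/total_pages")
]
offset_urls = [
    build_url(base_url, {"limit": 10, "offset": offset})
    for offset in offsets(first_response, "/meta/total", limit=10)
]
```

With a zero-indexed API, passing start_page=0 to page_numbers shifts the range to 0, 1, 2 for the same three pages. Custom parameter names (param_page, param_limit, param_offset) would simply change the dictionary keys passed to build_url.
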
If neither page_count_pointer nor result_count_pointer can be used to create the URLs (e.g. if you need to query a separate URL that does not return JSON), you need to define range_generator() and url_builder() methods. range_generator() should return page numbers or offset numbers. url_builder() receives a page or offset from range_generator(), and returns either a request URL, or a tuple of a request URL and keyword arguments (to pass to build_request()).

If the results are in ascending chronological order, set chronological_order = 'asc'.

The parse_list() method parses responses as JSON data. To change the parser of these responses (for example, to check for an error response or extract the page count from an HTML page), override the parse_list_loader() method. If this method returns None, then parse_list() returns.

Otherwise, results are yielded from all responses by parse(). To change this method, set a parse_list_callback class attribute to a method’s name as a string.

The names of the query string parameters ‘page’, ‘limit’ and ‘offset’ are customizable. Define the param_page, param_limit and param_offset class attributes to set the custom names.

If a different URL is used for the initial request than for later requests, set the base_url class attribute to the base URL of later requests. In this case, results aren’t yielded from the response passed to parse_list.

- use_page = False
- start_page = 1
- chronological_order = 'desc'
- param_page = 'page'
- param_limit = 'limit'
- param_offset = 'offset'
- base_url = ''
- parse_list_callback = 'parse'
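
For the case where neither pointer can be used, the division of labor between parse_list_loader(), range_generator() and url_builder() might look like the sketch below. It is written as standalone functions so that it runs without Scrapy; in a spider these would be methods, and their real parameter lists may differ. The HTML attribute, the endpoint URL and the file names are hypothetical.

```python
import re

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://example.com/api/releases"


def parse_list_loader(body):
    """Extract the page count from an HTML page, or return None if it is absent.

    Returning None stands in for "nothing to paginate", e.g. an error page.
    """
    match = re.search(r'data-total-pages="(\d+)"', body)
    if match is None:
        return None
    return {"total_pages": int(match.group(1))}


def range_generator(data, start_page=1):
    """Return the page numbers to request."""
    return range(start_page, start_page + data["total_pages"])


def url_builder(page):
    """Return a request URL and keyword arguments carrying the file name.

    The tuple form mirrors the formatter = None case, in which the file name
    must be supplied in the 'meta' keyword argument.
    """
    url = f"{BASE_URL}?page={page}"
    return url, {"meta": {"file_name": f"page-{page}.json"}}


html = '<ul class="pager" data-total-pages="3"></ul>'
data = parse_list_loader(html)
requests = [] if data is None else [url_builder(page) for page in range_generator(data)]
```

Each item in requests is a (url, kwargs) pair of the kind url_builder() is documented to return. A body without the page count makes the loader return None, so no requests are built.
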