Index Spider
- class kingfisher_scrapy.base_spiders.index_spider.IndexSpider(*args, **kwargs)
Collect data from an API that includes the total number of results or pages in its response, and receives pagination parameters like page or limit and offset.
Set class attributes. Either:
- Set page_count_pointer to the JSON Pointer for the total number of pages in the first response. The spider then yields a request for each page, incrementing a page query string parameter in each request.
- Set result_count_pointer to the JSON Pointer for the total number of results, and set limit to the number of results to return per page, or to the JSON Pointer for it. Optionally, set use_page = True to configure the spider to send a page query string parameter instead of a pair of limit and offset query string parameters. The spider then yields a request for each offset/page.
If the page query string parameter is zero-indexed, set start_page = 0.
Set formatter to set the file name like in build_request(). If page_count_pointer is set or use_page = True, it defaults to parameters(<param_page>). Otherwise, if result_count_pointer is set and use_page = False, it defaults to parameters(<param_offset>). If formatter = None, the url_builder() method must return url, {'meta': {'file_name': ...}, ...}.
Write a start_requests() method to yield the initial URL. The request’s callback parameter should be set to self.parse_list.
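The arithmetic behind the two attribute configurations described above can be sketched in plain Python. This is an illustration under stated assumptions, not the spider's actual implementation: the helper names (resolve_pointer, page_numbers, offsets, pages_from_results), the sample response, and the choice to enumerate every page including the first are all hypothetical.

```python
import math


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer (e.g. '/meta/total') against parsed JSON."""
    current = document
    if pointer == "":
        return current
    for token in pointer.split("/")[1:]:
        # Unescape per RFC 6901: '~1' is '/', '~0' is '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


def page_numbers(data, page_count_pointer, start_page=1):
    """Page values when the response reports a total number of pages."""
    total_pages = resolve_pointer(data, page_count_pointer)
    return list(range(start_page, start_page + total_pages))


def _resolve_limit(data, limit):
    # Assumption: a string limit is a JSON Pointer, an int is the page size itself.
    return resolve_pointer(data, limit) if isinstance(limit, str) else limit


def offsets(data, result_count_pointer, limit):
    """Offset values when the response reports a total number of results."""
    total = resolve_pointer(data, result_count_pointer)
    return list(range(0, total, _resolve_limit(data, limit)))


def pages_from_results(data, result_count_pointer, limit, start_page=1):
    """Page values derived from a result count (the use_page = True case)."""
    total = resolve_pointer(data, result_count_pointer)
    page_count = math.ceil(total / _resolve_limit(data, limit))
    return list(range(start_page, start_page + page_count))


# Hypothetical first response from a paginated API.
first_response = {"meta": {"pages": 3, "total": 25, "per_page": 10}}

print(page_numbers(first_response, "/meta/pages"))                # [1, 2, 3]
print(page_numbers(first_response, "/meta/pages", start_page=0))  # [0, 1, 2]
print(offsets(first_response, "/meta/total", "/meta/per_page"))   # [0, 10, 20]
print(pages_from_results(first_response, "/meta/total", 10))      # [1, 2, 3]
```

Each value in these lists would become the page or offset query string parameter of one request.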
If neither page_count_pointer nor result_count_pointer can be used to create the URLs (e.g. if you need to query a separate URL that does not return JSON), you need to define range_generator() and url_builder() methods. range_generator() should return page numbers or offset numbers. url_builder() receives a page or offset from range_generator(), and returns either a request URL, or a tuple of a request URL and keyword arguments (to pass to build_request()).
If the results are in ascending chronological order, set chronological_order = 'asc'.
The parse_list() method parses responses as JSON data. To change the parser of these responses, for example to check for an error response or extract the page count from an HTML page, override the parse_list_loader() method. If this method returns a FileError, then parse_list() yields it and returns.
Otherwise, results are yielded from all responses by parse(). To change this method, set a parse_list_callback class attribute to a method’s name as a string.
The names of the query string parameters ‘page’, ‘limit’ and ‘offset’ are customizable. Define the param_page, param_limit and param_offset class attributes to set the custom names.
If a different URL is used for the initial request than for later requests, set the base_url class attribute to the base URL of later requests. In this case, results aren’t yielded from the response passed to parse_list.
- use_page = False
- start_page = 1
- chronological_order = 'desc'
- param_page = 'page'
- param_limit = 'limit'
- param_offset = 'offset'
- base_url = ''
- parse_list_callback = 'parse'
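The range_generator() and url_builder() contract described above, for cases where neither pointer attribute applies, might look roughly like the following standalone sketch. It does not import the library: the method signatures (data and response arguments), the ExamplePagination class, the normalize and planned_requests helpers, and the example.com URL are assumptions made for illustration only.

```python
class ExamplePagination:
    """Hypothetical stand-in for a spider that defines its own pagination."""

    # Hypothetical endpoint and page size.
    base = "https://example.com/api/records"
    page_size = 50

    def range_generator(self, data, response):
        # Assumption: custom parsing has already put the result count in data["total"].
        return range(0, data["total"], self.page_size)

    def url_builder(self, value, data, response):
        # Return a URL plus keyword arguments, including the file name,
        # as required when formatter = None.
        url = f"{self.base}?offset={value}&limit={self.page_size}"
        return url, {"meta": {"file_name": f"offset-{value}.json"}}


def normalize(result):
    """Accept either a bare URL or a (URL, keyword arguments) tuple."""
    if isinstance(result, tuple):
        url, kwargs = result
        return url, kwargs
    return result, {}


def planned_requests(helper, data, response=None):
    """List the (URL, keyword arguments) pairs that would become requests."""
    return [
        normalize(helper.url_builder(value, data, response))
        for value in helper.range_generator(data, response)
    ]


for url, kwargs in planned_requests(ExamplePagination(), {"total": 120}):
    print(url, kwargs["meta"]["file_name"])
```

Keeping range_generator() responsible only for the numbers and url_builder() only for the URL means either one can be changed without touching the other.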