Periodic Spider

class kingfisher_scrapy.base_spiders.periodic_spider.PeriodicSpider(*args, **kwargs)

Collect data from an API that accepts a year, year-month, date or datetime as a query string parameter or URL path component.

  1. Inherit from PeriodicSpider

  2. Set a date_format class attribute to “year”, “year-month”, “date” or “datetime”

  3. Set a pattern class attribute to a URL pattern, with placeholders. If the date_format is “year”, then a year is passed to the placeholder as an int. If the date_format is “year-month”, then the first day of the month is passed to the placeholder as a date, whose year and month you can format within the placeholder.

  4. Set a formatter class attribute to set the file name, as in build_request()

  5. Set a default_from_date class attribute to a year (“YYYY”) or year-month (“YYYY-MM”)

  6. If the source stopped publishing, set a default_until_date class attribute to a year or year-month

  7. Optionally, if the date_format is “date”, set a step class attribute to the length of each interval, in days; otherwise, it defaults to 1

  8. Optionally, set a start_callback class attribute to a method’s name as a string; otherwise, it defaults to parse()
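
As a rough illustration of step 3, the sketch below shows how a pattern with placeholders might be filled in for the “year” and “year-month” formats. It assumes the pattern is filled in with Python's str.format(), and the example.com URLs are hypothetical, not a real publisher's API.

```python
from datetime import date

# Hypothetical URL patterns (example.com is a placeholder host).
year_pattern = "https://example.com/api/releases/{0:d}"
month_pattern = "https://example.com/api/releases/{0.year:d}/{0.month:02d}"

# With date_format "year", the placeholder receives an int.
year_url = year_pattern.format(2021)

# With date_format "year-month", the placeholder receives the first day
# of the month as a date, so its attributes can be formatted individually.
month_url = month_pattern.format(date(2021, 3, 1))

print(year_url)
print(month_url)
```

The `{0.month:02d}` field zero-pads the month, which is useful when the API expects two-digit months in the path.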

If sample is set, the data from the most recent year or month is retrieved.
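
To make the sample behaviour concrete, here is a minimal sketch of iterating over months from newest to oldest and keeping only the first when sampling. The helper months_newest_first is hypothetical and is not the library's implementation. The newest-first order is an assumption, inferred from the statement that the most recent period is retrieved.

```python
from datetime import date


def months_newest_first(from_date, until_date):
    """Yield the first day of each month, from until_date back to from_date.

    Hypothetical helper for illustration only.
    """
    year, month = until_date.year, until_date.month
    while (year, month) >= (from_date.year, from_date.month):
        yield date(year, month, 1)
        # Step back one month, rolling over to December of the prior year.
        month -= 1
        if month == 0:
            year, month = year - 1, 12


months = list(months_newest_first(date(2020, 11, 1), date(2021, 2, 1)))

# When sampling, only the most recent period would be requested.
sample = months[:1]
```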

date_required = True
step = 1
start_callback = 'parse'
async start()

Yield the initial Request objects to send.

Added in version 2.13.

For example:

from scrapy import Request, Spider


class MySpider(Spider):
    name = "myspider"

    async def start(self):
        yield Request("https://toscrape.com/")

The default implementation reads URLs from start_urls and yields a request for each with dont_filter enabled. It is functionally equivalent to:

async def start(self):
    for url in self.start_urls:
        yield Request(url, dont_filter=True)

You can also yield items. For example:

async def start(self):
    yield {"foo": "bar"}

To write spiders that work on Scrapy versions lower than 2.13, also define a synchronous start_requests() method that returns an iterable. For example:

def start_requests(self):
    yield Request("https://toscrape.com/")

See also

Start requests

build_urls(from_date, until_date=None)

Yield one or more URLs for the given date.
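
The following standalone sketch suggests how a method with this signature could combine with the step attribute when date_format is “date”. Both sketch_build_urls and day_intervals are hypothetical names, the example.com pattern is a placeholder, and the way intervals are split is an assumption rather than the library's documented behaviour.

```python
from datetime import date, timedelta


def sketch_build_urls(pattern, from_date, until_date=None):
    """Yield URLs for one date, or for one (from, until) interval.

    Hypothetical stand-in for the method; not the library's code.
    """
    if until_date is None:
        yield pattern.format(from_date)
    else:
        yield pattern.format(from_date, until_date)


def day_intervals(from_date, until_date, step):
    """Yield (start, end) pairs spanning at most `step` days each."""
    start = from_date
    while start < until_date:
        end = min(start + timedelta(days=step), until_date)
        yield start, end
        start = end


# Hypothetical pattern with two placeholders, formatted as ISO dates.
pattern = "https://example.com/api?from={0:%Y-%m-%d}&to={1:%Y-%m-%d}"

urls = [
    url
    for start, end in day_intervals(date(2021, 1, 1), date(2021, 1, 8), step=3)
    for url in sketch_build_urls(pattern, start, end)
]
```

Overriding a method like this in a subclass would be one way to yield several URLs per date, for example one per data type, though whether that suits a given source depends on its API.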