Database Store

class kingfisher_scrapy.extensions.database_store.DatabaseStore(database_url, files_store_directory)[source]

If the DATABASE_URL Scrapy setting and the crawl_time spider argument are set, OCDS data is stored incrementally in a PostgreSQL database.
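
For example, DATABASE_URL can be set in the project’s Scrapy settings (or on the command line with -s), and crawl_time passed when starting the crawl. This is a minimal sketch: the connection string, spider name and timestamp are illustrative values, not values required by the extension.

    # settings.py (illustrative; any PostgreSQL connection string works)
    DATABASE_URL = "postgresql://user:password@localhost:5432/kingfisher"

    # Then start the crawl with the crawl_time spider argument, for example:
    #   scrapy crawl example_spider -a crawl_time=2024-01-01T00:00:00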

This extension stores data in the “data” column of a table named after the spider, or the table_name spider argument (if set). When the spider is opened, the table is created if it doesn’t exist. The spider’s from_date attribute is then set to the first of the following, in order of precedence:

1. the from_date spider argument (unless it equals the spider’s default_from_date class attribute)
2. the maximum value of the date field of the stored data (if any)
3. the spider’s default_from_date class attribute (if set)
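
A minimal sketch of that precedence, assuming a simplified standalone helper (the function name and arguments are illustrative, not part of the extension):

    def resolve_from_date(spider, max_stored_date):
        # Simplified sketch of the precedence above; not the extension's code.
        default = getattr(spider, "default_from_date", None)
        from_date = getattr(spider, "from_date", None)
        # 1. The from_date spider argument, unless it equals default_from_date.
        if from_date and from_date != default:
            return from_date
        # 2. The maximum value of the date field of the stored data, if any.
        if max_stored_date:
            return max_stored_date
        # 3. The spider's default_from_date class attribute, if set.
        return default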

When the spider is closed, this extension reads the data written by the FilesStore extension to the crawl directory that matches the crawl_time spider argument. If the compile_releases spider argument is set, it creates compiled releases from the individual releases. Then, it recreates the table and inserts either the compiled releases (if the compile_releases spider argument is set), the individual releases in release packages (if the spider returns releases), or the compiled releases in record packages (if the spider returns records).
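
In rough terms, the close-spider step behaves like the sketch below, built on the methods documented further down. It assumes that execute() forwards keyword arguments to format() for identifier quoting; the exact SQL, and whether items need JSON serialization before insertion, are also assumptions.

    def load_crawl_directory(extension, spider, crawl_directory):
        # Simplified sketch, not the extension's actual implementation.
        table = extension.get_table_name(spider)
        # "Recreates the table": drop any previous table, then create it again.
        extension.execute("DROP TABLE IF EXISTS {table}", table=table)
        extension.create_table(table)
        # Insert one row per item read from the crawl directory: compiled
        # releases, individual releases, or records' compiled releases.
        for data in extension.yield_items_from_directory(crawl_directory):
            extension.execute(
                "INSERT INTO {table} (data) VALUES (%s)", variables=(data,), table=table
            )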

Warning

If the compile_releases spider argument is set, spiders that return records without embedded releases are not supported. If it isn’t set, then spiders that return records without compiled releases are not supported.

To perform incremental updates, the OCDS data in the crawl directory must not be deleted between crawls.

classmethod from_crawler(crawler)[source]
spider_opened(spider)[source]
spider_closed(spider, reason)[source]
create_table(table)[source]
yield_items_from_directory(crawl_directory, prefix='')[source]
format(statement, **kwargs)[source]

Formats the SQL statement, expressed as a format string with keyword arguments. A keyword argument’s value is converted to a SQL identifier, or a list of SQL identifiers, unless it’s already a sql object.
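
An illustrative call, assuming extension is an instance of this class and that psycopg2-style placeholders are used; the table name is hypothetical:

    # "example" is converted to a quoted SQL identifier; %s is left as a
    # placeholder for execute()'s variables argument.
    statement = extension.format(
        "INSERT INTO {table} (data) VALUES (%s)", table="example"
    )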

execute(statement, variables=None, **kwargs)[source]

Executes the SQL statement.
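
A hedged usage sketch, assuming extension is an instance of this class, that keyword arguments are formatted into the statement as identifiers (as with format() above), and that variables fills %s placeholders as in psycopg2:

    # Illustrative values; the table name and JSON payload are assumptions.
    extension.execute("DELETE FROM {table}", table="example")
    extension.execute(
        "INSERT INTO {table} (data) VALUES (%s)",
        variables=('{"date": "2024-01-01T00:00:00Z"}',),
        table="example",
    )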

get_table_name(spider)[source]