Command-line interface
Most subcommands of the scrapy command are defined by Scrapy. You were already introduced to scrapy list and scrapy crawl. Kingfisher Collect adds a few more:
pluck
Plucks one data value per publisher. It writes a CSV file with the results, and a pluck_skipped.json file explaining why any spiders were skipped. It writes no OCDS data files.
- -p STR, --package-pointer=STR
The JSON Pointer to the value in the package.
- -r STR, --release-pointer=STR
The JSON Pointer to the value in the release.
- -t NUM, --truncate=NUM
Truncate the value to this number of characters.
- --max-bytes=NUM
Stop downloading an OCDS file after reading at least this many bytes.
Pass spider names as positional arguments to run specific spiders, instead of all spiders.
If you use --package-pointer, also set the --max-bytes option to limit the number of bytes downloaded. For example, set --max-bytes 10000: package metadata tends to be located at the start of files.
Note
--max-bytes is ignored for ZIP and RAR files, which must be downloaded in full to be read.
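The --package-pointer and --release-pointer options take JSON Pointers (RFC 6901), in which each /-separated token selects an object member or an array index. The following is a minimal sketch of how such a pointer resolves against parsed JSON. It is not Kingfisher Collect's implementation, and the resolve_pointer function and the sample package are hypothetical.

```python
def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer against parsed JSON data."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError(f"JSON Pointer must start with '/': {pointer!r}")
    current = document
    for token in pointer[1:].split("/"):
        # Unescape per RFC 6901: "~1" means "/", then "~0" means "~".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


# Hypothetical package, for illustration only.
package = {
    "publisher": {"name": "Example Agency"},
    "publicationPolicy": "https://example.com/policy",
    "releases": [{"date": "2020-01-01T00:00:00Z"}],
}

print(resolve_pointer(package, "/publisher/name"))
print(resolve_pointer(package["releases"][0], "/date"))
```

In this sketch, a package pointer is resolved against the whole package and a release pointer against a single release, mirroring the two options.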
Get each publisher’s publication policy
scrapy pluck --package-pointer /publicationPolicy
This writes a pluck-package-publicationPolicy.csv file, in which the second column is the spider’s name, and the first column is one of:
- The value of the publicationPolicy field in the package
- An error message, prefixed by error:
- The reason for which the spider was closed, prefixed by closed:
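When reviewing the CSV file, it can help to separate plucked values from errors and close reasons. The sketch below assumes the column order described above and a file with no header row. The spider names, error text and close reason are made up, and the classify helper is hypothetical.

```python
import csv
import io

# Hypothetical file contents, standing in for the CSV file on disk.
sample = io.StringIO(
    "https://example.com/policy,spider_a\n"
    "error:no package found,spider_b\n"
    "closed:timeout,spider_c\n"
)


def classify(value):
    """Return a (kind, text) pair based on the documented prefixes."""
    for prefix in ("error:", "closed:"):
        if value.startswith(prefix):
            return prefix[:-1], value[len(prefix):]
    return "value", value


results = {}
for value, spider in csv.reader(sample):
    results[spider] = classify(value)

for spider, (kind, text) in results.items():
    print(spider, kind, text)
```

To read a real file, replace the io.StringIO object with open("pluck-package-publicationPolicy.csv", newline="").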
Get the latest release date
Truncate the value to the date component of the datetime (its first 10 characters):
scrapy pluck --release-pointer /date --truncate 10
This writes a pluck-release-date.csv file.
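The value 10 matches the length of the date component (YYYY-MM-DD) at the start of an ISO 8601 datetime. As a small illustration of the effect, with a hypothetical release date:

```python
# Hypothetical release date, for illustration only.
release_date = "2020-01-01T00:00:00Z"

# Keep only the first 10 characters, as --truncate 10 would.
truncated = release_date[:10]
print(truncated)  # 2020-01-01
```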
Get the publisher’s name
scrapy pluck --package-pointer /publisher/name
This writes a pluck-package-publisher-name.csv file.
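The three examples above suggest a naming pattern for the output file: the pointer type, followed by the pointer's tokens joined with hyphens. The sketch below only reproduces that observed pattern. The pluck_filename function is hypothetical, and the behavior for other pointers (for example, ones containing array indices) is an assumption.

```python
def pluck_filename(kind, pointer):
    """Guess the CSV filename; kind is "package" or "release"."""
    tokens = pointer.strip("/").replace("/", "-")
    return f"pluck-{kind}-{tokens}.csv"


print(pluck_filename("package", "/publicationPolicy"))
print(pluck_filename("release", "/date"))
print(pluck_filename("package", "/publisher/name"))
```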
crawlall
Runs all spiders.
- --dry-run
Runs the spiders without writing any files. It stops after collecting one file or file item from each spider. This can be used to test whether any spiders are broken. Add the --logfile debug.log option to write the output to a log file for easier review.
- --sample=NUM
The number of files to write. This can be used to collect a sample from each spider.
Pass spider names as positional arguments to run specific spiders, instead of all spiders.
scrapy crawlall --dry-run
Note
One of --dry-run or --sample must be set. If you want to run all spiders to completion, use Scrapyd, which has better scheduling control and process management.
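If you launch crawlall from a script, the requirement in the note can be checked before the process starts. The following is a minimal sketch using only the options described above. The crawlall_args helper and the spider names are hypothetical, and the sketch checks only that at least one of the two options is given.

```python
def crawlall_args(spiders=(), dry_run=False, sample=None, logfile=None):
    """Build the argument list for a scrapy crawlall invocation."""
    if not dry_run and sample is None:
        raise ValueError("One of --dry-run or --sample must be set")
    args = ["scrapy", "crawlall"]
    if dry_run:
        args.append("--dry-run")
    if sample is not None:
        args.append(f"--sample={sample}")
    if logfile is not None:
        args.extend(["--logfile", logfile])
    # Spider names are positional arguments.
    args.extend(spiders)
    return args


print(crawlall_args(dry_run=True, logfile="debug.log"))
print(crawlall_args(spiders=["spider_a", "spider_b"], sample=5))
```

The returned list can be passed to subprocess.run(args, check=True) from the project directory.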
checkall
Checks that spiders are documented and well-implemented. It reports whether:
- The names of files, classes and spiders mismatch.
- Information is missing, unexpected or out-of-order in the docstring, including spider arguments.
- A publication in the Data Registry has GitHub issues, but isn’t frozen, or vice versa.
scrapy checkall
updatedocs
This command is for developers of Kingfisher Collect. When a new spider is added, or when a spider’s class-level docstring is updated, the developer should run this command to update docs/spiders.rst:
scrapy updatedocs