Command-line interface#
Most subcommands of the scrapy command are defined by Scrapy. You were already introduced to scrapy list and scrapy crawl. Kingfisher Collect adds a few more:
pluck#
Plucks one data value per publisher. It writes a CSV file with the results, and a pluck_skipped.json file giving the reasons why some spiders were skipped. It writes no OCDS data files.
- -p STR, --package-pointer=STR
  The JSON Pointer to the value in the package.
- -r STR, --release-pointer=STR
  The JSON Pointer to the value in the release.
- -t NUM, --truncate=NUM
  Truncate the value to this number of characters.
- --max-bytes=NUM
  Stop downloading an OCDS file after reading at least this many bytes.
Pass spider names as positional arguments to run specific spiders, instead of all spiders.
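For example, to pluck a value from only specific spiders (the spider names below are placeholders; run scrapy list to see the available names):

scrapy pluck --package-pointer /publicationPolicy spider_name_1 spider_name_2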
If you’re using --package-pointer, it is recommended to use the --max-bytes option to limit the number of bytes downloaded. For example, you can set --max-bytes 10000, because package metadata tends to be located at the start of files.
Note
--max-bytes is ignored for ZIP and RAR files, which must be downloaded in full to be read.
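For example, a command combining the two options as recommended above (the pointer and the byte limit are illustrative):

scrapy pluck --package-pointer /publicationPolicy --max-bytes 10000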
Get each publisher’s publication policy#
scrapy pluck --package-pointer /publicationPolicy
This writes a pluck-package-publicationPolicy.csv file, in which the second column is the spider’s name, and the first column is either:

- The value of the publicationPolicy field in the package
- An error message, prefixed by error:
- The reason for which the spider was closed, prefixed by closed:
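Purely as an illustration, rows in the resulting CSV might look like the following (the URL, error message, and spider names are hypothetical):

https://example.com/publication-policy,spider_name_1
error: 404 Not Found,spider_name_2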
Get the latest release date#
And truncate to the date component of the datetime:
scrapy pluck --release-pointer /date --truncate 10
This writes a pluck-release-date.csv file.
Get the publisher’s name#
scrapy pluck --package-pointer /publisher/name
This writes a pluck-package-publisher-name.csv file.
crawlall#
Runs all spiders.
- --dry-run
  Runs the spiders without writing any files. It stops after collecting one file or file item from each spider. This can be used to test whether any spiders are broken. Add the --logfile debug.log option to write the output to a log file for easier review.
- --sample=NUM
  The number of files to write. This can be used to collect a sample from each spider.
Pass spider names as positional arguments to run specific spiders, instead of all spiders.
scrapy crawlall --dry-run
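For example, to write the output of a dry run to a log file, or to collect a small sample from a single spider (the spider name and sample size are placeholders):

scrapy crawlall --dry-run --logfile debug.log
scrapy crawlall --sample 10 spider_name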
Note
One of --dry-run or --sample must be set. If you want to run all spiders to completion, use Scrapyd, which has better scheduling control and process management.
checkall#
Checks that spiders are documented and well-implemented. It reports whether information in the docstring is missing, out of order, or unexpected, and whether an expected spider argument isn’t implemented.
scrapy checkall
updatedocs#
This command is for developers of Kingfisher Collect. When a new spider is added, or when a spider’s class-level docstring is updated, the developer should run this command to update docs/spiders.rst:
scrapy updatedocs