Download data to a remote server

Note

This is an advanced guide that assumes knowledge of web hosting.

Some spiders take a long time to run (days or weeks), and some data sources have a lot of OCDS data (GBs). In such cases, you might not want to download data to your computer, and might instead prefer to use a separate machine. You have two options:

  1. Follow the same instructions as before, and start crawls on the other machine

  2. Install Scrapyd on a remote server (this guide)

Scrapyd also makes it possible for many users to schedule crawls on the same machine.

Install Scrapyd

On the remote server, follow these instructions to install Scrapyd, then install Kingfisher Collect’s requirements in the same environment as Scrapyd:

curl -O https://raw.githubusercontent.com/open-contracting/kingfisher-collect/main/requirements.txt
pip install -r requirements.txt
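
For example, a minimal sketch of installing Scrapyd into a virtual environment (the ~/scrapyd-venv path is an assumption; adjust it to your server), so that the commands above install Kingfisher Collect’s requirements into the same environment:

# create and activate a virtual environment, then install Scrapyd into it
python3 -m venv ~/scrapyd-venv
source ~/scrapyd-venv/bin/activate
pip install scrapyd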

Start Scrapyd

On the remote server, follow these instructions to start Scrapyd. Scrapyd should then be accessible at http://your-remote-server:6800/. If not, refer to Scrapyd’s documentation or its GitHub issues to troubleshoot.
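
As a minimal sketch, assuming Scrapyd is installed in the currently active environment, you can start it in the foreground with:

scrapyd

For a long-running deployment, you would typically run Scrapyd under a process manager (for example, systemd or supervisord) so that it restarts automatically.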

Using the Scrapyd web interface

  • To see the scheduled, running and finished crawls, click “Jobs”

  • To browse the crawls’ log files, click “Logs”

For help understanding the log files, read Log files.
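
You can also check that Scrapyd is running from the command line with its daemonstatus.json API endpoint, which reports the number of pending, running and finished jobs (replace localhost with your remote server’s domain name):

curl 'http://localhost:6800/daemonstatus.json'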

Note

If Scrapyd restarts or the server reboots, all scheduled crawls are cancelled, all running crawls are interrupted, and all finished crawls are delisted from the web interface. However, you can still browse the crawls’ log files to review the finished crawls.

Install Kingfisher Collect

On your local machine, install Kingfisher Collect.

Configure Kingfisher Collect

Create a ~/.config/scrapy.cfg file using the template below, and set the url variable to point to the remote server:

[deploy:kingfisher]
url = http://localhost:6800/
project = kingfisher

At a minimum, replace localhost with the remote server’s domain name. If you changed the http_port variable in Scrapyd’s configuration file, also replace 6800.
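
For example, if the remote server is reachable at scrapyd.example.com (a placeholder domain) and Scrapyd listens on the default port, the file would be:

[deploy:kingfisher]
url = http://scrapyd.example.com:6800/
project = kingfisher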

If you changed the FILES_STORE variable when installing Kingfisher Collect, that same directory needs to exist on the remote server, and the scrapyd process needs permission to write to it. If you are using the default value, then files will be stored in a data directory under the Scrapyd directory on the remote server.
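
For example, if you set FILES_STORE to /data/kingfisher and Scrapyd runs as the scrapyd user (both values are assumptions for illustration), you might prepare the directory on the remote server with:

sudo mkdir -p /data/kingfisher
sudo chown scrapyd: /data/kingfisher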

Deploy spiders

On your local machine, deploy the spiders in Kingfisher Collect to Scrapyd, using the scrapyd-deploy command, which was installed with Kingfisher Collect:

scrapyd-deploy kingfisher

Remember to run this command every time you add or update a spider.
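
To confirm that the project was deployed, you can query Scrapyd’s listprojects.json API endpoint (replace localhost with your remote server’s domain name):

curl 'http://localhost:6800/listprojects.json'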

Collect data

Note

In all examples below, replace localhost with your remote server’s domain name, and replace spider_name with a spider’s name.

You’re now ready to collect data!

To list the spiders, use Scrapyd’s listspiders.json API endpoint:

curl 'http://localhost:6800/listspiders.json?project=kingfisher'

To make the list of spiders easier to read, pipe the response through python -m json.tool:

curl 'http://localhost:6800/listspiders.json?project=kingfisher' | python -m json.tool

The spiders’ names might be ambiguous. If you’re unsure which spider to run, you can find more information about them on the Spiders page, or contact the Data Support Team.

To run a spider (that is, to schedule a “crawl”), use Scrapyd’s schedule.json API endpoint:

curl http://localhost:6800/schedule.json -d project=kingfisher -d spider=spider_name

If successful, you’ll see something like:

{"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}

To download only a sample of the available data, to filter data, or to collect data incrementally, use -d instead of -a before each spider argument:

curl http://localhost:6800/schedule.json -d project=kingfisher -d spider=spider_name -d sample=10
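
For example, some spiders accept a from_date spider argument to collect data incrementally (check the Spiders page for the arguments a spider supports and the expected date format); it is passed the same way:

curl http://localhost:6800/schedule.json -d project=kingfisher -d spider=spider_name -d from_date=2023-01-01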

To use an HTTP and/or HTTPS proxy, prefix each overridden setting with -d setting= instead of -s:

curl http://localhost:6800/schedule.json -d project=kingfisher -d spider=spider_name -d setting=HTTPPROXY_ENABLED=True

Note

The http_proxy and/or https_proxy environment variables must already be set in Scrapyd’s environment on the remote server.
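
For example, assuming a proxy at http://proxy.example.com:8080 (a placeholder address), you might export the variables before starting Scrapyd:

export http_proxy=http://proxy.example.com:8080
export https_proxy=http://proxy.example.com:8080
scrapyd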

If the crawl’s log file contains HTTP 429 Too Many Requests errors, you can make the spider wait between requests by setting the DOWNLOAD_DELAY setting (in seconds):

curl http://localhost:6800/schedule.json -d project=kingfisher -d spider=spider_name -d setting=DOWNLOAD_DELAY=1
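
Spider arguments and setting overrides can be combined in a single request. For example, to collect a sample of 10 records with a one-second delay between requests (an illustrative combination):

curl http://localhost:6800/schedule.json -d project=kingfisher -d spider=spider_name -d sample=10 -d setting=DOWNLOAD_DELAY=1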