Download data to your computer

This page will guide you through installing Kingfisher Collect and using it to collect data from data sources.

Install Kingfisher Collect

To use Kingfisher Collect, you need access to a Unix-like shell (some are available for Windows). Git and Python (version 3.10 or greater) must be installed.
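
To check that these prerequisites are installed, you can run the following commands (assuming the Python interpreter is available as python3 on your system):

git --version
python3 --version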

When ready, open a shell, and run:

git clone https://github.com/open-contracting/kingfisher-collect.git
cd kingfisher-collect
pip install -r requirements.txt

Tip

If you encounter an error relating to psycopg2, try instead:

pip install psycopg2-binary -r requirements.txt
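
If you prefer to keep Kingfisher Collect’s dependencies isolated from other Python projects, you can create and activate a virtual environment before running the pip install command above (a sketch using Python’s built-in venv module; the .venv directory name is arbitrary, and this step is not required by this guide):

python3 -m venv .venv
source .venv/bin/activate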

To extract data from RAR files, you must have the unrar or unar command-line utility installed.
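
For example, you might install unar with your package manager (package availability is an assumption; check your operating system’s package manager):

sudo apt-get install unar  # Debian/Ubuntu
brew install unar          # macOS with Homebrew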

The next steps assume that you have changed to the kingfisher-collect directory (the cd command above).

Configure Kingfisher Collect

Note

This step is optional.

To store files in a directory other than the default data directory, change the FILES_STORE variable in the kingfisher_scrapy/settings.py file. It can be a relative path (like data) or an absolute path (like /home/user/path). For example:

FILES_STORE = '/home/user/path'

Collect data

You’re now ready to collect data!

To list the spiders, run:

scrapy list

The spiders’ names might be ambiguous. If you’re unsure which spider to run, you can find more information about them on the Spiders page, or contact the Data Support Team.
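
If you have a keyword in mind, you can also narrow the list from the shell. For example, to list the spiders whose names contain “colombia” (the keyword is only an illustration):

scrapy list | grep -i colombia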

To run a spider (that is, to start a “crawl”), replace spider_name below with the name of a spider from scrapy list above:

scrapy crawl spider_name

Download a sample

To download only a sample of the available data, set the sample size with the sample spider argument:

scrapy crawl spider_name -a sample=10

Scrapy will then output a log of its activity.

Note

The suffix _sample is added to the spider’s directory name, e.g. kingfisher-collect/data/zambia_sample.
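
For example, after running a sample crawl of the zambia spider, you could confirm that the directory exists (a sketch; the crawl directory inside it is named after the crawl’s start time):

ls data/zambia_sample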

Filter the data

Each spider supports different filters, which you can set as spider arguments. For example:

scrapy crawl colombia -a from_date=2015-01-01

You can find which filters a spider supports on the Spiders page.

Not all of an API’s features are exposed by Kingfisher Collect. Each spider links to its API documentation in its metadata, where you can learn which filters the API supports. If the filters are implemented as query string parameters, you can apply multiple filters with qs: spider arguments, for example:

scrapy crawl spider_name -a qs:parameter1=value1 -a qs:parameter2=value2
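
For instance, if the API documented a year query string parameter (a hypothetical parameter, used only for illustration), the crawl might look like:

scrapy crawl spider_name -a qs:year=2020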

If the filters are implemented as path parameters, you can append path components to each URL, for example:

scrapy crawl spider_name -a path=key1/value1/key2/value2/value3

Collect data incrementally

By default, scrapy crawl downloads all the data from the source. You can use spider arguments to filter the data, so that only new data is collected. For example, you might run a first crawl to collect data up to yesterday:

scrapy crawl spider_name -a until_date=2020-10-14

Then, at a later date, run a second crawl to collect data from the day after the first crawl’s until_date, up to the new yesterday:

scrapy crawl spider_name -a from_date=2020-10-15 -a until_date=2020-10-31

And so on. However, as you learned in How it works, each crawl writes data to a separate directory. By default, this directory is named according to the time at which you started the crawl. To collect the incremental data into the same directory, you can take the time from the first crawl’s directory name, then override the time of subsequent crawls with the crawl_time spider argument:

scrapy crawl spider_name -a from_date=2020-10-15 -a until_date=2020-10-31 -a crawl_time=2020-10-14T12:34:56
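
To find the value to pass as crawl_time, look at the first crawl’s directory name. For example, a directory named 20201014_123456 (an illustrative name, following the naming pattern shown in the Use data section below) corresponds to a crawl started at 2020-10-14T12:34:56:

ls data/spider_name
# 20201014_123456  ->  crawl_time=2020-10-14T12:34:56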

If you are integrating with Kingfisher Process, remember to set the keep_collection_open spider argument to 'true', so that the collection is not closed when the crawl is finished:

scrapy crawl spider_name -a keep_collection_open=true

See also

DatabaseStore extension

Use a proxy

Note

This is an advanced topic. In most cases, you will not need to use this feature.

If the data source is blocking Scrapy’s requests, you might need to use a proxy.

To use an HTTP and/or HTTPS proxy, set the http_proxy and/or https_proxy environment variables, and override the HTTPPROXY_ENABLED Scrapy setting:

env http_proxy=YOUR-PROXY-URL https_proxy=YOUR-PROXY-URL scrapy crawl spider_name -s HTTPPROXY_ENABLED=True
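
Alternatively, you can export the variables for the rest of your shell session, instead of prefixing each command (a sketch assuming a POSIX-compatible shell):

export http_proxy=YOUR-PROXY-URL
export https_proxy=YOUR-PROXY-URL
scrapy crawl spider_name -s HTTPPROXY_ENABLED=True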

Use data

You should now have, within the data directory, a crawl directory containing OCDS files. The log file indicates the path to that crawl directory. For example:

2023-01-05 09:12:14 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-05 09:12:14 [argentina_vialidad] INFO: +-------------------------- DATA DIRECTORY --------------------------+
2023-01-05 09:12:14 [argentina_vialidad] INFO: |                                                                    |
2023-01-05 09:12:14 [argentina_vialidad] INFO: | The data is available at: data/argentina_vialidad/20230105_121201  |
2023-01-05 09:12:14 [argentina_vialidad] INFO: |                                                                    |
2023-01-05 09:12:14 [argentina_vialidad] INFO: +--------------------------------------------------------------------+
2023-01-05 09:12:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 267,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
...
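
To inspect the downloaded files, list the crawl directory reported in the log (the path below matches the example above; yours will differ):

ls data/argentina_vialidad/20230105_121201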

If no data was downloaded, the log file contains a message like:

2023-01-05 09:14:43 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-05 09:14:43 [argentina_vialidad] INFO: +---------------- DATA DIRECTORY ----------------+
2023-01-05 09:14:43 [argentina_vialidad] INFO: |                                                |
2023-01-05 09:14:43 [argentina_vialidad] INFO: | Something went wrong. No data was downloaded.  |
2023-01-05 09:14:43 [argentina_vialidad] INFO: |                                                |
2023-01-05 09:14:43 [argentina_vialidad] INFO: +------------------------------------------------+

See the Log files page to learn how to interpret the log file.

For help using data, read about using open contracting data.