Download data to your computer#
This page will guide you through installing Kingfisher Collect and using it to collect data from data sources.
Install Kingfisher Collect#
To use Kingfisher Collect, you need access to a Unix-like shell (some are available for Windows). Git and Python (version 3.10 or greater) must be installed.
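To check that both are installed, you can run the following (the Python executable might be named python or python3, depending on your system):
git --version
python3 --version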
When ready, open a shell, and run:
git clone https://github.com/open-contracting/kingfisher-collect.git
cd kingfisher-collect
pip install -r requirements.txt
Tip
If you encounter an error relating to psycopg2, try instead:
pip install psycopg2-binary -r requirements.txt
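If you prefer to keep Kingfisher Collect's dependencies separate from your system Python packages, one common approach (optional, and not required by the steps on this page) is to create and activate a virtual environment before installing:
python3 -m venv .ve
. .ve/bin/activate
pip install -r requirements.txt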
To extract data from RAR files, you must have the unrar or unar command-line utility installed.
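For example, on a Debian-based system you might install the latter with the command below; package names and package managers vary by operating system:
sudo apt-get install unar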
The next steps assume that you have changed to the kingfisher-collect directory (the cd command above).
Configure Kingfisher Collect#
Note
This step is optional.
To store files in a directory other than the default data directory, change the FILES_STORE variable in the kingfisher_scrapy/settings.py file. It can be a relative path (like data) or an absolute path (like /home/user/path).
FILES_STORE = '/home/user/path'
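Alternatively, Scrapy settings can be overridden for a single crawl with the -s command-line option (used later on this page for the proxy setting), so you could set the storage directory without editing the file. For example, assuming the setting is respected when passed on the command line:
scrapy crawl spider_name -s FILES_STORE=/home/user/path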
Collect data#
You’re now ready to collect data!
To list the spiders, run:
scrapy list
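For example, to look for spiders related to a particular country or publisher, you can filter the output (assuming the grep utility is available):
scrapy list | grep -i colombia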
The spiders’ names might be ambiguous. If you’re unsure which spider to run, you can find more information about them on the Spiders page, or contact the Data Support Team.
To run a spider (that is, to start a “crawl”), replace spider_name below with the name of a spider from the scrapy list output above:
scrapy crawl spider_name
Download a sample#
To download only a sample of the available data, set the sample size with the sample spider argument:
scrapy crawl spider_name -a sample=10
Scrapy will then output a log of its activity.
Note
_sample will be appended to the spider’s directory name, e.g. kingfisher-collect/data/zambia_sample.
Filter the data#
Each spider supports different filters, which you can set as spider arguments. For example:
scrapy crawl colombia -a from_date=2015-01-01
You can find which filters a spider supports on the Spiders page.
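If a spider supports both the from_date and until_date filters (check the Spiders page), you can combine several spider arguments in one command, for example:
scrapy crawl colombia -a from_date=2015-01-01 -a until_date=2015-12-31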
Not all of an API’s features are exposed by Kingfisher Collect. Each spider links to its API documentation in its metadata, where you can learn what filters the API supports. If the filters are implemented as query string parameters, you can apply multiple filters with, for example:
scrapy crawl spider_name -a qs:parameter1=value1 -a qs:parameter2=value2
If the filters are implemented as path parameters, you can append path components to each URL, for example:
scrapy crawl spider_name -a path=key1/value1/key2/value2/value3
Collect data incrementally#
By default, scrapy crawl
downloads all the data from the source. You can use spider arguments to filter the data, in order to only collect new data. For example, you might run a first crawl to collect data until yesterday:
scrapy crawl spider_name -a until_date=2020-10-14
Then, at a later date, run a second crawl to collect data from the day after the first crawl’s end date until yesterday:
scrapy crawl spider_name -a from_date=2020-10-15 -a until_date=2020-10-31
And so on. However, as you learned in How it works, each crawl writes data to a separate directory. By default, this directory is named according to the time at which you started the crawl. To collect the incremental data into the same directory, you can take the time from the first crawl’s directory name, then override the time of subsequent crawls with the crawl_time spider argument:
scrapy crawl spider_name -a from_date=2020-10-15 -a until_date=2020-10-31 -a crawl_time=2020-10-14T12:34:56
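For example, a third crawl run a month later could reuse the same crawl_time, so that its data is written to the same directory as the earlier crawls (the dates below are only illustrative):
scrapy crawl spider_name -a from_date=2020-11-01 -a until_date=2020-11-30 -a crawl_time=2020-10-14T12:34:56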
If you are integrating with Kingfisher Process, remember to set the keep_collection_open spider argument, so that the collection is not closed when the crawl is finished:
scrapy crawl spider_name -a keep_collection_open=true
See also
DatabaseStore extension
Use a proxy#
Note
This is an advanced topic. In most cases, you will not need to use this feature.
If the data source is blocking Scrapy’s requests, you might need to use a proxy.
To use an HTTP and/or HTTPS proxy, set the http_proxy and/or https_proxy environment variables, and override the HTTPPROXY_ENABLED Scrapy setting:
env http_proxy=YOUR-PROXY-URL https_proxy=YOUR-PROXY-URL scrapy crawl spider_name -s HTTPPROXY_ENABLED=True
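If you prefer, you can export the variables for the whole shell session instead of setting them on each command (YOUR-PROXY-URL is a placeholder):
export http_proxy=YOUR-PROXY-URL
export https_proxy=YOUR-PROXY-URL
scrapy crawl spider_name -s HTTPPROXY_ENABLED=True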
Use data#
You should now have a crawl directory within the data directory, containing OCDS files.
The log file indicates the path to that crawl directory. For example:
2023-01-05 09:12:14 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-05 09:12:14 [argentina_vialidad] INFO: +-------------------------- DATA DIRECTORY --------------------------+
2023-01-05 09:12:14 [argentina_vialidad] INFO: | |
2023-01-05 09:12:14 [argentina_vialidad] INFO: | The data is available at: data/argentina_vialidad/20230105_121201 |
2023-01-05 09:12:14 [argentina_vialidad] INFO: | |
2023-01-05 09:12:14 [argentina_vialidad] INFO: +--------------------------------------------------------------------+
2023-01-05 09:12:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 267,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
...
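For example, using the path reported in the log above, you can list the downloaded files:
ls data/argentina_vialidad/20230105_121201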
If no data was downloaded, the log file contains a message like:
2023-01-05 09:14:43 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-05 09:14:43 [argentina_vialidad] INFO: +---------------- DATA DIRECTORY ----------------+
2023-01-05 09:14:43 [argentina_vialidad] INFO: | |
2023-01-05 09:14:43 [argentina_vialidad] INFO: | Something went wrong. No data was downloaded. |
2023-01-05 09:14:43 [argentina_vialidad] INFO: | |
2023-01-05 09:14:43 [argentina_vialidad] INFO: +------------------------------------------------+
You can check the Log files page to learn how to interpret the log file.
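If you ran the crawl in a terminal and want to keep the log for later inspection, you can write it to a file with Scrapy’s --logfile option, for example:
scrapy crawl spider_name --logfile=spider_name.log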
For help using data, read about using open contracting data.