Contributing¶
There are two main types of contributions: spiders and features.
Write a spider¶
Learn the data source’s access methods¶
Read its API documentation or bulk download documentation. Navigate the API, in your browser or with curl. Inspect its responses, to determine where the OCDS data is located, and whether it includes information like pagination links, total pages or total results.
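For example, you can quickly check the top-level structure of a response with a few lines of Python (a minimal sketch using only the standard library; the URL and field names are hypothetical, and curl works just as well):

import json
from urllib.request import urlopen

# Hypothetical endpoint; substitute the data source you are investigating.
with urlopen('https://example.com/api/releases?page=1') as response:
    data = json.load(response)

# Where is the OCDS data, and is there pagination metadata?
print(list(data))         # e.g. ['releases', 'links', 'total']
print(data.get('links'))  # e.g. {'next': 'https://example.com/api/releases?page=2'}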
Note
Please inform the data support manager of the following, so that it can be reported as feedback to the publisher:
If there is no documentation about access methods
If the release package or record package is not at the top-level of the JSON data
Choose a spider name¶
Lowercase and join the components below with underscores. Replace any spaces with underscores.
For an access method to a jurisdiction-specific data source:
Country name. Do not use acronyms, like “uk”. If in doubt, follow ISO 3166-1. For example: “kyrgyzstan”, not “kyrgyz_republic”. For a non-country like the European Union, use the relevant geography, like “europe”.
Subdivision name. Do not use acronyms, like “nsw”. Omit the subdivision type, like “state”, unless it is typically included, like in Nigeria. If in doubt, follow ISO 3166-2.
System name, if needed. Acronyms are allowed, like “ted”.
Publisher name, if needed. Required if the publisher is not a government.
Disambiguator, if needed. For example: “historical”.
Access method, if needed: “bulk” or “api”.
OCDS format, if needed: “releases”, “records”, “release_packages” or “record_packages”.
For an access method to a multi-jurisdiction data source:
Organization name
Disambiguator
If a component repeats another, you can omit or abbreviate the component, like peru_compras instead of peru_peru_compras.
It is not required for the name to be minimal. For example, uganda_releases is allowed even if there is no uganda_records.
If you create a new base class, omit the components that are not shared, and add “base” to the end. For example, the afghanistan_packages_base.py file contains the base class for the afghanistan_record_packages and afghanistan_release_packages spiders.
Note
The primary goal is for users to easily find the relevant spider. Keeping the name short and avoiding repetition is a secondary goal. For example, for mexico_veracruz_ivai, the v in ivai repeats veracruz, and for mexico_mexico_state_infoem, the em in infoem repeats mexico_state (Estado de México), but abbreviating the ivai or infoem acronyms would make them less familiar and recognizable to users.
Choose a base class¶
Access methods for OCDS data are very similar. Spiders therefore share a lot of logic by inheriting from one of the Base Spider Classes:
IndexSpider: Use if the API includes the total number of results or pages in its response.
PeriodicSpider: Use if the bulk downloads or API methods accept a year, year-month or date as a query string parameter or URL path component.
LinksSpider: Use if the API implements pagination and if IndexSpider can’t be used.
CompressedFileSpider: Use if the downloads are ZIP or RAR files.
BigFileSpider: Use if the packages are very large JSON files (e.g. over 500MB), which can cause memory issues for users.
SimpleSpider: Use in almost all other cases. IndexSpider, PeriodicSpider and LinksSpider are child classes of this class.
BaseSpider: All spiders inherit, directly or indirectly, from this class, which in turn inherits from scrapy.Spider. Use if none of the above can be used.
Write the spider¶
After choosing a base class, read its documentation, as well as its parent class’ documentation. It’s also helpful to read existing spiders that inherit from the same class. A few other pointers:
Write different callback methods for different response types. Writing a single callback with many if-else branches to handle different response types is very hard to reason about.
The default parse callback method should be for “leaf” responses: that is, responses that cause no further requests to be yielded, besides pagination requests.
Have a look at the util module, which contains useful functions, notably handle_http_error(). Both points are illustrated in the sketch below.
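For example, a minimal sketch of a spider that inherits from SimpleSpider (the publisher, URLs and response format are made up, and the import path may differ between versions; check existing spiders for the exact conventions):

import scrapy

from kingfisher_scrapy.base_spiders import SimpleSpider  # import path may vary by version
from kingfisher_scrapy.util import components, handle_http_error


class ExampleReleasePackages(SimpleSpider):
    """
    A sketch only: the publisher, URLs and response format are hypothetical.
    """
    name = 'example_release_packages'

    # SimpleSpider
    data_type = 'release_package'

    def start_requests(self):
        # A hypothetical endpoint that lists the URLs of release packages.
        yield scrapy.Request('https://example.com/api/packages', meta={'file_name': 'list.json'},
                             callback=self.parse_list)

    @handle_http_error
    def parse_list(self, response):
        # Yield a request for each "leaf" URL; the inherited parse() callback handles those responses.
        for url in response.json():
            yield self.build_request(url, formatter=components(-1))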
After writing the spider, add a docstring for spider metadata.
Note
If you encountered any challenges, make a note in the Caveats section of the spider metadata, and inform the data support manager, so that it can be reported as feedback to the publisher. Examples:
Requests sometimes fail (e.g. timeout or error), but succeed on retry.
Some requests always fail (e.g. for a specific date).
Error responses are returned with an HTTP 200 status code, instead of a status code in the range 400-599.
The JSON data is encoded as ISO-8859-1, instead of UTF-8 per RFC 8259.
The JSON data sometimes contains unescaped newline characters within strings.
The number of results is limited to 10,000.
If you need to use any class attributes from Scrapy – like download_delay, download_timeout, user_agent or custom_settings – add a comment to explain why the specific value was used. That way, it is easier to check whether these configurations are still needed in the future.
Since there are many class attributes that control a spider’s behavior, please put the class attributes in this order, including comments with class names:
class NewSpider(ParentSpider):
    """
    The typical docstring.
    """
    name = 'new_spider'

    # ... Any other class attributes from Scrapy, including `download_delay`, `download_timeout`, `user_agent`, `custom_settings`

    # BaseSpider
    date_format = 'datetime'
    default_from_date = '2000-01-01T00:00:00'
    default_until_date = '2010-01-01T00:00:00'
    date_required = True
    dont_truncate = True
    encoding = 'iso-8859-1'
    concatenated_json = True
    line_delimited = True
    validate_json = True
    root_path = 'item'
    unflatten = True
    unflatten_args = {}
    ocds_version = '1.0'
    max_attempts = 5
    retry_http_codes = [429]
    skip_pluck = 'A reason'

    # SimpleSpider
    data_type = 'release_package'

    # CompressedFileSpider
    resize_package = True
    file_name_must_contain = '-'

    # LinksSpider
    formatter = staticmethod(parameters('page'))
    next_pointer = '/next_page/uri'

    # PeriodicSpider
    formatter = staticmethod(parameters('page'))
    pattern = 'https://example.com/{}'
    start_requests_callback = 'parse_list'

    # IndexSpider
    page_count_pointer = '/data/last_page'
    result_count_pointer = '/meta/count'
    limit = 1000
    use_page = True
    start_page = 0
    formatter = staticmethod(parameters('pageNumber'))
    chronological_order = 'asc'
    parse_list_callback = 'parse_page'
    param_page = 'pageNumber'
    param_limit = 'customLimit'
    param_offset = 'customOffset'
    base_url = 'https://example.com/elsewhere'

    # Local
    # ... Any other class attributes specific to this spider.
Test the spider¶
Run the spider:
scrapy crawl spider_name
It can be helpful to write the log to a file:
scrapy crawl spider_name --logfile=debug.log
Check whether the data is as expected, in format and number
Integrate it with Kingfisher Process and check for issues
Scrapy offers some debugging features that we haven’t used yet:
Telnet console for in-progress crawls
Commit the spider¶
Update docs/spiders.rst with the updatedocs command:
scrapy updatedocs
Check the metadata of all spiders, with the checkall command:
scrapy checkall --loglevel=WARNING
After reviewing the output, you can commit your changes to a branch and make a pull request.
Write a feature¶
Learn Scrapy¶
Read the Scrapy documentation. In particular, learn the data flow and architecture. When working on a specific feature, read the relevant documentation, for example:
Extensions and signals
The Command-line interface follows the guidance for running multiple spiders in the same process.
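For reference, the pattern from the Scrapy documentation for running multiple spiders in the same process looks roughly like this (a sketch with hypothetical spider names):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Use the project settings so that spiders can be looked up by name.
process = CrawlerProcess(get_project_settings())
process.crawl('first_spider')   # hypothetical spider names
process.crawl('second_spider')
process.start()  # blocks until both crawls finish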
Use Scrapy¶
The Scrapy framework is very flexible. To maintain a good separation of concerns:
A spider’s responsibility is to collect inputs. It shouldn’t perform any slow, blocking operations like writing files. It should only:
Yield requests, to be scheduled by Scrapy’s engine
Yield items, to be sent to the item pipeline
Raise a SpiderArgumentError exception in its from_crawler method, if a spider argument is invalid
Raise a MissingEnvVarError exception in its from_crawler method, if a required environment variable isn’t set
Raise an AccessTokenError exception in a request’s callback, if the maximum number of attempts to retrieve an access token is reached
Raise any other exception, to be caught by a spider_error handler in an extension
A downloader middleware’s responsibility is to process requests yielded by the spider, before they are sent to the internet, and to process responses from the internet, before they are passed to the spider. It should only:
Return a request, for example ParaguayAuthMiddleware
Return a Deferred, for example DelayedRequestMiddleware
A spider middleware’s responsibility is to process items yielded by the spider. It should only yield items, for example RootPathMiddleware.
An item pipeline’s responsibility is to clean, validate, filter, modify or substitute items. It should only:
Return an item
Raise a DropItem exception, to stop the processing of the item
Raise any other exception, to be caught by an item_error handler in an extension
An extension’s responsibility is to write outputs: for example, writing files or sending requests to external services like Kingfisher Process. It should only:
Connect signals, typically item signals and spider signals
Raise a NotConfigured exception in its from_crawler method, if a required setting isn’t set
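As an illustration, a minimal extension might look like this (a sketch only, not an existing Kingfisher Collect extension; the EXAMPLE_DIRECTORY setting and the handler bodies are hypothetical):

from scrapy import signals
from scrapy.exceptions import NotConfigured


class ExampleExtension:
    """A sketch of an extension that reacts to item and spider signals."""

    def __init__(self, directory):
        self.directory = directory

    @classmethod
    def from_crawler(cls, crawler):
        directory = crawler.settings.get('EXAMPLE_DIRECTORY')  # hypothetical setting
        if not directory:
            raise NotConfigured('EXAMPLE_DIRECTORY is not set.')
        extension = cls(directory)
        # Connect item signals and spider signals.
        crawler.signals.connect(extension.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(extension.spider_closed, signal=signals.spider_closed)
        return extension

    def item_scraped(self, item, response, spider):
        pass  # for example, write the item's data to a file under self.directory

    def spider_closed(self, spider, reason):
        pass  # for example, notify an external service that the crawl finished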
When setting a custom Request.meta key, check that the attribute name isn’t already in use by Scrapy.
Architecture decision records (ADRs)¶
Deserialization¶
Use bytes wherever possible
Deserialize at most once
Kingfisher Collect attempts to collect the data in its original format, limiting its modifications to only those necessary to yield release packages and record packages in JSON format. Modifications include:
Extract data files from archive files (CompressedFileSpider)
Convert CSV and XLSX bytes to JSON bytes (Unflatten)
Transcode non-UTF-8 bytes to UTF-8 bytes (transcode())
Correct OCDS data to enable merging releases, like filling in the ocid and date
See also
Reasons to deserialize JSON bytes include:
Perform pagination, because the API returns metadata in the response body instead of in the HTTP header (IndexSpider, LinksSpider)
Check whether it’s an error response, because the API returns a success status instead of an error status
Parse non-OCDS data to build URLs for OCDS data
Reasons to re-serialize JSON data include:
To reuse the ijson.items function (RootPathMiddleware)
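For instance, ijson.items streams the objects under a given path without deserializing the whole document (a minimal sketch with made-up data):

import io

import ijson

data = b'{"results": [{"ocid": "ocds-213czf-000-00001"}, {"ocid": "ocds-213czf-000-00002"}]}'

# Iterate over each object under the "results" array, one at a time.
for release in ijson.items(io.BytesIO(data), 'results.item'):
    print(release['ocid'])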
Update requirements¶
Update the requirements files as documented in the OCP Software Development Handbook.
Then, re-calculate the checksum for the requirements.txt file. The checksum is used by deployments to determine whether to update dependencies:
shasum -a 256 requirements.txt > requirements.txt.sha256
API reference¶
- Base Spider Classes
- Base Spider
BaseSpider
BaseSpider.VALID_DATE_FORMATS
BaseSpider.date_required
BaseSpider.dont_truncate
BaseSpider.encoding
BaseSpider.concatenated_json
BaseSpider.line_delimited
BaseSpider.validate_json
BaseSpider.root_path
BaseSpider.unflatten
BaseSpider.unflatten_args
BaseSpider.ocds_version
BaseSpider.max_attempts
BaseSpider.retry_http_codes
BaseSpider.available_steps
BaseSpider.date_format
BaseSpider.from_crawler()
BaseSpider.parse_date_argument()
BaseSpider.is_http_success()
BaseSpider.is_http_retryable()
BaseSpider.get_start_time()
BaseSpider.get_retry_wait_time()
BaseSpider.build_request()
BaseSpider.build_file_from_response()
BaseSpider.build_file()
BaseSpider.build_file_item()
BaseSpider.build_file_error_from_response()
BaseSpider.get_default_until_date()
- Compressed File Spider
- Simple Spider
- Big File Spider
- Index Spider
IndexSpider
IndexSpider.use_page
IndexSpider.start_page
IndexSpider.chronological_order
IndexSpider.param_page
IndexSpider.param_limit
IndexSpider.param_offset
IndexSpider.base_url
IndexSpider.parse_list_callback
IndexSpider.parse_list()
IndexSpider.parse_list_loader()
IndexSpider.page_count_range_generator()
IndexSpider.pages_url_builder()
IndexSpider.limit_offset_range_generator()
IndexSpider.limit_offset_url_builder()
IndexSpider.result_count_range_generator()
- Links Spider
- Periodic Spider
- Downloader Middlewares
- Spider Middlewares
- Item Pipelines
- Extensions
- Utilities
pluck_filename()
components()
parameters()
join()
handle_http_error()
date_range_by_interval()
date_range_by_month()
date_range_by_year()
get_parameter_value()
replace_parameters()
append_path_components()
add_query_string()
add_path_components()
items_basecoro()
items()
default()
json_dumps()
json_dump()
TranscodeFile
transcode_bytes()
transcode()
grouper()
get_file_name_and_extension()
- Exceptions