Contributing

There are two main types of contributions: spiders and features.

Write a spider

Learn the data source’s access methods

Read its API documentation or bulk download documentation. Navigate the API in your browser or with curl. Inspect its responses to determine where the OCDS data is located and whether they include information like pagination links, total pages or total results.
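For example, a quick way to inspect one response is a few lines of Python (the endpoint below is hypothetical; curl piped to a JSON pretty-printer works just as well):

import json
from urllib.request import urlopen

# Hypothetical API method: replace with the endpoint you are exploring.
url = 'https://example.com/api/record-packages?page=1'

with urlopen(url) as response:
    data = json.load(response)

# Is the release package or record package at the top level of the JSON data?
print(list(data))         # e.g. ['uri', 'publisher', 'records', 'links']
# Is there pagination or result-count metadata?
print(data.get('links'))
print(data.get('count'))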

Note

Please inform the data support manager of the following, so that it can be reported as feedback to the publisher:

  • If there is no documentation about access methods

  • If the release package or record package is not at the top-level of the JSON data

Choose a spider name

Lowercase and join the components below with underscores. Replace any spaces with underscores.

For an access method to a jurisdiction-specific data source:

  • Country name. Do not use acronyms, like “uk”. If in doubt, follow ISO 3166-1. For example: “kyrgyzstan”, not “kyrgyz_republic”. For a non-country like the European Union, use the relevant geography, like “europe”.

  • Subdivision name. Do not use acronyms, like “nsw”. Omit the subdivision type, like “state”, unless it is typically included, like in Nigeria. If in doubt, follow ISO 3166-2.

  • System name, if needed. Acronyms are allowed, like “ted”.

  • Publisher name, if needed. Required if the publisher is not a government.

  • Disambiguator, if needed. For example: “historical”.

  • Access method, if needed: “bulk” or “api”.

  • OCDS format, if needed: “releases”, “records”, “release_packages” or “record_packages”.

For an access method to a multi-jurisdiction data source:

  • Organization name

  • Disambiguator

If a component repeats another, you can omit or abbreviate the component, like peru_compras instead of peru_peru_compras.

It is not required for the name to be minimal. For example, uganda_releases is allowed even if there is no uganda_records.

If you create a new base class, omit the components that are not shared, and add “base” to the end. For example, the afghanistan_packages_base.py file contains the base class for the afghanistan_record_packages and afghanistan_release_packages spiders.
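For illustration, the pattern looks roughly like this. The names, the choice of SimpleSpider and the import path are hypothetical; in the repository, each class lives in its own file in the spiders directory, with the child spiders importing the base class.

from kingfisher_scrapy.base_spiders import SimpleSpider  # assumed import path


# example_packages_base.py: omit the unshared components, add "base" to the end.
class ExamplePackagesBase(SimpleSpider):
    # attributes and methods shared by both child spiders, e.g. start_requests()
    ...


# example_release_packages.py
class ExampleReleasePackages(ExamplePackagesBase):
    name = 'example_release_packages'
    data_type = 'release_package'


# example_record_packages.py
class ExampleRecordPackages(ExamplePackagesBase):
    name = 'example_record_packages'
    data_type = 'record_package'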

Note

The primary goal is for users to easily find the relevant spider. Keeping the name short and avoiding repetition is a secondary goal. For example, for mexico_veracruz_ivai, the v in ivai repeats veracruz, and for mexico_mexico_state_infoem, the em in infoem repeats mexico_state (Estado de México), but abbreviating the ivai or infoem acronyms would be less familiar and recognizable to users.

Choose a base class

Access methods for OCDS data are very similar. Spiders therefore share a lot of logic by inheriting from one of the Base Spider Classes:

  • IndexSpider: Use if the API includes the total number of results or pages in its response.

  • PeriodicSpider: Use if the bulk downloads or API methods accept a year, year-month or date as a query string parameter or URL path component.

  • LinksSpider: Use if the API implements pagination and if IndexSpider can’t be used.

  • CompressedFileSpider: Use if the downloads are ZIP or RAR files.

  • BigFileSpider: Use if the packages are very large JSON files (e.g. over 500MB), which can cause memory issues for users.

  • SimpleSpider: Use in almost all other cases. IndexSpider, PeriodicSpider and LinksSpider are child classes of this class.

  • BaseSpider: All spiders inherit, directly or indirectly, from this class, which in turn inherits from scrapy.Spider. Use if none of the above can be used.

Write the spider

After choosing a base class, read its documentation, as well as its parent class’ documentation. It’s also helpful to read existing spiders that inherit from the same class. A few other pointers:

  • Write different callback methods for different response types. Writing a single callback with many if-else branches to handle different response types is very hard to reason about.

  • The default parse callback method should be for “leaf” responses: that is, responses that cause no further requests to be yielded, besides pagination requests.

  • Have a look at the util module, which contains useful functions, notably handle_http_error() (used in the sketch below).
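A minimal sketch of these pointers, assuming the base classes and the util module are importable as shown below; the URLs, the urls key and the file_name request meta key are illustrative:

import scrapy

from kingfisher_scrapy.base_spiders import SimpleSpider
from kingfisher_scrapy.util import handle_http_error


class Example(SimpleSpider):
    name = 'example'
    data_type = 'release_package'

    def start_requests(self):
        # The list response gets its own callback, instead of branching in parse().
        yield scrapy.Request('https://example.com/api/list', callback=self.parse_list)

    @handle_http_error
    def parse_list(self, response):
        for url in response.json()['urls']:  # hypothetical response structure
            # "Leaf" responses go to the default parse() callback, which
            # SimpleSpider implements.
            yield scrapy.Request(url, meta={'file_name': url.rsplit('/', 1)[-1]})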

After writing the spider, add a docstring for spider metadata.

Note

If you encountered any challenges, make a note in the Caveats section of the spider metadata, and inform the data support manager, so that it can be reported as feedback to the publisher. Examples:

  • Requests sometimes fail (e.g. timeout or error), but succeed on retry.

  • Some requests always fail (e.g. for a specific date).

  • Error responses are returned with an HTTP 200 status code, instead of a status code in the range 400-599.

  • The JSON data is encoded as ISO-8859-1, instead of UTF-8 per RFC 8259.

  • The JSON data sometimes contains unescaped newline characters within strings.

  • The number of results is limited to 10,000.

If you need to use any class attributes from Scrapy – like download_delay, download_timeout, user_agent or custom_settings – add a comment to explain why the specific value was used. That way, it is easier to check whether these configurations are still needed in the future.
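For example (the value and the stated reason are hypothetical):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    # The API can take several minutes to generate large exports (hypothetical
    # reason), so the default download timeout is too short.
    download_timeout = 600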

Since many class attributes control a spider’s behavior, please put the class attributes in the order below, grouping them under comments that name the class that defines each attribute:

class NewSpider(ParentSpider):
   """
   The typical docstring.
   """
   name = 'new_spider'
   # ... Any other class attributes from Scrapy, including `download_delay`, `download_timeout`, `user_agent`, `custom_settings`

   # BaseSpider
   date_format = 'datetime'
   default_from_date = '2000-01-01T00:00:00'
   default_until_date = '2010-01-01T00:00:00'
   date_required = True
   dont_truncate = True
   encoding = 'iso-8859-1'
   concatenated_json = True
   line_delimited = True
   validate_json = True
   root_path = 'item'
   unflatten = True
   unflatten_args = {}
   ocds_version = '1.0'
   max_attempts = 5
   retry_http_codes = [429]
   skip_pluck = 'A reason'

   # SimpleSpider
   data_type = 'release_package'

   # CompressedFileSpider
   resize_package = True
   file_name_must_contain = '-'

   # LinksSpider
   formatter = staticmethod(parameters('page'))
   next_pointer = '/next_page/uri'

   # PeriodicSpider
   formatter = staticmethod(parameters('page'))
   pattern = 'https://example.com/{}'
   start_requests_callback = 'parse_list'

   # IndexSpider
   page_count_pointer = '/data/last_page'
   result_count_pointer = '/meta/count'
   limit = 1000
   use_page = True
   start_page = 0
   formatter = staticmethod(parameters('pageNumber'))
   chronological_order = 'asc'
   parse_list_callback = 'parse_page'
   param_page = 'pageNumber'
   param_limit = 'customLimit'
   param_offset = 'customOffset'
   base_url = 'https://example.com/elsewhere'

   # Local
   # ... Any other class attributes specific to this spider.

Test the spider

  1. Run the spider:

    scrapy crawl spider_name
    

    It can be helpful to write the log to a file:

    scrapy crawl spider_name --logfile=debug.log
    
  2. Check the log for errors and warnings

  3. Check whether the data is as expected, in format and number

  4. Integrate it with Kingfisher Process and check for issues

Scrapy offers some debugging features that we haven’t used yet, such as the Scrapy shell, the scrapy parse command and the telnet console.

Commit the spider

  1. Update docs/spiders.rst with the updatedocs command:

    scrapy updatedocs
    
  2. Check the metadata of all spiders, with the checkall command:

    scrapy checkall --loglevel=WARNING
    

After reviewing the output, you can commit your changes to a branch and make a pull request.

Write a feature

Learn Scrapy

Read the Scrapy documentation. In particular, learn the data flow and architecture. When working on a specific feature, read the relevant documentation, for example:

  • The Command-line interface follows the guidance for running multiple spiders in the same process.

Use Scrapy

The Scrapy framework is very flexible. To maintain a good separation of concerns:

  • A spider’s responsibility is to collect inputs. It shouldn’t perform any slow, blocking operations like writing files. It should only:

    • Yield requests, to be scheduled by Scrapy’s engine

    • Yield items, to be sent to the item pipeline

    • Raise a SpiderArgumentError exception in its from_crawler method, if a spider argument is invalid (see the sketch after this list)

    • Raise a MissingEnvVarError exception in its from_crawler method, if a required environment variable isn’t set

    • Raise an AccessTokenError exception in a request’s callback, if the maximum number of attempts to retrieve an access token is reached

    • Raise any other exception, to be caught by a spider_error handler in an extension

  • A downloader middleware’s responsibility is to process requests yielded by the spider, before they are sent to the internet, and to process responses from the internet, before they are passed to the spider.

  • A spider middleware’s responsibility is to process items yielded by the spider. It should only yield items, for example RootPathMiddleware.

  • An item pipeline’s responsibility is to clean, validate, filter, modify or substitute items. It should only:

    • Return an item

    • Raise a DropItem exception, to stop the processing of the item

    • Raise any other exception, to be caught by an item_error handler in an extension

  • An extension’s responsibility is to write outputs: for example, writing files or sending requests to external services like Kingfisher Process.

When setting a custom Request.meta key, check that the attribute name isn’t already in use by Scrapy.
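A minimal sketch of a spider and an item pipeline following these rules. The kingfisher_scrapy.exceptions import path and the items’ data field are assumptions; DropItem and from_crawler are standard Scrapy:

import scrapy
from scrapy.exceptions import DropItem

from kingfisher_scrapy.exceptions import SpiderArgumentError  # assumed import path


class ExampleSpider(scrapy.Spider):
    name = 'example'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Fail fast on an invalid spider argument (passed with -a year=2020),
        # instead of yielding requests that are doomed to fail.
        year = getattr(spider, 'year', None)
        if year and not year.isdigit():
            raise SpiderArgumentError(f'spider argument `year`: invalid year: {year!r}')
        return spider


class DropEmptyData:
    """A hypothetical item pipeline that drops items with no data."""

    def process_item(self, item, spider):
        if not item['data']:  # assumes items carry their payload in a `data` field
            raise DropItem('empty data')
        return item

As usual in Scrapy, the pipeline would be enabled through the ITEM_PIPELINES setting.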

Architecture decision records (ADRs)

Deserialization

  • Use bytes wherever possible

  • Deserialize at most once

Kingfisher Collect attempts to collect the data in its original format, limiting its modifications to only those necessary to yield release packages and record packages in JSON format. Modifications include:

  • Extract data files from archive files (CompressedFileSpider)

  • Convert CSV and XLSX bytes to JSON bytes (Unflatten)

  • Transcode non-UTF-8 bytes to UTF-8 bytes (transcode())

  • Correct OCDS data to enable merging releases, like filling in the ocid and date

Reasons to deserialize JSON bytes include:

  • Perform pagination, because the API returns metadata in the response body instead of in the HTTP headers (IndexSpider, LinksSpider); see the sketch after this list

  • Check whether it’s an error response, because the API returns a success status instead of an error status

  • Parse non-OCDS data to build URLs for OCDS data
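A sketch of how these rules play out in a spider callback: the response bytes are deserialized once, to read pagination metadata, and the original bytes are yielded unmodified. The links.next pointer is hypothetical, and the build_file_from_response helper and file_name meta key are indicative of the project’s conventions rather than exact signatures:

import json

import scrapy

from kingfisher_scrapy.base_spiders import SimpleSpider  # assumed import path


class Example(SimpleSpider):
    name = 'example'
    data_type = 'release_package'

    def parse(self, response):
        # Deserialize the bytes once, to read the pagination metadata.
        data = json.loads(response.body)

        next_url = data.get('links', {}).get('next')  # hypothetical pointer
        if next_url:
            yield scrapy.Request(next_url, meta={'file_name': 'next.json'})

        # Yield the original bytes rather than re-serializing `data`, so the
        # collected file matches the source.
        yield self.build_file_from_response(response, data_type=self.data_type)

In practice, LinksSpider encapsulates this pattern via next_pointer; the sketch just makes the deserialize-once rule explicit.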

Reasons to re-serialize JSON data include applying the modifications above that require deserialization, like correcting OCDS data to enable merging releases.

Update requirements

Update the requirements files as documented in the OCP Software Development Handbook.

Then, re-calculate the checksum for the requirements.txt file. The checksum is used by deployments to determine whether to update dependencies:

shasum -a 256 requirements.txt > requirements.txt.sha256
