Log files

You can use Scrapy’s log files to identify and debug issues in data collection.

If you can’t debug an issue on your own, create an issue on GitHub.

1. Check for unhandled errors

Kingfisher Collect can handle errors and continue crawling. This step checks whether any errors were unhandled.

Read the last line of the log file:

tail -n 1 logfile.log

If the line looks like:

2020-01-10 12:34:56 [scrapy.core.engine] INFO: Spider closed (REASON)

Then all errors were handled. Otherwise, either a shutdown was forced (e.g. pressing Ctrl-C twice), or an error was unhandled. You can read the lines at the end of the file for context:

less +G logfile.log

2. Check the reason for closing the spider

The last line of the log file, read in the previous step, looks like:

2020-01-10 12:34:56 [scrapy.core.engine] INFO: Spider closed (REASON)

Reasons implemented by Scrapy’s core are:

cancelled

The spider closed for an unknown reason

finished

The crawl finished, and any errors were handled

shutdown

The crawl was shut down gracefully (e.g. by pressing Ctrl-C once)

Reasons implemented by Scrapy’s extensions are:

closespider_errorcount

The spider reached the maximum error count, set by the CLOSESPIDER_ERRORCOUNT setting (the pluck and crawlall commands set it to 1; otherwise, it is off by default)

closespider_itemcount

The spider reached the maximum item count, set by the CLOSESPIDER_ITEMCOUNT setting (off by default)

closespider_pagecount

The spider reached the maximum page count, set by the CLOSESPIDER_PAGECOUNT setting (off by default)

closespider_timeout

The spider remained open longer than the timeout, set by the CLOSESPIDER_TIMEOUT setting (off by default)

memusage_exceeded

The spider exceeded the maximum allowed memory, set by the MEMUSAGE_LIMIT_MB setting (off by default)

Reasons implemented by Kingfisher Collect are:

sample

The crawl reached the maximum sample size when using the sample spider argument

3. Check the crawl statistics

Extract the crawl statistics:

tac logfile.log | grep -B99 statscollectors | tac

Read the numbers of error messages

  • log_count/CRITICAL

  • log_count/ERROR

  • log_count/WARNING

If there are any, filter for and read the messages, for example:

grep WARNING logfile.log

Some messages mean that action is needed. The action might be to fix a bug, or to add a try-statement to catch an exception. If you don’t know what action is needed, create an issue with the name of the spider and an excerpt of the log, including the log message and the full traceback.
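
For example, to capture an error message together with the traceback that follows it, ask grep for trailing context (the pattern and the number of context lines here are only examples; adjust them to the message you are investigating):

grep -A 20 'Spider error processing' logfile.log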

CRITICAL: Unhandled error in Deferred:

An exception was raised before the spider was opened, like SpiderArgumentError, in which case the problem is in the user’s input.

ERROR: Spider error processing <GET https:…> (referer: None)

An exception was raised in the spider’s code. (See the spider_exceptions/… statistics below.)

Attention

Action needed

ERROR: Error processing {…}

An exception was raised in an item pipeline, like jsonschema.exceptions.ValidationError.

Attention

Action needed

ERROR: Error caught on signal handler: …

An exception was raised in an extension.

Attention

Action needed

ERROR: Error downloading <GET https:…>

An exception was raised by the downloader, typically after failed retries by the RetryMiddleware downloader middleware. (See the downloader/exception_type_count/… statistics below.)

ERROR: Failed to create collection: HTTP {code} ({text}) {{headers}}

Issued by the KingfisherProcessAPI2 extension.

Attention

Action needed

Run the ./manage.py load command in Kingfisher Process, once the crawl is finished.

ERROR: Failed to close collection: HTTP {code} ({text}) {{headers}}

Issued by the KingfisherProcessAPI2 extension.

Attention

Action needed

Run the ./manage.py closecollection command in Kingfisher Process.

WARNING: Dropped: Duplicate File: ‘….json’

Issued by the Validate pipeline.

Potential action

Check whether the key collisions are caused by identical items, or by different items. If by different items, the spider needs to be updated to assign keys without collisions.
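
One way to see which keys collide, and how often, is to count the dropped-item warnings per file name (a rough check only; whether the colliding items are identical still needs to be confirmed against the source data):

grep -o 'Duplicate File: .*' logfile.log | sort | uniq -c | sort -rn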

WARNING: Got data loss in https:…. If you want to process broken responses set the setting DOWNLOAD_FAIL_ON_DATALOSS = False – This message won’t be shown in further requests

Issued by Scrapy if the Content-Length header doesn’t match the bytes received, after which Scrapy retries the request. If you don’t trust the Content-Length header, set either the DOWNLOAD_FAIL_ON_DATALOSS key of the spider’s custom_settings attribute or the download_fail_on_dataloss key of the request’s meta attribute to False.
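
If you only need to override the setting for a single crawl, Scrapy also accepts setting overrides on the command line with the -s option (the spider name below is a placeholder):

scrapy crawl spider_name -s DOWNLOAD_FAIL_ON_DATALOSS=False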

WARNING: Expected response size (987654321) larger than download warn size (123456789) in request <GET https:…>.

Issued based on the DOWNLOAD_WARNSIZE setting, download_warnsize spider attribute or download_warnsize Request.meta key. Can be ignored.

WARNING: Received more bytes than download warn size (123456789) in request <GET https:…>.

Issued based on the DOWNLOAD_WARNSIZE setting, download_warnsize spider attribute or download_warnsize Request.meta key. Can be ignored.

WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by ‘…’: …

Issued by Scrapy’s RetryMiddleware downloader middleware. Can be ignored.

Read the numbers of successful response status codes

  • downloader/response_status_count/2...

Decide whether the number is as expected. If the statistic isn’t present, there were no successful responses.

Read the numbers of error response status codes

  • downloader/response_status_count/[4-5]...

You can look up a status code’s semantics. Decide whether the numbers are elevated.
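
If the log includes DEBUG messages, you can also list the failing requests themselves. Scrapy logs each response as “Crawled (status)”, so a filter like the following (the status code pattern is only an example) shows which URLs returned client or server errors:

grep -E 'Crawled \([45][0-9]{2}\)' logfile.log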

Some spiders can recover from errors, for example:

  • 401 Unauthorized: request a new access token

  • 429 Too Many Requests: back off and retry

  • 503 Service Unavailable: back off and retry

  • … or try different parameters until a request succeeds

Unrecoverable errors are yielded as FileError items (see 4. Check for FileError items).

Read the numbers of spider exceptions

  • spider_exceptions/...

If there are any, filter for and read the message(s) in which the exception is logged. (See the ERROR: Spider error processing error message above.)
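
For example, to filter for that message (the pattern matches the error message shown above):

grep 'Spider error processing' logfile.log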

Read the numbers of downloader exceptions

  • downloader/exception_type_count/...

If there are any, filter for and read the message(s) in which the exception is logged. (See the ERROR: Error downloading error message above.)
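
For example, to filter for that message (the pattern matches the error message shown above):

grep 'Error downloading' logfile.log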

The downloader/exception_count statistic is the total number of all types of downloader exceptions.

Read the number of requests for which the maximum number of retries was reached

  • retry/max_reached

The maximum is set by the RETRY_TIMES setting or the max_retry_times Request.meta key.

If the maximum is reached, read the exceptions causing retries:

  • retry/reason_count/...

Then, filter for and read the message(s) in which the exception is logged.
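
For example, if the statistic names an exception like twisted.internet.error.TimeoutError, filter on the exception name (the class here is only an example):

grep TimeoutError logfile.log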

Note

The following statistics are not presently collected:

httperror/response_ignored_count

Collected if the HTTPERROR_ALLOW_ALL setting is False.

scheduler/unserializable

Collected if the SCHEDULER_DEBUG setting is True.

Read the number of duplicate requests

  • dupefilter/filtered

Spiders should not send duplicate requests. A duplicate request might indicate a bug in the spider’s implementation.

Presently, only the first duplicate request is logged. The line looks like:

2020-01-10 12:34:56 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://...> (referer: http://...)

4. Check for FileError items

Kingfisher Collect yields some errors as FileError items. You can open the log file and search for 'errors': to get more context on each error.
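
For example, to find the relevant lines and their line numbers:

grep -n "'errors':" logfile.log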