scrachy.middleware.filter.CachedResponseFilter

class scrachy.middleware.filter.CachedResponseFilter(crawler: Crawler)[source]

Bases: object

Sometimes you scrape the same domains multiple times looking for new content. However, when crawling them you might encounter pages that you have already scraped. If your extraction rules have not changed since the last crawl it may not be worth reprocessing those pages.

This middleware checks whether a response corresponding to the request is already in the cache and is not stale. If no such response is in the cache, process_request returns immediately without filtering the request. Otherwise, it uses the following rules to determine whether the request should be filtered.

  • You can specify a set of patterns to match against the request URL. A request whose URL is matched, even in part, by any of the patterns will not be filtered, regardless of whether its response is in the cache. This might be useful after changing parsing rules for a set of pages. The patterns are specified using the SCRACHY_CACHED_RESPONSE_FILTER_EXCLUSIONS setting, which takes a list of re.Pattern objects or strings that can be compiled to regular expressions.

  • Any request with the meta key dont_filter set to True will not be processed by this middleware.

  • Any page that is already excluded from caching via the dont_cache request meta key will also never be filtered.

Any other request that has a fresh response in the cache will be filtered.
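The exclusion setting described above could be configured along the following lines. This is a minimal sketch, not a recommended configuration: it assumes the filter is enabled as a Scrapy downloader middleware, and the priority value and both patterns are hypothetical placeholders.

```python
import re

# Assumption: the filter is registered as a Scrapy downloader middleware.
# The priority value 50 is illustrative only; choose one that fits your project.
DOWNLOADER_MIDDLEWARES = {
    "scrachy.middleware.filter.CachedResponseFilter": 50,
}

# Requests whose URL matches any of these patterns are never filtered.
# Entries may be compiled patterns or strings that compile to regular expressions.
SCRACHY_CACHED_RESPONSE_FILTER_EXCLUSIONS = [
    re.compile(r"/news/\d{4}/"),  # hypothetical: pages whose parsing rules changed
    r"example\.com/index",        # hypothetical: the plain string form
]
```

Because a pattern only needs to match part of the URL, a short pattern can exclude a large set of pages, so narrower patterns may be preferable.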

Parameters:

crawler – The current crawler.

__init__(crawler: Crawler)[source]

Parameters:

crawler – The current crawler.

Methods

__init__(crawler)

Create the middleware for the current crawler.

from_crawler(crawler)

process_request(request, spider)

Filter the request if a fresh response for it is already cached and no exclusion rule applies.

process_request(request: Request, spider: Spider)[source]
Parameters:
  • request – The Scrapy request.

  • spider – The Scrapy Spider issuing the request.

Raises:

IgnoreRequest – If a fresh response for the request is already cached and the request does not meet any of the exclusion rules.
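The documented filtering rules can be summarized in a small standalone function. This is a sketch of the decision logic only, not the library's implementation: the name should_filter and its parameters are hypothetical, and the cache lookup is reduced to a boolean supplied by the caller.

```python
import re


def should_filter(url, meta, has_fresh_cached_response, exclusions=()):
    """Return True if a request would be filtered under the documented rules.

    Hypothetical helper for illustration; it is not part of scrachy.
    """
    # No fresh response in the cache: nothing to filter.
    if not has_fresh_cached_response:
        return False
    # Requests marked dont_filter or dont_cache are never filtered.
    if meta.get("dont_filter") or meta.get("dont_cache"):
        return False
    # A pattern matching any part of the URL exempts the request.
    # re.search accepts both strings and compiled patterns.
    for pattern in exclusions:
        if re.search(pattern, url):
            return False
    # Any other request with a fresh cached response is filtered.
    return True
```

In the middleware itself, a result of True would correspond to process_request raising IgnoreRequest.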