scrachy.middleware.filter.CachedResponseFilter
- class scrachy.middleware.filter.CachedResponseFilter(crawler: Crawler)[source]
Bases: object
Sometimes you scrape the same domains multiple times looking for new content. When crawling them, however, you may encounter pages that you have already scraped. If your extraction rules have not changed since the last crawl, it may not be worth reprocessing those pages.
This middleware checks whether a response corresponding to the request is already in the cache and is not stale. If the response is not in the cache, process_request returns immediately. Otherwise, it uses the following rules to determine whether the request should be filtered.
You can specify a set of patterns to match against the request URL. Any request whose URL is matched in part by one of these patterns will not be filtered, regardless of whether it is in the cache. This might be useful after changing parsing rules for a set of pages. The patterns are specified using the SCRACHY_CACHED_RESPONSE_FILTER_EXCLUSIONS setting, which takes a list of re.Pattern objects or strings that can be compiled to regular expressions.
Any request whose meta key dont_filter is set to True will not be processed by this middleware.
Any page that is already excluded from caching via the dont_cache request meta key will also never be filtered.
Any other request that has a fresh response in the cache will be filtered.
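The rules above can be sketched as a small stand-alone function. This is an illustrative approximation only, not the middleware's actual code: the function name should_filter, its parameters, and the example URLs are all hypothetical, and the cache lookup is reduced to a boolean flag.

```python
import re
from typing import Iterable, Mapping


def should_filter(
    url: str,
    meta: Mapping[str, object],
    exclusions: Iterable[re.Pattern],
    has_fresh_cached_response: bool,
) -> bool:
    """Hypothetical helper: True if the request would be filtered."""
    # Without a fresh cached response there is nothing to filter.
    if not has_fresh_cached_response:
        return False
    # An exclusion pattern matching any part of the URL keeps the request.
    if any(pattern.search(url) for pattern in exclusions):
        return False
    # Requests flagged with the dont_filter meta key are left alone.
    if meta.get("dont_filter"):
        return False
    # Requests excluded from caching via dont_cache are never filtered.
    if meta.get("dont_cache"):
        return False
    # Any other request with a fresh cached response is filtered.
    return True


# Example patterns (assumed, for illustration only).
exclusions = [re.compile(r"/news/"), re.compile(r"\?page=\d+")]

print(should_filter("https://example.com/news/today", {}, exclusions, True))  # False
print(should_filter("https://example.com/about", {}, exclusions, True))       # True
```

The checks are ordered so that the cheapest exit (no fresh cached response) comes first, mirroring the description that process_request returns immediately in that case.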
- Parameters:
crawler – The current crawler.
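As a rough illustration of the exclusions setting, a project's settings module might mix compiled patterns and plain strings as shown below. The patterns themselves and the compile_exclusions helper are assumptions made for this sketch; how the middleware normalizes the list internally is not documented here.

```python
import re

# Hypothetical settings.py fragment: the setting name comes from the
# documentation above, the patterns are invented for illustration.
SCRACHY_CACHED_RESPONSE_FILTER_EXCLUSIONS = [
    re.compile(r"/products/\d+"),  # an already compiled pattern
    r"/category/.*\?sort=",        # a string that can be compiled
]


def compile_exclusions(raw):
    """Hypothetical helper: normalize a mixed list to compiled patterns."""
    return [p if isinstance(p, re.Pattern) else re.compile(p) for p in raw]


compiled = compile_exclusions(SCRACHY_CACHED_RESPONSE_FILTER_EXCLUSIONS)
```

Because the patterns only need to match part of the URL, they would typically be applied with search rather than a full match.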
- __init__(crawler: Crawler)[source]
- Parameters:
crawler – The current crawler.
Methods
__init__(crawler) – Sometimes you scrape the same domains multiple times looking for new content.
from_crawler(crawler)
process_request(request, spider)
- Parameters:
request – The Scrapy request.