scrachy.middleware.httpcache.AlchemyCacheStorage
- class scrachy.middleware.httpcache.AlchemyCacheStorage(settings: Settings)
Bases: object
This class implements a Scrapy cache storage backend that uses a relational database to store the cached documents.
- Parameters:
settings – The Scrapy project settings.
- __init__(settings: Settings)
This class implements a Scrapy cache storage backend that uses a relational database to store the cached documents.
- Parameters:
settings – The Scrapy project settings.
Methods
__init__(settings) – This class implements a Scrapy cache storage backend that uses a relational database to store the cached documents.
clear_cache()
close_spider([spider]) – Dispose of the SQLAlchemy Engine.
dump_cache() – Dump the contents of the cache.
get(name[, default]) – Get a Scrachy setting without having to prefix the name with SCRACHY_.
open_spider(spider[, engine]) – Connect to the database, validate the settings and set up the database tables if necessary.
retrieve_response(spider, request) – Retrieves an item from the cache if it exists; otherwise this returns None to signal downstream processes to continue retrieving the page normally.
store_response(spider, request, response) – Stores the response in the cache.
validate_settings() – This makes sure that any setting starting with the prefix SCRACHY is known to the storage backend.
Attributes
activation_delay
bs4_parser – The parser to use for parsing HTML with BeautifulSoup.
database – The name of the database to connect to.
default_encoding
dialect – The dialect used for connecting to the database server as specified in the project settings.
driver – The name of the driver to use with the database or None to use the default driver provided by SQLAlchemy.
engine_connect_args
expiration_secs – The value of the Scrapy HTTPCACHE_EXPIRATION_SECS setting.
fingerprinter_hasher_import_path
fingerprinter_implementation
fingerprinter_import_path
host – Returns the host name specified in the project settings.
is_scrachy_fingerprinter
is_scrapy_fingerprinter
port – The port used to connect to the database as specified in the project settings.
response_retrieval_method – The name of the response retrieval method.
save_history
schema – The name of the schema the tables will be stored in.
- property bs4_parser: str
The parser to use for parsing HTML with BeautifulSoup.
- Returns:
- close_spider(spider: Spider | None = None)
Dispose of the SQLAlchemy Engine.
- Parameters:
spider – The Scrapy spider.
- property database: str | None
The name of the database to connect to. The only time this should be None is when using an in-memory sqlite database (which mostly defeats the purpose of a cache storage engine).
- Returns:
The name of the database.
- property dialect: str
The dialect used for connecting to the database server as specified in the project settings. A dialect is always required and should never be None. For supported dialects and drivers see the SQLAlchemy website.
- Returns:
The dialect name.
- property driver: str | None
The name of the driver to use with the database or None to use the default driver provided by SQLAlchemy.
- Returns:
The driver name.
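The dialect, driver, host, port and database properties line up with the parts of an SQLAlchemy database URL, which has the general shape `dialect+driver://host:port/database`. The following is a minimal sketch, assuming the backend combines them in that way. `build_database_url` is a hypothetical helper, not part of Scrachy, and credentials are left out because this section does not describe them.

```python
from typing import Optional


def build_database_url(
    dialect: str,
    driver: Optional[str] = None,
    host: Optional[str] = None,
    port: Optional[int] = None,
    database: Optional[str] = None,
) -> str:
    """Assemble an SQLAlchemy-style URL string (hypothetical helper)."""
    # The driver is optional; without it SQLAlchemy picks its default.
    scheme = f"{dialect}+{driver}" if driver else dialect

    # Host and port are both optional (sqlite uses neither).
    netloc = host or ""
    if port is not None:
        netloc = f"{netloc}:{port}"

    # No database name yields e.g. "sqlite://", an in-memory database.
    path = f"/{database}" if database else ""
    return f"{scheme}://{netloc}{path}"


in_memory = build_database_url("sqlite")
on_disk = build_database_url("sqlite", database="cache.db")
server = build_database_url("postgresql", "psycopg2", "localhost", 5432, "cache")
```

With these inputs, `in_memory` is `sqlite://`, `on_disk` is `sqlite:///cache.db`, and `server` is `postgresql+psycopg2://localhost:5432/cache`.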
- dump_cache() → list[Response]
Dump the contents of the cache. This is not recommended except for debugging.
- Returns:
A list of SQLAlchemy result objects that contains all the items in the cache.
- property expiration_secs: int
The value of the Scrapy HTTPCACHE_EXPIRATION_SECS setting.
- Returns:
The number of seconds before a cached item becomes stale. Stale items are re-downloaded and processed through the normal pipeline even if they are still in the cache.
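A staleness check of this kind can be sketched as follows. This is an illustration only: `is_stale` is a hypothetical function, and it assumes the backend follows Scrapy's documented convention that an HTTPCACHE_EXPIRATION_SECS of 0 means cached items never expire.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    stored_at: datetime,
    expiration_secs: int,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if an item stored at `stored_at` should be re-downloaded."""
    # Scrapy documents 0 as "never expire"; assumed to apply here as well.
    if expiration_secs <= 0:
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    # The item is stale once its age exceeds the configured lifetime.
    return now - stored_at > timedelta(seconds=expiration_secs)


stored = datetime(2024, 1, 1, tzinfo=timezone.utc)
fresh = is_stale(stored, 60, now=stored + timedelta(seconds=30))
expired = is_stale(stored, 60, now=stored + timedelta(seconds=61))
```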
- get(name: str, default: Any | None = None) → Any | None
Get a Scrachy setting without having to prefix the name with SCRACHY_.
- Returns:
The value of the setting or None if it is not set.
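The prefix handling can be pictured with a plain dictionary standing in for the Scrapy settings object. This is a sketch under that assumption, not the library's implementation; `get_scrachy_setting` is a hypothetical name.

```python
from typing import Any, Mapping, Optional

PREFIX = "SCRACHY_"


def get_scrachy_setting(
    settings: Mapping[str, Any],
    name: str,
    default: Optional[Any] = None,
) -> Any:
    """Look up `name` under the SCRACHY_ prefix in a plain mapping."""
    return settings.get(f"{PREFIX}{name}", default)


# A plain dict stands in for the Scrapy Settings object in this sketch.
settings = {"SCRACHY_RESPONSE_RETRIEVAL_METHOD": "full"}

method = get_scrachy_setting(settings, "RESPONSE_RETRIEVAL_METHOD")
missing = get_scrachy_setting(settings, "NOT_SET")
```

Here `method` is `"full"` and `missing` is `None`, since no default was supplied.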
- property host: str | None
Returns the host name specified in the project settings.
- Returns:
The host name.
- open_spider(spider: Spider, engine: Engine | None = None)
Connect to the database, validate the settings and set up the database tables if necessary.
- Parameters:
spider – The Scrapy spider.
engine – Use this engine instead of creating a new one.
- property port: int | None
The port used to connect to the database as specified in the project settings. For sqlite this should return None. None also represents the default port for other database dialects. An error is raised if the port specified in the settings cannot be cast to an integer.
- Returns:
The port.
- Raises:
ValueError – If the port is not None and cannot be cast to an int.
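The rule described above (None passes through, anything else must be castable to an int) can be sketched like this. `parse_port` is a hypothetical helper written for illustration and is not taken from the Scrachy source.

```python
from typing import Optional, Union


def parse_port(raw: Union[str, int, None]) -> Optional[int]:
    """Return the port as an int, or None when no port is configured."""
    # None means "no port": sqlite, or the dialect's default port.
    if raw is None:
        return None
    try:
        return int(raw)
    except (TypeError, ValueError) as exc:
        # Normalize every cast failure to ValueError, as documented above.
        raise ValueError(f"Port {raw!r} cannot be cast to an int") from exc


default_port = parse_port(None)
explicit_port = parse_port("5432")
```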
- property response_retrieval_method: Literal['minimal', 'standard', 'full']
The name of the response retrieval method.
This determines how much information to retrieve in the response.
- minimal
This returns the minimal amount of information and should be the fastest because it does not require any joins. However, it will return null values for the response status and headers. Use this method if you don't need these or the more detailed information.
- standard
This returns the standard information a scrapy.http.HtmlResponse does.
- full
This returns a scrachy.http.CachedResponse, which contains all the information available for an item in the cache.
- Returns:
The type of response to retrieve.
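A project might select the retrieval method in its settings module roughly as shown below. This fragment is a sketch: HTTPCACHE_ENABLED and HTTPCACHE_STORAGE are standard Scrapy settings and the storage path is the class documented on this page, but whether Scrachy needs further settings to work is not covered here and is left as an open assumption.

```python
# settings.py (fragment)

# Standard Scrapy settings that enable the HTTP cache and select its backend.
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = "scrachy.middleware.httpcache.AlchemyCacheStorage"

# One of "minimal", "standard" or "full", as described above.
SCRACHY_RESPONSE_RETRIEVAL_METHOD = "standard"
```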
- retrieve_response(spider: Spider, request: Request) → CachedTextResponse | CachedXmlResponse | CachedHtmlResponse | None
Retrieves an item from the cache if it exists; otherwise this returns None to signal downstream processes to continue retrieving the page normally. Depending on the value of the SCRACHY_RESPONSE_RETRIEVAL_METHOD setting, more or less information may be returned in the response.
- Parameters:
spider – The Scrapy Spider requesting the data.
request – The request describing what information to retrieve.
- Returns:
If the page is in the cache, this returns a cached response such as a CachedHtmlResponse; otherwise it returns None.
- property schema: str | None
The name of the schema the tables will be stored in.
- Returns:
The name of the schema.
- store_response(spider: Spider, request: Request, response: Response)
Stores the response in the cache.
- Parameters:
spider – The Scrapy Spider issuing the request.
request – The request describing what data is desired.
response – The response to be stored in the cache as created by Scrapy’s standard downloading process.
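The contract between retrieve_response and store_response (return None on a miss, then store what the normal download produces) can be sketched with an in-memory stand-in. Everything here is hypothetical: `InMemoryStorage`, `fetch` and `fake_download` exist only for this illustration, and plain strings replace Scrapy's Request and Response objects.

```python
from typing import Callable, Dict, List, Optional


class InMemoryStorage:
    """Hypothetical stand-in that mimics the retrieve/store contract."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def retrieve_response(self, spider: object, request: str) -> Optional[str]:
        # None signals "not cached": the caller should download normally.
        return self._items.get(request)

    def store_response(self, spider: object, request: str, response: str) -> None:
        self._items[request] = response


def fetch(
    storage: InMemoryStorage,
    spider: object,
    request: str,
    download: Callable[[str], str],
) -> str:
    """Serve from the cache when possible, otherwise download and store."""
    cached = storage.retrieve_response(spider, request)
    if cached is not None:
        return cached
    response = download(request)
    storage.store_response(spider, request, response)
    return response


calls: List[str] = []


def fake_download(url: str) -> str:
    # Record each real "download" so cache hits can be told apart.
    calls.append(url)
    return f"<html>{url}</html>"


storage = InMemoryStorage()
first = fetch(storage, None, "https://example.com", fake_download)
second = fetch(storage, None, "https://example.com", fake_download)
```

The second call is answered from the stand-in cache, so `fake_download` runs only once.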
- validate_settings()
This makes sure that any setting starting with the prefix SCRACHY is known to the storage backend.
It performs some minor validation, such as checking that the port is an integer and that a host name is specified unless the dialect is sqlite. It is still primarily up to the user to ensure the database connection properties are valid for the type of database being used.
- Raises:
If any of the following is found:
unknown Scrachy settings.
invalid database settings.
an option to a setting that is not valid.
a hash algorithm for the request fingerprint, as specified in the project settings, that differs from the one already used for this cache region.
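Two of the checks listed above, unknown prefixed settings and an invalid option value, might look like the sketch below. It is illustrative only: `validate` and `KNOWN_SETTINGS` are hypothetical, the known set contains just the one setting name mentioned on this page, and raising ValueError is an assumption because the exception type is not stated here.

```python
from typing import Any, Mapping

# Hypothetical set of known names; only this one appears on this page.
KNOWN_SETTINGS = {"SCRACHY_RESPONSE_RETRIEVAL_METHOD"}
VALID_METHODS = {"minimal", "standard", "full"}


def validate(settings: Mapping[str, Any]) -> None:
    """Raise ValueError (an assumed exception type) on invalid settings."""
    # Any key with the SCRACHY prefix must be one the backend knows about.
    unknown = sorted(
        key
        for key in settings
        if key.startswith("SCRACHY") and key not in KNOWN_SETTINGS
    )
    if unknown:
        raise ValueError(f"Unknown Scrachy settings: {unknown}")

    # If present, the retrieval method must be one of the documented options.
    if "SCRACHY_RESPONSE_RETRIEVAL_METHOD" in settings:
        method = settings["SCRACHY_RESPONSE_RETRIEVAL_METHOD"]
        if method not in VALID_METHODS:
            raise ValueError(f"Invalid retrieval method: {method!r}")


# Settings without the SCRACHY prefix are ignored by this check.
validate({"SCRACHY_RESPONSE_RETRIEVAL_METHOD": "full", "BOT_NAME": "demo"})
```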