scrachy.middleware.httpcache.AlchemyCacheStorage

class scrachy.middleware.httpcache.AlchemyCacheStorage(settings: Settings)[source]

Bases: object

This class implements a scrapy cache storage backend that uses a relational database to store the cached documents.

Parameters:: settings – The Scrapy project settings.
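As a rough illustration, a project could select this backend from its settings.py along the following lines. This is a sketch that assumes Scrapy's standard HTTPCACHE_ENABLED, HTTPCACHE_STORAGE and HTTPCACHE_EXPIRATION_SECS settings; the class path is the one given in the heading above, and the expiration value is an arbitrary example. Database connection settings are left out because their exact names are not listed on this page.

```python
# settings.py (sketch, not a complete configuration)

# Turn on Scrapy's HTTP cache middleware.
HTTPCACHE_ENABLED = True

# Use the database-backed storage class documented on this page.
HTTPCACHE_STORAGE = "scrachy.middleware.httpcache.AlchemyCacheStorage"

# Standard Scrapy setting surfaced by the expiration_secs property.
# One hour is an arbitrary example value.
HTTPCACHE_EXPIRATION_SECS = 3600
```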

__init__(settings: Settings)[source]

This class implements a scrapy cache storage backend that uses a relational database to store the cached documents.

Parameters:: settings – The Scrapy project settings.

Methods

`__init__`(settings)	This class implements a scrapy cache storage backend that uses a relational database to store the cached documents.
`clear_cache`()
`close_spider`([spider])	Dispose of the SqlAlchemy Engine.
`dump_cache`()	Dump the contents of the cache.
`get`(name[, default])	Get a `Scrachy` setting without having to prefix the name with `SCRACHY_`.
`open_spider`(spider[, engine])	Connect to the database, validate the settings and set up the database tables if necessary.
`retrieve_response`(spider, request)	Retrieves an item from the cache if it exists, otherwise this returns `None` to signal downstream processes to continue retrieving the page normally.
`store_response`(spider, request, response)	Stores the response in the cache.
`validate_settings`()	This makes sure that any setting starting with the prefix `SCRACHY` is known to the storage backend.

Attributes

`activation_delay`
`bs4_parser`	The parser to use for parsing HTML with BeautifulSoup.
`database`	The name of the database to connect to.
`default_encoding`
`dialect`	The dialect used for connecting to the database server as specified in the project settings.
`driver`	The name of the driver to use with the database or `None` to use the default driver provided by SqlAlchemy.
`engine_connect_args`
`expiration_secs`	The value of the scrapy HTTPCACHE_EXPIRATION_SECS setting.
`fingerprinter_hasher_import_path`
`fingerprinter_implementation`
`fingerprinter_import_path`
`host`	Returns the host name specified in the project settings.
`is_scrachy_fingerprinter`
`is_scrapy_fingerprinter`
`port`	The port used to connect to the database as specified in the project settings.
`response_retrieval_method`	The name of the response retrieval method.
`save_history`
`schema`	The name of the schema tables will be stored in.

property bs4_parser: str

The parser to use for parsing HTML with BeautifulSoup.

Returns:

close_spider(spider: Spider | None = None)[source]

Dispose of the SqlAlchemy Engine.

Parameters:: spider – The Scrapy spider

property database: str | None

The name of the database to connect to. This should only be None when using an in-memory SQLite database (which mostly defeats the purpose of a cache storage engine).

Returns:: The name of the database.

property dialect: str

The dialect used for connecting to the database server as specified in the project settings. A dialect is always required and should never be None. For supported dialects and drivers see the SQLAlchemy website.

Returns:: The dialect name.

property driver: str | None

The name of the driver to use with the database or None to use the default driver provided by SqlAlchemy.

Returns:: The driver name.
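The dialect, driver, database, host and port properties correspond to the parts of a SQLAlchemy-style connection URL (`dialect+driver://host:port/database`). The sketch below shows one way those parts fit together. The `build_url` helper is hypothetical and is not part of Scrachy; it ignores credentials and URL escaping, and how the backend actually builds its URL is not described here.

```python
from typing import Optional


def build_url(
    dialect: str,
    driver: Optional[str] = None,
    host: Optional[str] = None,
    port: Optional[int] = None,
    database: Optional[str] = None,
) -> str:
    """Hypothetical helper: assemble a SQLAlchemy-style URL string."""
    # The driver is optional; without it SQLAlchemy picks its default.
    scheme = f"{dialect}+{driver}" if driver else dialect
    netloc = host or ""
    if port is not None:
        netloc = f"{netloc}:{port}"
    # A missing database leaves the path empty (e.g. in-memory SQLite).
    path = f"/{database}" if database else ""
    return f"{scheme}://{netloc}{path}"


print(build_url("sqlite"))              # sqlite://
print(build_url("sqlite", database="cache.db"))  # sqlite:///cache.db
print(build_url("postgresql", "psycopg2", "localhost", 5432, "cache"))
```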

dump_cache() → list[Response][source]

Dump the contents of the cache. This is not recommended except for debugging.

Returns:: A list of SQLAlchemy result objects that contains all the items in the cache.

property expiration_secs: int

The value of the scrapy HTTPCACHE_EXPIRATION_SECS setting.

Returns:: The number of seconds before the cached item becomes stale. Stale items will be re-downloaded and processed through the normal pipeline regardless of whether they are in the cache.
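The staleness rule can be sketched as a simple age comparison. The `is_stale` function and its `stored_at` argument are hypothetical names used only for illustration. Treating a non-positive value as "never expire" follows stock Scrapy's convention for HTTPCACHE_EXPIRATION_SECS; whether this backend does the same is an assumption.

```python
import time
from typing import Optional


def is_stale(stored_at: float, expiration_secs: int,
             now: Optional[float] = None) -> bool:
    """Hypothetical check: is a cached item older than expiration_secs?"""
    # Assumption: like stock Scrapy, 0 (or less) disables expiration.
    if expiration_secs <= 0:
        return False
    now = time.time() if now is None else now
    return (now - stored_at) > expiration_secs


# An item stored 61 seconds ago with a 60 second limit is stale.
print(is_stale(stored_at=0.0, expiration_secs=60, now=61.0))
```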

get(name: str, default: Any | None = None) → Any | None[source]

Get a Scrachy setting without having to prefix the name with SCRACHY_.

Returns:: The value of the setting, or default (None unless specified) if it is not set.
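The idea of a prefix-free lookup can be sketched with a plain dictionary standing in for the Scrapy settings object. The `get_setting` function is a hypothetical stand-in, not the library's implementation; only the `SCRACHY_` prefix and the `SCRACHY_RESPONSE_RETRIEVAL_METHOD` name come from this page.

```python
from typing import Any, Mapping, Optional

PREFIX = "SCRACHY_"


def get_setting(settings: Mapping[str, Any], name: str,
                default: Optional[Any] = None) -> Any:
    """Hypothetical stand-in: look up SCRACHY_<name> in a settings mapping."""
    return settings.get(PREFIX + name, default)


# A dict stands in for the Scrapy Settings object here.
settings = {"SCRACHY_RESPONSE_RETRIEVAL_METHOD": "standard"}
print(get_setting(settings, "RESPONSE_RETRIEVAL_METHOD"))
print(get_setting(settings, "NOT_SET", default="fallback"))
```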

property host: str | None

Returns the host name specified in the project settings.

Returns:: The host name.

open_spider(spider: Spider, engine: Engine | None = None)[source]

Connect to the database, validate the settings and set up the database tables if necessary.

Parameters:

spider – The Scrapy spider.
engine – Use this engine instead of creating a new one.

property port: int | None

The port used to connect to the database as specified in the project settings. For sqlite this should return None. None also represents the default port for other database dialects. An error is raised if the port specified in the settings cannot be cast to an integer.

Returns:: The port.
Raises:: ValueError – If the port is not None and cannot be cast to an int.
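The documented behavior (None passes through, anything else must be castable to int) might look roughly like this. `parse_port` is a hypothetical helper written for illustration and is not taken from the library.

```python
from typing import Any, Optional


def parse_port(raw: Any) -> Optional[int]:
    """Hypothetical helper mirroring the documented port behavior."""
    # None means "no port" (sqlite) or "use the dialect's default port".
    if raw is None:
        return None
    try:
        return int(raw)
    except (TypeError, ValueError) as exc:
        # Normalize any cast failure to ValueError, as documented.
        raise ValueError(f"port {raw!r} cannot be cast to an int") from exc


print(parse_port(None))
print(parse_port("5432"))
```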

property response_retrieval_method: Literal['minimal', 'standard', 'full']

The name of the response retrieval method.

This determines how much information to retrieve in the response.

minimal: This returns the minimal amount of information and should be the fastest because it does not require any joins. However, it will return null values for the response status and headers. Use this method if you don’t need these or the more detailed information.
standard: This returns the standard information a scrapy.http.HtmlResponse does.
full: This returns a scrachy.http.CachedResponse, which contains all the information available for an item in the cache.

Returns:: The type of response to retrieve.
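A small sketch of how the three allowed values could be declared and checked before use. The `RetrievalMethod` alias and `check_method` function are hypothetical names; only the three literal values and the `SCRACHY_RESPONSE_RETRIEVAL_METHOD` setting name come from this page.

```python
from typing import Literal, get_args

# The three values listed above.
RetrievalMethod = Literal["minimal", "standard", "full"]


def check_method(value: str) -> str:
    """Hypothetical validator: reject anything outside the allowed values."""
    allowed = get_args(RetrievalMethod)
    if value not in allowed:
        raise ValueError(f"{value!r} is not one of {allowed}")
    return value


# In settings.py this would be a plain assignment.
SCRACHY_RESPONSE_RETRIEVAL_METHOD = check_method("minimal")
```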

retrieve_response(spider: Spider, request: Request) → CachedTextResponse | CachedXmlResponse | CachedHtmlResponse | None[source]

Retrieves an item from the cache if it exists, otherwise this returns None to signal downstream processes to continue retrieving the page normally. Depending on the value of the SCRACHY_RESPONSE_RETRIEVAL_METHOD setting more or less information may be returned in the response.

Parameters:

spider – The Scrapy Spider requesting the data.
request – The request describing what information to retrieve.

Returns:

If the page is in the cache then this will return a cached response (for example a CachedHtmlResponse), otherwise it will return None.

property schema: str | None

The name of the schema tables will be stored in.

Returns:: The name of the schema.

store_response(spider: Spider, request: Request, response: Response)[source]

Stores the response in the cache.

Parameters:

spider – The Scrapy Spider issuing the request.
request – The request describing what data is desired.
response – The response to be stored in the cache as created by Scrapy’s standard downloading process.

validate_settings()[source]

This makes sure that any setting starting with the prefix SCRACHY is known to the storage backend.

It performs some minor validation like checking to make sure the port is an integer and a host name is specified unless the dialect is sqlite. It is still primarily up to the user to ensure the database connection properties are valid for the type of database being used.

Raises:

InvalidSettingError –

If any of the following holds:

there are unknown Scrachy settings.
there are invalid database settings.
a setting is given an option that is not valid.
the hash algorithm specified in the project settings for creating the request fingerprint differs from the one already used for this cache region.
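The first of these checks, rejecting unknown settings that carry the SCRACHY prefix, can be sketched as below. Everything here is illustrative: `KNOWN_SETTINGS` is a deliberately incomplete, hypothetical set, the `InvalidSettingError` class is a local stand-in for the library's exception of the same name, and `SCRACHY_TYPO_SETTING` is a made-up key.

```python
from typing import Any, List, Mapping


class InvalidSettingError(Exception):
    """Local stand-in for the library's exception of the same name."""


# Hypothetical and incomplete; the real list of known settings is not shown here.
KNOWN_SETTINGS = {"SCRACHY_RESPONSE_RETRIEVAL_METHOD"}


def find_unknown(settings: Mapping[str, Any]) -> List[str]:
    """Return prefixed setting names that are not in the known set."""
    return sorted(
        name for name in settings
        if name.startswith("SCRACHY") and name not in KNOWN_SETTINGS
    )


def validate(settings: Mapping[str, Any]) -> None:
    """Raise if any prefixed setting is unrecognized."""
    unknown = find_unknown(settings)
    if unknown:
        raise InvalidSettingError(f"unknown settings: {', '.join(unknown)}")


# Unprefixed settings are ignored; the made-up prefixed key is reported.
print(find_unknown({"HTTPCACHE_ENABLED": True, "SCRACHY_TYPO_SETTING": 1}))
```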