scrachy.middleware.httpcache.AlchemyCacheStorage

class scrachy.middleware.httpcache.AlchemyCacheStorage(settings: Settings)[source]

Bases: object

This class implements a scrapy cache storage backend that uses a relational database to store the cached documents.

Parameters:

settings – The Scrapy project settings.
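A minimal sketch of a settings.py fragment that selects this backend, assuming the standard Scrapy HTTPCACHE_* settings. The values are illustrative, and the database connection settings are left out because their exact names are not given in this section.

```python
# settings.py (sketch): enable Scrapy's HTTP cache and point it at this backend.
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = "scrachy.middleware.httpcache.AlchemyCacheStorage"

# Cached items older than this many seconds are treated as stale (illustrative value).
HTTPCACHE_EXPIRATION_SECS = 3600

# One of 'minimal', 'standard' or 'full'; see response_retrieval_method.
SCRACHY_RESPONSE_RETRIEVAL_METHOD = "standard"
```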

__init__(settings: Settings)[source]

This class implements a scrapy cache storage backend that uses a relational database to store the cached documents.

Parameters:

settings – The Scrapy project settings.

Methods

__init__(settings)

This class implements a scrapy cache storage backend that uses a relational database to store the cached documents.

clear_cache()

close_spider([spider])

Dispose of the SQLAlchemy Engine.

dump_cache()

Dump the contents of the cache.

get(name[, default])

Get a Scrachy setting without having to prefix the name with SCRACHY_.

open_spider(spider[, engine])

Connect to the database, validate the settings and set up the database tables if necessary.

retrieve_response(spider, request)

Retrieves an item from the cache if it exists, otherwise this returns None to signal downstream processes to continue retrieving the page normally.

store_response(spider, request, response)

Stores the response in the cache.

validate_settings()

This makes sure that any setting starting with the prefix SCRACHY is known to the storage backend.

Attributes

activation_delay

bs4_parser

The parser to use for parsing HTML with BeautifulSoup.

database

The name of the database to connect to.

default_encoding

dialect

The dialect used for connecting to the database server as specified in the project settings.

driver

The name of the driver to use with the database or None to use the default driver provided by SQLAlchemy.

engine_connect_args

expiration_secs

The value of the scrapy HTTPCACHE_EXPIRATION_SECS setting.

fingerprinter_hasher_import_path

fingerprinter_implementation

fingerprinter_import_path

host

Returns the host name specified in the project settings.

is_scrachy_fingerprinter

is_scrapy_fingerprinter

port

The port used to connect to the database as specified in the project settings.

response_retrieval_method

The name of the response retrieval method.

save_history

schema

The name of the schema tables will be stored in.

property bs4_parser: str

The parser to use for parsing HTML with BeautifulSoup.

Returns:

The name of the parser.

close_spider(spider: Spider | None = None)[source]

Dispose of the SQLAlchemy Engine.

Parameters:

spider – The Scrapy spider

property database: str | None

The name of the database to connect to. This should be None only when using an in-memory SQLite database (which mostly defeats the purpose of a cache storage engine).

Returns:

The name of the database.

property dialect: str

The dialect used for connecting to the database server as specified in the project settings. A dialect is always required and should never be None. For supported dialects and drivers see the SQLAlchemy website.

Returns:

The dialect name.

property driver: str | None

The name of the driver to use with the database or None to use the default driver provided by SQLAlchemy.

Returns:

The driver name.
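To illustrate how the dialect, driver, host, port and database properties fit together, here is a rough sketch that assembles a connection string in SQLAlchemy's documented dialect+driver://host:port/database form. The helper build_database_url is hypothetical and is not part of Scrachy. Credentials are omitted for brevity, and the backend itself may build its URL differently.

```python
from typing import Optional


def build_database_url(
    dialect: str,
    driver: Optional[str] = None,
    host: Optional[str] = None,
    port: Optional[int] = None,
    database: Optional[str] = None,
) -> str:
    """Hypothetical helper: assemble a SQLAlchemy-style URL string."""
    if not dialect:
        raise ValueError("A dialect is always required")

    # "dialect+driver" when a driver is given, otherwise just "dialect".
    scheme = f"{dialect}+{driver}" if driver else dialect

    # The network location stays empty for file-based or in-memory sqlite.
    netloc = host or ""
    if host and port is not None:
        netloc = f"{host}:{port}"

    url = f"{scheme}://{netloc}"
    if database is not None:
        url += f"/{database}"
    return url


print(build_database_url("sqlite"))                        # sqlite://
print(build_database_url("sqlite", database="cache.db"))   # sqlite:///cache.db
print(build_database_url("postgresql", "psycopg2", "localhost", 5432, "scrachy"))
# postgresql+psycopg2://localhost:5432/scrachy
```

A database of None with the sqlite dialect yields the in-memory URL, which matches the note on the database property above.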

dump_cache() → list[Response][source]

Dump the contents of the cache. This is not recommended except for debugging.

Returns:

A list of SQLAlchemy result objects that contains all the items in the cache.

property expiration_secs: int

The value of the scrapy HTTPCACHE_EXPIRATION_SECS setting.

Returns:

The number of seconds before the cached item becomes stale. Stale items will be re-downloaded and processed through the normal pipeline regardless of whether they are in the cache.
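The staleness rule can be sketched as a simple age comparison. The function is_stale below is hypothetical. It also assumes, as Scrapy's built-in storage backends do, that a non-positive expiration value means items never expire; whether this backend treats zero the same way is not stated here.

```python
import time
from typing import Optional


def is_stale(stored_at: float, expiration_secs: int, now: Optional[float] = None) -> bool:
    """Return True when an item stored at `stored_at` (epoch seconds) is too old."""
    now = time.time() if now is None else now
    # Assumption: expiration_secs <= 0 disables expiration entirely.
    return 0 < expiration_secs < now - stored_at


# An item stored two minutes ago with a one-minute lifetime is stale.
print(is_stale(stored_at=0.0, expiration_secs=60, now=120.0))    # True
# The same item stored twenty seconds ago is still fresh.
print(is_stale(stored_at=100.0, expiration_secs=60, now=120.0))  # False
```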

get(name: str, default: Any | None = None) → Any | None[source]

Get a Scrachy setting without having to prefix the name with SCRACHY_.

Returns:

The value of the setting, or default (None unless specified) if it is not set.
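The prefix handling can be pictured with a plain mapping standing in for the Scrapy settings object. This is only a sketch: get_scrachy_setting is a hypothetical free function, and falling back to default for a missing key is an assumption based on the signature above.

```python
from typing import Any, Mapping, Optional

PREFIX = "SCRACHY_"


def get_scrachy_setting(
    settings: Mapping[str, Any], name: str, default: Optional[Any] = None
) -> Optional[Any]:
    """Look up SCRACHY_<name> so callers can pass the short name only."""
    return settings.get(f"{PREFIX}{name}", default)


settings = {"SCRACHY_RESPONSE_RETRIEVAL_METHOD": "full"}
print(get_scrachy_setting(settings, "RESPONSE_RETRIEVAL_METHOD"))  # full
print(get_scrachy_setting(settings, "MISSING"))                    # None
```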

property host: str | None

Returns the host name specified in the project settings.

Returns:

The host name.

open_spider(spider: Spider, engine: Engine | None = None)[source]

Connect to the database, validate the settings and set up the database tables if necessary.

Parameters:
  • spider – The Scrapy spider.

  • engine – Use this engine instead of creating a new one.

property port: int | None

The port used to connect to the database as specified in the project settings. For sqlite this should return None. None also represents the default port for other database dialects. An error is raised if the port specified in the settings cannot be cast to an integer.

Returns:

The port.

Raises:

ValueError – If the port is not None and cannot be cast to an int.
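The conversion described above amounts to the following sketch. The helper parse_port is hypothetical and only mirrors the documented behaviour: None passes through unchanged, and anything else must be castable to an int.

```python
from typing import Any, Optional


def parse_port(raw: Any) -> Optional[int]:
    """Return None unchanged, otherwise cast to int or raise ValueError."""
    if raw is None:
        return None
    try:
        return int(raw)
    except (TypeError, ValueError) as exc:
        # Normalize both failure modes to ValueError, as the property documents.
        raise ValueError(f"Port {raw!r} cannot be cast to an int") from exc


print(parse_port(None))    # None
print(parse_port("5432"))  # 5432
```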

property response_retrieval_method: Literal['minimal', 'standard', 'full']

The name of the response retrieval method.

This determines how much information to retrieve in the response.

minimal

This returns the minimal amount of information and should be the fastest because it does not require any joins. However, it will return null values for the response status and headers. Use this method if you don’t need these or the more detailed information.

standard

This returns the standard information a scrapy.http.HtmlResponse does.

full

This returns a scrachy.http.CachedResponse, which contains all the information available for an item in the cache.

Returns:

The type of response to retrieve.
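One way to guard against a mistyped value is to check it against the Literal type from the signature. This is a sketch using only the standard library. The names RetrievalMethod and check_retrieval_method are hypothetical, and the backend's own validation may differ.

```python
from typing import Literal, get_args

RetrievalMethod = Literal["minimal", "standard", "full"]


def check_retrieval_method(value: str) -> str:
    """Return the value if it is one of the three documented methods."""
    allowed = get_args(RetrievalMethod)  # ('minimal', 'standard', 'full')
    if value not in allowed:
        raise ValueError(f"{value!r} is not one of {allowed}")
    return value


print(check_retrieval_method("standard"))  # standard
```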

retrieve_response(spider: Spider, request: Request) → CachedTextResponse | CachedXmlResponse | CachedHtmlResponse | None[source]

Retrieves an item from the cache if it exists, otherwise this returns None to signal downstream processes to continue retrieving the page normally. Depending on the value of the SCRACHY_RESPONSE_RETRIEVAL_METHOD setting more or less information may be returned in the response.

Parameters:
  • spider – The Scrapy Spider requesting the data.

  • request – The request describing what information to retrieve.

Returns:

If the page is in the cache then this will return a cached response (a CachedTextResponse, CachedXmlResponse or CachedHtmlResponse), otherwise it will return None.

property schema: str | None

The name of the schema tables will be stored in.

Returns:

The name of the schema.

store_response(spider: Spider, request: Request, response: Response)[source]

Stores the response in the cache.

Parameters:
  • spider – The Scrapy Spider issuing the request.

  • request – The request describing what data is desired.

  • response – The response to be stored in the cache as created by Scrapy’s standard downloading process.
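The retrieve/store contract can be illustrated with a toy in-memory stand-in that uses the same method names. It is not the Scrachy implementation: InMemoryCacheStorage is hypothetical, keeps items in a dict keyed by a plain URL string instead of a request fingerprint, and ignores expiration.

```python
from typing import Any, Dict, Optional


class InMemoryCacheStorage:
    """Toy stand-in mirroring the storage method names (no database involved)."""

    def __init__(self) -> None:
        self._items: Dict[str, Any] = {}

    def open_spider(self, spider: Any) -> None:
        # A real backend would connect to the database here.
        self._items.clear()

    def close_spider(self, spider: Any = None) -> None:
        # A real backend would release its database resources here.
        self._items.clear()

    def retrieve_response(self, spider: Any, request: str) -> Optional[Any]:
        # None signals a cache miss, so the page is downloaded normally.
        return self._items.get(request)

    def store_response(self, spider: Any, request: str, response: Any) -> None:
        self._items[request] = response


storage = InMemoryCacheStorage()
storage.open_spider(spider=None)
url = "https://example.com/"
print(storage.retrieve_response(None, url))  # None (miss)
storage.store_response(None, url, "<html>cached</html>")
print(storage.retrieve_response(None, url))  # <html>cached</html>
storage.close_spider()
```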

validate_settings()[source]

This makes sure that any setting starting with the prefix SCRACHY is known to the storage backend.

It performs some minor validation like checking to make sure the port is an integer and a host name is specified unless the dialect is sqlite. It is still primarily up to the user to ensure the database connection properties are valid for the type of database being used.

Raises:

InvalidSettingError

If any of the following is true:

  • there are unknown Scrachy settings.

  • the database settings are invalid.

  • a setting is given an option that is not valid.

  • the hash algorithm specified in the project settings to create the request fingerprint differs from the one already used for this cache region.
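The unknown-setting check can be sketched as a prefix scan over the setting names. Everything here is illustrative: InvalidSettingError is redefined locally as a stand-in for the exception named above, find_unknown_settings and validate are hypothetical helpers, and SCRACHY_TYPO is a made-up key.

```python
from typing import Any, Iterable, List, Mapping


class InvalidSettingError(Exception):
    """Local stand-in for the exception named in this section."""


def find_unknown_settings(
    settings: Mapping[str, Any], known: Iterable[str], prefix: str = "SCRACHY"
) -> List[str]:
    """Return prefixed setting names that are not in the known set."""
    known_set = set(known)
    return sorted(k for k in settings if k.startswith(prefix) and k not in known_set)


def validate(settings: Mapping[str, Any], known: Iterable[str]) -> None:
    """Raise InvalidSettingError if any prefixed setting is unknown."""
    unknown = find_unknown_settings(settings, known)
    if unknown:
        raise InvalidSettingError(f"Unknown settings: {', '.join(unknown)}")


known = ["SCRACHY_RESPONSE_RETRIEVAL_METHOD"]
settings = {
    "SCRACHY_RESPONSE_RETRIEVAL_METHOD": "full",
    "SCRACHY_TYPO": 1,           # made-up key that should be flagged
    "HTTPCACHE_ENABLED": True,   # no SCRACHY prefix, so it is ignored
}
print(find_unknown_settings(settings, known))  # ['SCRACHY_TYPO']
```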