Features

The two main features of Scrachy are a cache storage backend that writes to a relational database using SqlAlchemy and a middleware that allows downloading requests using Selenium. It also provides a number of other related features, which are described below.

See the API documentation for a complete list of available settings.

Cache Storage

SqlAlchemy Backend

AlchemyCacheStorage is a cache storage backend that stores responses in a database via SqlAlchemy. It behaves similarly to DbmCacheStorage, with some notable differences.

  • It stores the data as text instead of bytes, which limits its applicability to subclasses of TextResponse.

  • It’s not clear why the Scrapy backends perform their own expiration management separate from the configured cache policy, but they do, and so does Scrachy. However, Scrachy uses a more sophisticated process for determining whether a response is stale, which is described in the expiration management section.

  • You can optionally extract the textual content from the page, as described in the content extraction section.

  • It is possible to save the full scrape history.

The middleware returns a subclass of TextResponse that includes a CachedResponseMixin. The mixin adds the following attributes, which may or may not be set depending on the settings (an example of reading them follows the list):

scrape_timestamp

The most recent time the page was scraped.

extracted_text

The text extracted from page using a ContentExtractor.

body_length

The number of bytes in the response body.

extracted_text_length

The number of bytes in the extracted text.

scrape_history

A list of scrachy.db.models.ScrapeHistory objects containing the response body from each time the page was scraped.
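
For example, a spider callback might read these attributes from a cached response (a sketch; which attributes are populated depends on your settings, and they only exist on responses served from the cache):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # These attributes are only present on responses served from the cache.
        if hasattr(response, 'scrape_timestamp'):
            self.logger.info('Last scraped at %s', response.scrape_timestamp)
            if getattr(response, 'extracted_text', None) is not None:
                yield {'url': response.url, 'text': response.extracted_text}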

Activate the storage backend as follows:

HTTPCACHE_STORAGE = 'scrachy.middleware.httpcache.AlchemyCacheStorage'
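
A minimal settings.py sketch for a file-backed sqlite cache might look like the following (the database path is illustrative; HTTPCACHE_ENABLED is the standard Scrapy setting that turns the cache middleware on):

# Enable Scrapy's HTTP cache and point it at the SQLAlchemy backend.
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrachy.middleware.httpcache.AlchemyCacheStorage'

# Store the cache in a local sqlite file (the path is illustrative).
SCRACHY_DB_DIALECT = 'sqlite'
SCRACHY_DB_DATABASE = '.scrapy/scrachy_cache.db'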

Settings

SCRACHY_CACHE_EXPIRATION_SCHEDULE_PATTERNS: Optional[list[tuple[PatternLike, Schedulable]]] = []
"""
Expire any response whose URL matches the given pattern according to the 
corresponding schedule.
"""

# Encoding ####################################################################
SCRACHY_CACHE_DEFAULT_ENCODING: str = 'utf-8'
"""
Sometimes it is not possible to determine the encoding of a page because it was 
not set properly at the source. But this also seems to happen for compressed 
pages which have an encoding based on the compression algorithm (e.g., gzip).
However, Scrapy will raise an exception when constructing a 
:class:`scrapy.http.TextResponse` if it can't determine the encoding.
To avoid these issues you can specify a default encoding to use when Scrapy 
fails to automatically identify a compatible one.
"""

# Retrieval ###################################################################
SCRACHY_CACHE_RESPONSE_RETRIEVAL_METHOD: RetrievalMethod = 'standard'
"""
The cache stores quite a bit of information about each response. Not all of 
this information is useful for a given scraping task, and some of it may only 
be needed for post-scraping analysis. To help avoid loading unnecessary 
information, you can select one of three retrieval methods that vary in the 
amount of data they retrieve. All three methods return a subclass of 
:class:`~scrapy.http.TextResponse`, but some of its properties may be ``null``.
"""

# Database Settings ###########################################################
SCRACHY_DB_DIALECT: str = 'sqlite'
"""
This specifies the database dialect to use and must be supported by
`SQLAlchemy <https://docs.sqlalchemy.org/en/20/dialects/>`_
"""

SCRACHY_DB_DRIVER: Optional[str] = None
"""
This specifies the name of the driver used to connect to the database. It must 
be a name recognized by 
`SQLAlchemy <https://docs.sqlalchemy.org/en/13/core/engines.html#supported-databases>`_ 
or ``None`` to use the default driver. Note, the selected driver (including the 
default) must be installed separately prior to using it.
"""

SCRACHY_DB_HOST: Optional[str] = None
"""
The hostname (or ip address) where the database server is running. This should 
be ``None`` for sqlite databases. For other databases, the hostname is assumed 
to be ``localhost`` if this setting is ``None``.
"""

SCRACHY_DB_PORT: Optional[int] = None
"""
The port number the database server is listening on. This should be ``None`` 
for sqlite databases. For other databases, the default port for the database 
server is used when this setting is ``None``.
"""

SCRACHY_DB_DATABASE: Optional[str] = None
"""
For sqlite this is the path to the database file and it will be created if it 
does not already exist. For other dialects this is the name of the database 
where the cached items will be stored. The database must exist prior to running 
any crawlers, but the backend will create all necessary tables. This requires 
that the database user have sufficient privileges to do so. If the value is 
``None`` for the sqlite dialect, an in memory database will be used (which is 
probably not what you want). For all other dialects ``None`` is not permitted.
"""

SCRACHY_DB_SCHEMA: Optional[str] = None
"""
This will set the schema for databases that support schemas (e.g., PostgreSQL).
"""

SCRACHY_DB_USERNAME: Optional[str] = None
"""
The username used to connect to the database.
"""

SCRACHY_DB_PASSWORD: Optional[str] = None
"""
The password (if any) used to connect to the database. It is not recommended to 
store this directly in the settings file. Instead, it should be loaded 
dynamically, e.g., using environment variables or ``python-dotenv``.
"""

SCRACHY_DB_CONNECT_ARGS: dict[str, Any] = dict()
"""
Any other arguments that should be passed to :func:`sqla.create_engine`. For 
example, you could use the following ``dict`` to connect to postgresql using 
ssl:

.. code-block::

    {
        "sslrootcert": "path.to.rootcert",
        "sslcert": "path.to.clientcert",
        "sslkey": "path.to.clientkey",
        "sslmode": "verify-full"
    }
"""

Expiration Management

In Scrapy, a cached response becomes stale if it has been in the cache for longer than HTTPCACHE_EXPIRATION_SECS seconds. Scrachy adds additional functionality to control when items are considered stale. The three primary ways an item can become stale are:

expiration

A response has been in the cache for too long.

activation

A response has not been in the cache long enough.

schedule

A response expires at a specific time or date using cron semantics.

For each of these three methods, a response can be marked stale via a global setting or by a pattern matching its URL. A cached response is considered fresh (i.e., not stale) only if all of the following hold:

  1. It has been in the cache longer than its activation period.

  2. It has been in the cache less than its expiration period.

  3. The scrape time is less than the expiration date derived from the schedule.

Settings

BeautifulSoupParser = Literal['html.parser', 'lxml', 'lxml-xml', 'html5lib']
RetrievalMethod = Literal['minimal', 'standard', 'full']
PatternLike = str | re.Pattern
Schedulable = str | Cron

# Expiration ##################################################################
SCRACHY_CACHE_ACTIVATION_SECS: float = 0
"""
Consider any page that is in the cache stale (do not retrieve it) if it has
not been in the cache for at least this many seconds. This might be used 
for sites that initially post unreliable or partial data then update it 
with better data after some period of time but then rarely change it again.
"""

SCRACHY_CACHE_ACTIVATION_SECS_PATTERNS: list[tuple[PatternLike, float]] = []
"""
A list of tuples consisting of a pattern and a delay time in seconds. The
pattern should either be a :class:`re.Pattern` or a string that can be
compiled to one. Any url that matches this pattern will use the value in
the second element of the tuple as its activation delay.

See: :const:`SCRACHY_CACHE_ACTIVATION_SECS`.
"""

SCRACHY_CACHE_EXPIRATION_SECS_PATTERNS: list[tuple[PatternLike, float]] = []
"""
Similar to :const:`SCRACHY_CACHE_ACTIVATION_SECS_PATTERNS`, but overrides
``HTTPCACHE_EXPIRATION_SECS`` for matching urls.
"""

SCRACHY_CACHE_EXPIRATION_SCHEDULE: Optional[Schedulable] = None
"""
Expire all cached responses that do not match a 
:const:`schedule pattern <SCRACHY_CACHE_EXPIRATION_SCHEDULE_PATTERNS>` 
according to this schedule.
"""

Content Extraction

Scrachy can try to extract the textual content from the page and store it along with the full body using a scrachy.content.ContentExtractor.

Scrachy comes with two content extractors.

scrachy.content.bs4.BeautifulSoupExtractor

This uses Beautiful Soup to extract the text content using a few simple rules to exclude various nodes from the DOM.

scrachy.content.boilerpipe.BoilerpipeExtractor

This uses BoilerPy3 to try to remove boilerplate elements, such as headers, footers and navigation, before extracting the text.

To activate this feature and save the extracted text in the cache, set the SCRACHY_CONTENT_EXTRACTOR setting to a non-None value.
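
For example, a minimal sketch that stores text extracted with Beautiful Soup using the lxml parser:

SCRACHY_CONTENT_EXTRACTOR = 'scrachy.content.bs4.BeautifulSoupExtractor'
SCRACHY_CONTENT_BS4_PARSER = 'lxml'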

Settings

from scrachy.content import ContentExtractor
"""
Whether or not to store the full scrape history for each page (identified by
its fingerprint).
"""

# Content Extraction Settings #################################################
SCRACHY_CONTENT_EXTRACTOR: Optional[str | Type[ContentExtractor]] = None
"""
A class implementing the :class:`~scrachy.content.ContentExtractor` protocol 
or an import path to a class implementing it. Scrachy provides two 
implementations.

    * :class:`scrachy.content.bs4.BeautifulSoupExtractor`
    * :class:`scrachy.content.boilerpipe.BoilerpipeExtractor`
"""

SCRACHY_CONTENT_BS4_PARSER: Optional[BeautifulSoupParser] = 'html.parser'
"""
The 
`parser <https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser>`_
to use for constructing the DOM. It can be one of the following: [html.parser, 
lxml, lxml-xml, html5lib]. By default, it will use ``html.parser``, but lxml 
or html5lib are probably preferred.
"""

Cache Policy

Scrachy provides a BlacklistPolicy. It wraps around any other Scrapy cache policy, as long as that policy accepts a Settings object as its first (and only) constructor parameter.

URLs are never cached if they match any of the patterns specified by the SCRACHY_POLICY_EXCLUDE_URL_PATTERNS setting. Under this policy, an item should be cached if the base policy says it should be cached and its URL does not match any of the specified patterns.

Activate the policy as follows:

HTTPCACHE_POLICY = 'scrachy.middleware.httpcache.BlacklistPolicy'
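
For example, a sketch that wraps Scrapy's built-in RFC 2616 policy and never caches search or login pages (the patterns are illustrative):

HTTPCACHE_POLICY = 'scrachy.middleware.httpcache.BlacklistPolicy'

# Wrap the RFC 2616 policy instead of the default DummyPolicy.
SCRACHY_POLICY_BASE_CLASS = 'scrapy.extensions.httpcache.RFC2616Policy'

# Never cache URLs matching these patterns (illustrative values).
SCRACHY_POLICY_EXCLUDE_URL_PATTERNS = [r'/search\?', r'/login']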

Settings

SCRACHY_POLICY_BASE_CLASS: str | Callable = 'scrapy.extensions.httpcache.DummyPolicy'
"""
The base policy the :class:`~scrachy.middleware.httpcache.BlacklistPolicy` 
will wrap around. The policy can be specified as the full import path to
the class or a class object itself. Either way the class constructor must 
accept a :class:`~scrapy.settings.Settings` object as its first parameter. 
"""

SCRACHY_POLICY_EXCLUDE_URL_PATTERNS: list[str | re.Pattern] = []
"""

Request Fingerprinting

Scrachy includes a more efficient RequestFingerprinter, based on the 2.7 implementation provided by Scrapy 2.11.0. It uses the same algorithm and data to fingerprint a request, but uses msgspec instead of json to serialize the data and allows you to customize which hash function is used. The use of msgspec provides about a 30% improvement in performance while using a different hash algorithm, such as xxhash, can speed up the fingerprinting by over 50x. However, unless you are scraping millions of pages this speedup will have little practical effect.

Activate the fingerprinter as follows:

REQUEST_FINGERPRINTER_CLASS = 'scrachy.utils.request.DynamicHashRequestFingerprinter'

Settings

# Project Modules
from scrachy.utils.hash import Hasher

Response Filtering

Some scraping jobs require periodically scraping a domain looking for new content. In these cases it is a waste of resources to reparse pages that are already in the cache. CachedResponseFilter is a downloader middleware that causes requests whose responses are still fresh in the cache to be ignored. There are three ways to prevent a fresh cached response from being ignored (see the sketch after this list):

  1. Set the request.meta key dont_cache to True.

  2. Set the request.meta key dont_filter to True.

  3. Add a pattern matching the request's URL to the SCRACHY_CACHED_RESPONSE_FILTER_EXCLUSIONS setting.
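
The following sketch shows all three options together (the URLs, pattern, and meta keys follow the descriptions above; the values are illustrative):

import re

import scrapy

# settings.py (option 3): never skip product pages, even when a fresh
# response is in the cache.
SCRACHY_CACHED_RESPONSE_FILTER_EXCLUSIONS = [re.compile(r'/products/')]


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def parse(self, response):
        # Option 1: exclude this request from the cache entirely, so it is
        # always downloaded fresh.
        yield scrapy.Request('https://example.com/latest', meta={'dont_cache': True})
        # Option 2: bypass the filter for this request only.
        yield scrapy.Request('https://example.com/other', meta={'dont_filter': True})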

Activate this middleware as follows:

DOWNLOADER_MIDDLEWARES = {
    # You probably want this early in the pipeline, because there is no point
    # in running the other middlewares if the response is in the cache and we
    # are going to ignore it anyway.
    'scrachy.middleware.filter.FilterCachedResponse': 50,
    ...
}

Settings


# Project Modules
from scrachy.settings.defaults.storage import PatternLike

Selenium

Scrachy also provides support for using Selenium to download requests. It essentially forks scrapy-selenium and adds a few enhancements.

The Scrapy Selenium Guide provides a good overview of the basic usage. The primary difference between the implementations is how scripts are handled. With Scrapy-Selenium you pass javascript directly as a string to the script parameter of a SeleniumRequest. With Scrachy you pass a :class:`~scrachy.http_.ScriptExecutor`, which is any callable that accepts a WebDriver and a Request as parameters. The ScriptExecutor can optionally return a Response, a list[Response] or a dict[str, Response], which, if present, will be made available in the request.meta attribute using the key script_result.
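
For example, a minimal ScriptExecutor might scroll the page so lazily loaded content renders before the final page source is captured (a sketch; it assumes the callable is passed via the script argument of a SeleniumRequest, as in Scrapy-Selenium, and since it returns nothing no script_result is set):

from scrapy.http import Request
from selenium.webdriver.remote.webdriver import WebDriver


def scroll_to_bottom(driver: WebDriver, request: Request) -> None:
    # Scroll to the bottom of the page so lazily loaded content is rendered
    # before the final page source is captured.
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')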

There are two available selenium middlewares.

scrachy.middleware.selenium.SeleniumMiddleware

This class is essentially the same as Scrapy-Selenium with the changes described above.

scrachy.middleware.selenium.AsyncSeleniumMiddleware

This class uses Twisted to spawn multiple processes, creating a pool of WebDrivers that can handle requests concurrently. This can significantly increase throughput, but is potentially less robust.

In theory, any supported WebDriver should work, but Chrome and Firefox are the safest bets. Activate this middleware as follows:

DOWNLOADER_MIDDLEWARES = {
    ...
    'scrachy.middleware.selenium.SeleniumMiddleware': 800,  # or AsyncSeleniumMiddleware
    ...
}

Settings

WebDriverName = Literal['Chrome', 'ChromiumEdge', 'Firefox', 'Safari']


SCRACHY_SELENIUM_WEB_DRIVER: WebDriverName = 'Chrome'
"""
The name of the webdriver to use.
"""

SCRACHY_SELENIUM_WEB_DRIVER_OPTIONS: list[str] = []
"""
Initialize the webdriver with an ``Options`` object populated with these
options.

For a list of options see:
    
    * Chrome: https://peter.sh/experiments/chromium-command-line-switches/
    * Firefox: https://www-archive.mozilla.org/docs/command-line-args
"""

SCRACHY_SELENIUM_WEB_DRIVER_EXTENSIONS: list[str] = []
"""
A list of extensions for the webdriver to load. These should be paths to CRX
files for Chrome or XPI files for Firefox.
"""