Features
The two main features of Scrachy are a cache storage backend that writes responses to a relational database using SQLAlchemy, and a downloader middleware that fetches requests using Selenium. It also provides a number of related features, which are described below.
See the API documentation
for a complete list of available settings.
Cache Storage
SqlAlchemy Backend
AlchemyCacheStorage
is a cache storage that stores the responses in a database via SqlAlchemy.
It behaves similarly to DbmCacheStorage, with some notable differences:
- It stores the data as text instead of bytes, which limits its applicability to subclasses of TextResponse.
- Like the built-in Scrapy backends, it performs its own expiration management in addition to the configured cache policy. However, Scrachy uses a more sophisticated process for determining whether a response is stale, which is described in the expiration management section.
- You can optionally extract the textual content from the page, as described in the content extraction section.
- It can save the full scrape history.
- The middleware returns a subclass of TextResponse with a CachedResponseMixin.
The mixin adds the following attributes (which may or may not be set depending on the settings):
- scrape_timestamp
  The most recent time the page was scraped.
- extracted_text
  The text extracted from the page using a ContentExtractor.
- body_length
  The number of bytes in the response body.
- extracted_text_length
  The number of bytes in the extracted text.
- scrape_history
  A list of scrachy.db.models.ScrapeHistory objects containing the response body from each time the page was scraped.
Activate the middleware as follows:
HTTPCACHE_STORAGE = 'scrachy.middleware.httpcache.AlchemyCacheStorage'
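For example, a minimal settings.py fragment enabling the cache with a local SQLite database might look like this (the database path is an illustrative choice, not a Scrachy default):

```python
# settings.py -- a minimal sketch of enabling the SQLAlchemy cache backend.
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrachy.middleware.httpcache.AlchemyCacheStorage'

# Store the cache in a local SQLite file (created if it does not exist).
SCRACHY_DB_DIALECT = 'sqlite'
SCRACHY_DB_DATABASE = '.scrapy/cache.sqlite'
```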
Settings
SCRACHY_CACHE_EXPIRATION_SCHEDULE_PATTERNS: Optional[list[tuple[PatternLike, Schedulable]]] = []
"""
Expire any response whose URL matches the given pattern according to the
corresponding schedule.
"""
# Encoding ####################################################################
SCRACHY_CACHE_DEFAULT_ENCODING: str = 'utf-8'
"""
Sometimes it is not possible to determine the encoding of a page because it was
not set properly at the source. But this also seems to happen for compressed
pages which have an encoding based on the compression algorithm (e.g., gzip).
However, Scrapy will raise an exception when constructing a
:class:`scrapy.http.TextResponse` if it can't determine the encoding.
To avoid these issues you can specify a default encoding to use when Scrapy
fails to automatically identify a compatible one.
"""
# Retrieval ###################################################################
SCRACHY_CACHE_RESPONSE_RETRIEVAL_METHOD: RetrievalMethod = 'standard'
"""
The cache stores quite a bit of information about each response. Not all of this
information is useful for a given scraping task or might only be used for post
scraping analysis. To help avoid loading unnecessary information you can select
one of three retrieval methods that vary in the amount of data they retrieve.
All three methods return some subclass of
:class:`~scrapy.http.TextResponse` object, but may have ``null`` values for
some of the properties.
"""
# Database Settings ###########################################################
SCRACHY_DB_DIALECT: str = 'sqlite'
"""
This specifies the database dialect to use and must be supported by
`SQLAlchemy <https://docs.sqlalchemy.org/en/20/dialects/>`_
"""
SCRACHY_DB_DRIVER: Optional[str] = None
"""
This specifies the name of the driver used to connect to the database. It must
be a name recognized by
`SQLAlchemy <https://docs.sqlalchemy.org/en/13/core/engines.html#supported-databases>`_
or ``None`` to use the default driver. Note, the selected driver (including the
default) must be installed separately prior to using it.
"""
SCRACHY_DB_HOST: Optional[str] = None
"""
The hostname (or ip address) where the database server is running. This should
be ``None`` for sqlite databases. For other databases, the hostname is assumed
to be ``localhost`` if this setting is ``None``.
"""
SCRACHY_DB_PORT: Optional[int] = None
"""
The port number the database server is listening on. This should be ``None``
for sqlite databases. For other databases, the default port for the database
server is used when this setting is ``None``.
"""
SCRACHY_DB_DATABASE: Optional[str] = None
"""
For sqlite this is the path to the database file and it will be created if it
does not already exist. For other dialects this is the name of the database
where the cached items will be stored. The database must exist prior to running
any crawlers, but the backend will create all necessary tables. This requires
that the database user have sufficient privileges to do so. If the value is
``None`` for the sqlite dialect, an in memory database will be used (which is
probably not what you want). For all other dialects ``None`` is not permitted.
"""
SCRACHY_DB_SCHEMA: Optional[str] = None
"""
This will set the schema for databases that support them (e.g., PostgreSQL).
"""
SCRACHY_DB_USERNAME: Optional[str] = None
"""
The username used to connect to the database.
"""
SCRACHY_DB_PASSWORD: Optional[str] = None
"""
The password (if any) used to connect to the database. It is not recommended to
store this directly in the settings file. Instead, it should be loaded
dynamically, e.g., using environment variables or ``python-dotenv``.
"""
SCRACHY_DB_CONNECT_ARGS: dict[str, Any] = dict()
"""
Any other arguments that should be passed to :func:`sqla.create_engine`. For
example, you could use the following ``dict`` to connect to postgresql using
ssl:
.. code-block:: python

    {
        "sslrootcert": "path.to.rootcert",
        "sslcert": "path.to.clientcert",
        "sslkey": "path.to.clientkey",
        "sslmode": "verify-full",
    }
"""
Expiration Management
In Scrapy, a cached response becomes stale if it has been in the cache longer than HTTPCACHE_EXPIRATION_SECS seconds.
Scrachy adds additional functionality to control when items are considered stale.
The 3 primary ways an item is considered stale are:
- expiration
A response has been in the cache for too long.
- activation
A response has not been in the cache long enough.
- schedule
A response expires at a specific time or date using cron semantics.
For each of these 3 methods a response can be marked stale via a global setting or by a pattern matching its URL. A response is considered fresh (i.e., not stale) only if all of the following hold:
- It has been in the cache longer than its activation period.
- It has been in the cache for less than its expiration period.
- The scrape time is less than the expiration date derived from the schedule.
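The first two rules can be sketched as a simple predicate (an illustrative simplification: the schedule rule is omitted, and the real implementation also resolves per-URL pattern overrides):

```python
from datetime import datetime, timedelta

def is_fresh(scraped_at: datetime,
             now: datetime,
             activation_secs: float,
             expiration_secs: float) -> bool:
    """Illustrative freshness check: a response is fresh only if its age
    falls between the activation and expiration windows."""
    age = (now - scraped_at).total_seconds()
    return activation_secs <= age < expiration_secs

now = datetime(2024, 1, 1, 12, 0, 0)
recent = now - timedelta(seconds=30)      # too new: still inside the activation window
mature = now - timedelta(seconds=600)     # fresh
ancient = now - timedelta(seconds=90000)  # older than one day: expired
```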
Settings
# 3rd Party Modules
BeautifulSoupParser = Literal['html.parser', 'lxml', 'lxml-xml', 'html5lib']
RetrievalMethod = Literal['minimal', 'standard', 'full']
PatternLike = str | re.Pattern
Schedulable = str | Cron
# Expiration ##################################################################
SCRACHY_CACHE_ACTIVATION_SECS: float = 0
"""
Consider any page in the cache stale (i.e., do not retrieve it) if it has
not been in the cache for at least this many seconds. This might be used
for sites that initially post unreliable or partial data, update it with
better data after some period of time, and then rarely change it again.
SCRACHY_CACHE_ACTIVATION_SECS_PATTERNS: list[tuple[PatternLike, float]] = []
"""
A list of tuples consisting of a pattern and a delay time in seconds. The
pattern should either be a :class:`re.Pattern` or a string that can be
compiled to one. Any url that matches this pattern will use the value in
the second element of the tuple as its activation delay.
See: :const:`SCRACHY_CACHE_ACTIVATION_SECS`.
"""
SCRACHY_CACHE_EXPIRATION_SECS_PATTERNS: list[tuple[PatternLike, float]] = []
"""
Similar to :const:`SCRACHY_CACHE_ACTIVATION_SECS_PATTERNS`, but overrides
``HTTPCACHE_EXPIRATION_SECS`` for matching urls.
"""
SCRACHY_CACHE_EXPIRATION_SCHEDULE: Optional[Schedulable] = None
"""
Expire all responses that do not match a
:const:`schedule pattern <SCRACHY_CACHE_EXPIRATION_SCHEDULE_PATTERNS>` in the
cache according to this schedule.
"""
Content Extraction
Scrachy can try to extract the textual content from the page and store it along with the full body using a scrachy.content.ContentExtractor
.
Scrachy comes with two content extractors.
- scrachy.content.bs4.BeautifulSoupExtractor
  This uses Beautiful Soup to extract the text content, using a few simple rules to exclude various nodes from the DOM.
- scrachy.content.boilerpipe.BoilerpipeExtractor
  This uses BoilerPy3 to try to remove boilerplate elements, such as headers, footers, and navigation, before extracting the text.
To activate this feature and save the extracted text in the cache, set the SCRACHY_CONTENT_EXTRACTOR setting to a non-None value.
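To illustrate the idea of rule-based extraction, here is a toy version using only the stdlib html.parser; this is a sketch of the concept, not Scrachy's Beautiful Soup-based implementation:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content while skipping non-content nodes such as
    <script> and <style> -- a toy version of rule-based extraction."""
    SKIP = {'script', 'style', 'noscript'}

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return ' '.join(parser.chunks)

page = '<html><head><script>var x=1;</script></head><body><p>Hello world</p></body></html>'
```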
Settings
from scrachy.content import ContentExtractor
"""
Whether or not to store the full scrape history for each page (identified by
its fingerprint).
"""
# Content Extraction Settings #################################################
SCRACHY_CONTENT_EXTRACTOR: Optional[str | Type[ContentExtractor]] = None
"""
A class implementing the :class:`~scrachy.content.ContentExtractor` protocol
or an import path to a class implementing it. Scrachy provides two
implementations.
* :class:`scrachy.content.bs4.BeautifulSoupExtractor`
* :class:`scrachy.content.boilerpipe.BoilerpipeExtractor`
"""
SCRACHY_CONTENT_BS4_PARSER: Optional[BeautifulSoupParser] = 'html.parser'
"""
The
`parser <https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser>`_
to use for constructing the DOM. It can be one of ``html.parser``, ``lxml``,
``lxml-xml``, or ``html5lib``. By default, it will use ``html.parser``, but
``lxml`` or ``html5lib`` are probably preferable.
"""
Cache Policy
Scrachy provides a BlacklistPolicy.
It wraps around any other Scrapy cache policy, as long as that policy accepts a Settings
object as its first (and only) constructor parameter.
URLs are never cached if they match any of the patterns specified by the SCRACHY_POLICY_EXCLUDE_URL_PATTERNS
setting.
According to this policy, an item should be cached if the base class says it should be cached and the url does not match any of the specified patterns.
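The decision can be sketched as follows (an illustrative simplification; the real policy delegates the first check to the wrapped base class, and the exclusion list here is hypothetical):

```python
import re

# Illustrative exclusion list.
EXCLUDE_URL_PATTERNS = [re.compile(r'/login'), re.compile(r'\.pdf$')]

def should_cache(url: str, base_says_cache: bool) -> bool:
    """Cache only if the wrapped base policy agrees AND no blacklist
    pattern matches the URL."""
    blacklisted = any(p.search(url) for p in EXCLUDE_URL_PATTERNS)
    return base_says_cache and not blacklisted
```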
Activate the policy as follows:
HTTPCACHE_POLICY = 'scrachy.middleware.httpcache.BlacklistPolicy'
Settings
# 3rd Party Modules
# Project Modules
SCRACHY_POLICY_BASE_CLASS: str | Callable = 'scrapy.extensions.httpcache.DummyPolicy'
"""
The base policy the :class:`~scrachy.middleware.httpcache.BlacklistPolicy`
will wrap around. The policy can be specified as the full import path to
the class or a class object itself. Either way the class constructor must
accept a :class:`~scrapy.settings.Settings` object as its first parameter.
"""
SCRACHY_POLICY_EXCLUDE_URL_PATTERNS: list[str | re.Pattern] = []
"""
Request Fingerprinting
Scrachy includes a more efficient RequestFingerprinter, based on the 2.7 implementation provided by Scrapy 2.11.0.
It uses the same algorithm and data to fingerprint a request, but serializes the data with msgspec instead of json and allows you to customize which hash function is used.
Using msgspec provides about a 30% improvement in performance, and using a different hash algorithm, such as xxhash, can speed up fingerprinting by over 50x.
However, unless you are scraping millions of pages, this speedup will have little practical effect.
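The serialize-then-hash pattern can be sketched with the stdlib. This simplified version is not Scrapy's canonical algorithm (which also canonicalizes URLs and can include headers); json and sha1 stand in for the pluggable serializer and hash:

```python
import hashlib
import json

def fingerprint(method: str, url: str, body: bytes = b'') -> str:
    """Serialize the request data deterministically, then hash it.
    Scrachy's fingerprinter follows the same pattern but serializes with
    msgspec and lets you swap in a faster hash, such as xxhash."""
    data = json.dumps(
        {'method': method, 'url': url, 'body': body.hex()},
        sort_keys=True,
    ).encode('utf-8')
    return hashlib.sha1(data).hexdigest()
```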
Activate the fingerprinter as follows:
REQUEST_FINGERPRINTER_CLASS = 'scrachy.utils.request.DynamicHashRequestFingerprinter'
Settings
# 3rd Party Modules
# Project Modules
from scrachy.utils.hash import Hasher
Response Filtering
Some scraping jobs require periodically scraping a domain looking for new content.
In these cases it is a waste of resources to reparse the existing pages.
CachedResponseFilter
is a downloader middleware that will cause requests with a fresh response in the cache to be ignored.
There are 3 ways to prevent a fresh cached response from being ignored:
- Set the request.meta key dont_cache to True.
- Set the request.meta key dont_filter to True.
- Add a regular expression matching its URL to the settings variable SCRACHY_CACHED_RESPONSE_FILTER_EXCLUSIONS.
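The filtering decision can be sketched as follows (an illustrative simplification; the exclusion pattern is hypothetical):

```python
import re

# Illustrative exclusion patterns: always re-fetch pages like these.
FILTER_EXCLUSIONS = [re.compile(r'/live-scores/')]

def should_ignore(url: str, meta: dict, is_fresh_in_cache: bool) -> bool:
    """Ignore a request only when a fresh response is cached and none of
    the three escape hatches apply."""
    if not is_fresh_in_cache:
        return False
    if meta.get('dont_cache') or meta.get('dont_filter'):
        return False
    if any(p.search(url) for p in FILTER_EXCLUSIONS):
        return False
    return True
```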
Activate this middleware as follows:
DOWNLOADER_MIDDLEWARES = {
# You probably want this early in the pipeline, because there's no point
# in the other middleware if it is in the cache and we are going to
# ignore it anyway.
'scrachy.middleware.filter.FilterCachedResponse': 50,
...
}
Settings
# Project Modules
from scrachy.settings.defaults.storage import PatternLike
Selenium
Scrachy also provides support for using Selenium to download requests. It essentially forks scrapy-selenium and adds a few enhancements.
The Scrapy Selenium Guide provides a good overview of the basic usage.
The primary difference between the implementations is how scripts are handled.
With Scrapy-Selenium, you pass JavaScript directly as a string to the script parameter of a SeleniumRequest.
With Scrachy, you pass a ScriptExecutor, which is any callable that accepts a WebDriver and a Request as parameters.
The ScriptExecutor can optionally return a Response, a list[Response], or a dict[str, Response]; if a value is returned, it will be made available in the request.meta attribute under the key script_result.
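For example, a script executor is just a callable. This sketch scrolls to the bottom of the page and returns nothing; because the signature is duck-typed, it can be exercised here with a stub driver instead of a real Selenium WebDriver:

```python
def scroll_to_bottom(driver, request):
    """A minimal ScriptExecutor: any callable taking (driver, request).
    Returning None means nothing is added to request.meta['script_result'].
    """
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    return None

# A stub driver so the executor can be demonstrated without Selenium.
class StubDriver:
    def __init__(self):
        self.executed = []

    def execute_script(self, script, *args):
        self.executed.append(script)
```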
There are two available selenium middlewares.
- scrachy.middleware.selenium.SeleniumMiddleware
  This class is essentially the same as Scrapy-Selenium with the changes described above.
- scrachy.middleware.selenium.AsyncSeleniumMiddleware
  This class uses Twisted to spawn multiple processes, creating a pool of WebDrivers that can handle requests concurrently. This can significantly increase throughput, but is potentially less robust.
In theory, any supported WebDriver should work, but Chrome and Firefox are the safest bets.
Activate this middleware as follows:
DOWNLOADER_MIDDLEWARES = {
...
'scrachy.middleware.selenium.SeleniumMiddleware': 800, # or AsyncSeleniumMiddleware
...
}
Settings
# 3rd Party Modules
# Project Modules
WebDriverName = Literal['Chrome', 'ChromiumEdge', 'Firefox', 'Safari']
SCRACHY_SELENIUM_WEB_DRIVER: WebDriverName = 'Chrome'
"""
The name of the webdriver to use.
"""
SCRACHY_SELENIUM_WEB_DRIVER_OPTIONS: list[str] = []
"""
Initialize the webdriver with an ``Options`` object populated with these
options.
For a list of options see:
* Chrome: https://peter.sh/experiments/chromium-command-line-switches/
* Firefox: https://www-archive.mozilla.org/docs/command-line-args
"""
SCRACHY_SELENIUM_WEB_DRIVER_EXTENSIONS: list[str] = []
"""
A list of extensions for the webdriver to load. These should be paths to CRX
files for Chrome or XPI files for Firefox.
"""