scrachy.content.BaseContentExtractor

class scrachy.content.BaseContentExtractor(settings: Settings)

Bases: ContentExtractor

A content extractor base class that keeps track of the project middleware.

Parameters:

settings – The Scrapy Settings.

__init__(settings: Settings)

A content extractor base class that keeps track of the project middleware.

Parameters:

settings – The Scrapy Settings.

Methods

__init__(settings)

A content extractor base class that keeps track of the project middleware.

get_content(html)

Get the desired textual content from the HTML.

get_content(html: str) → str

Get the desired textual content from the HTML.

Parameters:

html – The textual HTML to process.

Returns:

The desired content (e.g., text with tags removed).
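
Example:

The following is a minimal sketch of how a subclass might implement get_content, not the library's own implementation. To stay runnable without Scrachy installed, it uses a hypothetical stand-in class, StandInBaseContentExtractor, that mirrors only the interface documented above (a constructor taking settings and a get_content(html) method). The attribute name self.settings, the subclass name PlainTextExtractor, and the helper _TextCollector are assumptions made for illustration. In a real project the subclass would presumably derive from scrachy.content.BaseContentExtractor instead.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: gathers text nodes, skipping <script> and <style>."""

    _SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Join the collected fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._parts).split())


class StandInBaseContentExtractor:
    """Stand-in for the documented base class (assumption, not Scrachy code)."""

    def __init__(self, settings):
        # Assumed attribute name; the documentation does not specify it.
        self.settings = settings

    def get_content(self, html: str) -> str:
        raise NotImplementedError


class PlainTextExtractor(StandInBaseContentExtractor):
    """Hypothetical subclass that returns the text with tags removed."""

    def get_content(self, html: str) -> str:
        collector = _TextCollector()
        collector.feed(html)
        collector.close()
        return collector.text()


if __name__ == "__main__":
    # A plain dict stands in for the Scrapy Settings object here.
    extractor = PlainTextExtractor(settings={})
    print(extractor.get_content("<p>Hello <b>world</b></p>"))
```

The sketch uses only the standard library's html.parser so that it has no third-party dependencies. A production extractor would likely rely on a dedicated HTML or boilerplate-removal library.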