scrachy.content.bs4.BeautifulSoupExtractor

class scrachy.content.bs4.BeautifulSoupExtractor(settings: Settings)

Bases: BaseContentExtractor

A ContentExtractor that uses Beautiful Soup to process the HTML.

The SCRACHY_CONTENT_BS4_PARSER setting must be set to a valid parser name.

Parameters: settings – The Scrapy Settings.
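
As a minimal illustration, assuming a standard Scrapy project `settings.py`, the parser could be selected as shown here. The value is one of the parser names that Beautiful Soup itself accepts (`"html.parser"`, `"lxml"`, `"html5lib"`); which of these Scrachy treats as valid is an assumption, and the latter two must be installed separately.

```python
# settings.py (sketch; only the setting name comes from the documentation above)

# Assumption: any parser name understood by Beautiful Soup is acceptable.
# "html.parser" ships with Python, so it needs no extra dependency.
SCRACHY_CONTENT_BS4_PARSER = "html.parser"
```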

__init__(settings: Settings)

Parameters: settings – The Scrapy Settings.

Methods

__init__(settings)

A ContentExtractor that uses Beautiful Soup to process the HTML.

get_content(html)

Extracts the textual content from the HTML using a simple algorithm.

get_content(html: str) → str

Extracts the textual content from the HTML using a simple algorithm: it ignores blocks that are unlikely to contain meaningful content, such as script blocks, and then strips the tags from the remaining document.

Parameters: html – The HTML content as text.

Returns: The extracted text.
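
The two steps described for get_content (skip blocks unlikely to hold content, then strip the remaining tags) can be sketched as follows. This is not Scrachy's implementation: it uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without dependencies, and the names `IGNORED_TAGS`, `TextOnlyParser`, and `sketch_get_content` are hypothetical. The exact set of ignored tags is also an assumption, since the documentation only names script blocks as an example.

```python
from html.parser import HTMLParser

# Assumption: the real extractor's list of ignored blocks is not documented here.
IGNORED_TAGS = {"script", "style", "noscript", "template"}


class TextOnlyParser(HTMLParser):
    """Collects text nodes, skipping anything nested inside an ignored tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside an ignored block
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Tags never reach this method, so keeping only the data strips them.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return " ".join(self._chunks)


def sketch_get_content(html: str) -> str:
    """Return the visible text of `html`, mirroring the documented signature."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `sketch_get_content` on a page that contains a script block and a paragraph returns the paragraph text with no markup and none of the script source.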