scrachy.content.bs4.BeautifulSoupExtractor

class scrachy.content.bs4.BeautifulSoupExtractor(settings: Settings)

Bases: BaseContentExtractor

A ContentExtractor that uses Beautiful Soup to process the HTML.

The SCRACHY_CONTENT_BS4_PARSER setting must be set to a valid Beautiful Soup parser name, e.g., html.parser, lxml, or html5lib.

Parameters:

settings – The Scrapy Settings.
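
As a minimal sketch of the required configuration, the fragment below sets the parser in a Scrapy project's settings module. The choice of "html.parser" is only an assumption for illustration; it is the parser bundled with Python, while "lxml" and "html5lib" are other names Beautiful Soup accepts when those packages are installed.

```python
# settings.py (Scrapy project settings module)

# Name of the parser Beautiful Soup should use. "html.parser" is chosen here
# only for illustration because it needs no extra dependency.
SCRACHY_CONTENT_BS4_PARSER = "html.parser"
```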

Methods

__init__(settings)

A ContentExtractor that uses Beautiful Soup to process the HTML.

get_content(html)

Extracts the textual content from the html using a simple algorithm.

get_content(html: str) → str

Extracts the textual content from the html using a simple algorithm: it ignores blocks that are unlikely to contain meaningful content, e.g., script blocks, and then strips the tags from the remaining document.
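
The two steps described above can be sketched directly with Beautiful Soup. This is an illustration under stated assumptions, not scrachy's actual implementation: the function name extract_text and the list of ignored tags are hypothetical, since this page does not say which blocks the extractor drops beyond script blocks.

```python
from bs4 import BeautifulSoup

# Assumption: the blocks treated as "unlikely to contain meaningful content".
# The real list used by the extractor is not given in this page.
IGNORED_TAGS = ["script", "style"]


def extract_text(html: str, parser: str = "html.parser") -> str:
    """Hypothetical helper: drop ignored blocks, then strip remaining tags."""
    soup = BeautifulSoup(html, parser)

    # Step 1: remove every ignored block together with its contents.
    for tag in soup.find_all(IGNORED_TAGS):
        tag.decompose()

    # Step 2: strip the tags from what is left, keeping only the text.
    return soup.get_text(separator=" ", strip=True)


sample = (
    "<div><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script>"
    "<style>p { color: red; }</style></div>"
)
text = extract_text(sample)
```

Here the parser argument plays the role that the SCRACHY_CONTENT_BS4_PARSER setting plays for the extractor.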

Parameters:

html – The HTML content as text.

Returns:

The extracted text.