scrachy.content.boilerpipe.BoilerpipeExtractor

class scrachy.content.boilerpipe.BoilerpipeExtractor(settings: Settings)

Bases: BaseContentExtractor

A ContentExtractor that uses BoilerPy3 to process the HTML.

The SCRACHY_BOILERPY_EXTRACTOR setting must be set to a valid BoilerPy3 extractor.

Parameters:

settings – The settings to use for initialization.

__init__(settings: Settings)

Initialize the extractor from the given settings.

The SCRACHY_BOILERPY_EXTRACTOR setting must be set to a valid BoilerPy3 extractor.

Parameters:

settings – The settings to use for initialization.

Methods

__init__(settings)

A ContentExtractor that uses BoilerPy3 to process the HTML.

cleanup(text)

Apply a few simple rules to clean up the text extracted from an HTML document.

get_content(html)

Get the desired textual content from the HTML.

static cleanup(text: str) → str

Apply a few simple rules to clean up the text extracted from an HTML document:

  1. Split on line breaks.

  2. Strip whitespace from the start and end of each line.

  3. Replace each run of consecutive whitespace characters with a single space.

  4. Remove any empty lines.

Parameters:

text – The content to clean up.

Returns:

The text after applying these rules.
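
The four rules above can be sketched in a few lines of Python. This is an illustrative approximation, not the library's actual implementation: the function name cleanup_sketch is hypothetical, and rejoining the surviving lines with a newline is an assumption, since the documentation does not say how the lines are recombined.

```python
import re


def cleanup_sketch(text: str) -> str:
    """Hypothetical re-implementation of the documented cleanup rules."""
    # 1. Split on line breaks.
    lines = text.splitlines()
    # 2. Strip whitespace from the start and end of each line.
    lines = [line.strip() for line in lines]
    # 3. Collapse each run of whitespace characters into a single space.
    lines = [re.sub(r"\s+", " ", line) for line in lines]
    # 4. Remove any empty lines.
    lines = [line for line in lines if line]
    # Assumption: the remaining lines are rejoined with newlines.
    return "\n".join(lines)


if __name__ == "__main__":
    print(cleanup_sketch("  Hello   world \n\n\t foo\tbar  \n"))
```

Because cleanup is a static method, the real version can presumably be called as BoilerpipeExtractor.cleanup(text) without constructing an extractor.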

get_content(html: str) → str

Get the desired textual content from the HTML.

Parameters:

html – The HTML text to process.

Returns:

The desired content (e.g., text with tags removed).
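
For orientation, the sketch below shows how text can be extracted with BoilerPy3 directly, which is the kind of call this method presumably wraps. It assumes the third-party boilerpy3 package is installed. The helper name extract_with_boilerpy3 and the lookup of an extractor class by name are hypothetical and are not taken from scrachy; how scrachy interprets SCRACHY_BOILERPY_EXTRACTOR is not specified here.

```python
def extract_with_boilerpy3(html: str, extractor_name: str = "ArticleExtractor") -> str:
    """Hypothetical helper: run a BoilerPy3 extractor over an HTML string."""
    # Imported lazily so the module loads even when boilerpy3 is absent.
    from boilerpy3 import extractors  # third-party: pip install boilerpy3

    # Assumption: the extractor is chosen by class name, e.g.
    # "ArticleExtractor", "DefaultExtractor" or "KeepEverythingExtractor".
    extractor_cls = getattr(extractors, extractor_name)
    extractor = extractor_cls()
    # BoilerPy3 extractors expose get_content(), which returns plain text.
    return extractor.get_content(html)


if __name__ == "__main__":
    page = "<html><body><p>Hello world, this is the main text.</p></body></html>"
    print(extract_with_boilerpy3(page, "KeepEverythingExtractor"))
```

Different extractors apply different heuristics, so the same page may yield more or less text depending on the one selected.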