scrachy.content.boilerpipe.BoilerpipeExtractor
- class scrachy.content.boilerpipe.BoilerpipeExtractor(settings: Settings)
Bases: BaseContentExtractor
A ContentExtractor that uses BoilerPy3 to process the HTML.
The SCRACHY_BOILERPY_EXTRACTOR setting must be set to a valid extractor.
- Parameters:
settings – The settings to use for initialization.
- __init__(settings: Settings)
A ContentExtractor that uses BoilerPy3 to process the HTML.
The SCRACHY_BOILERPY_EXTRACTOR setting must be set to a valid extractor.
- Parameters:
settings – The settings to use for initialization.
Methods
__init__(settings) – A ContentExtractor that uses BoilerPy3 to process the HTML.
cleanup(text) – Applies a few simple rules to clean up the text extracted from an HTML document.
get_content(html) – Get the desired textual content from the HTML.
- static cleanup(text: str) -> str
Applies a few simple rules to clean up the text extracted from an HTML document:
1. Split on line breaks.
2. Strip whitespace from the beginning and end of each line.
3. Replace each contiguous run of whitespace characters with a single space.
4. Remove any empty lines.
- Parameters:
text – The content to clean up.
- Returns:
The text after applying these rules.
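The four cleanup rules can be sketched in plain Python as follows. This is a minimal illustration that uses only the standard library, not the actual implementation of BoilerpipeExtractor.cleanup. The function name cleanup_sketch is hypothetical, and rejoining the surviving lines with a newline is an assumption, since the description does not say how the lines are recombined.

```python
import re


def cleanup_sketch(text: str) -> str:
    """Hypothetical re-creation of the documented cleanup rules."""
    cleaned_lines = []

    # Rule 1: split on line breaks.
    for line in text.splitlines():
        # Rule 2: strip whitespace from the beginning and end of the line.
        line = line.strip()

        # Rule 3: collapse each run of whitespace into a single space.
        line = re.sub(r"\s+", " ", line)

        # Rule 4: drop lines that are empty after the steps above.
        if line:
            cleaned_lines.append(line)

    # Assumption: the remaining lines are rejoined with newlines.
    return "\n".join(cleaned_lines)
```

Under these assumptions, a line containing only whitespace is removed entirely, and tabs or repeated spaces inside a line each become one space.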