scrachy.content.bs4.BeautifulSoupExtractor
- class scrachy.content.bs4.BeautifulSoupExtractor(settings: Settings)
Bases: BaseContentExtractor
A ContentExtractor that uses Beautiful Soup to process the HTML. The SCRACHY_CONTENT_BS4_PARSER setting must be set to a valid parser name.
- Parameters:
settings – The Scrapy Settings.
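As a hedged illustration of the parser setting, the fragment below shows what a project's settings.py might contain. The setting name comes from the description above; the value shown is only one plausible choice, and the tuple of names lists the parsers Beautiful Soup itself commonly accepts, not a list confirmed by scrachy.

```python
# settings.py (sketch): choose the parser Beautiful Soup should use.
# "html.parser" ships with Python; "lxml" and "html5lib" are separate
# packages that must be installed before they can be selected.
SCRACHY_CONTENT_BS4_PARSER = "html.parser"

# Parser names Beautiful Soup commonly recognizes (for reference only;
# this tuple is illustrative and is not read by scrachy).
COMMON_BS4_PARSERS = ("html.parser", "lxml", "lxml-xml", "html5lib")
```

The built-in parser needs no extra install, which makes it a reasonable default for a first run; lxml is usually faster and html5lib more lenient with malformed markup.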
- __init__(settings: Settings)
A ContentExtractor that uses Beautiful Soup to process the HTML. The SCRACHY_CONTENT_BS4_PARSER setting must be set to a valid parser name.
- Parameters:
settings – The Scrapy Settings.
Methods
__init__(settings): A ContentExtractor that uses Beautiful Soup to process the HTML.
get_content(html): Extracts the textual content from the html using a simple algorithm described here.
- get_content(html: str) → str
Extracts the textual content from the html using a simple algorithm described here. In short, it ignores blocks that are unlikely to contain meaningful content, e.g., script blocks, and then strips the tags from the remaining document.
- Parameters:
html – The html content as text.
- Returns:
The extracted text.
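The approach described for get_content (skip blocks that are unlikely to hold meaningful content, then strip the tags from what remains) can be sketched as follows. This is not scrachy's implementation: it uses Python's standard-library html.parser instead of Beautiful Soup so that it runs without extra dependencies, and the names SKIPPED_TAGS, _TextCollector, and extract_text are hypothetical, as is the choice of which tags to skip.

```python
from html.parser import HTMLParser

# Assumed list of tags whose contents are not meaningful text.
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring anything inside a skipped tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped block
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped blocks.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML string (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return collector.text()
```

Calling extract_text on a fragment that mixes markup with a script block returns only the text outside the script, joined with single spaces. With Beautiful Soup, the same idea is typically expressed by removing the unwanted tags from the parsed tree and then asking for the remaining text.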