If the HTML string passed to the BeautifulSoup constructor contains HTML entities, they are converted to Unicode characters.
An HTML entity is a string that begins with an ampersand ( & ) and ends with a semicolon ( ; ). Entities are used to display reserved characters (which would otherwise be interpreted as HTML code). Some examples of HTML entities are −
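Character | Description | Entity name | Entity number |
< | less than | &lt; | &#60; |
> | greater than | &gt; | &#62; |
& | ampersand | &amp; | &#38; |
" | double quote | &quot; | &#34; |
' | single quote | &apos; | &#39; |
“ | left double quote | &ldquo; | &#8220; |
” | right double quote | &rdquo; | &#8221; |
£ | pound | &pound; | &#163; |
¥ | yen | &yen; | &#165; |
€ | euro | &euro; | &#8364; |
© | copyright | &copy; | &#169; |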
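Most entities can be written either by name (such as &lt;) or by number (such as &#60;), and both forms stand for the same character. The short sketch below checks this with the html module from the Python standard library rather than with Beautiful Soup, purely as an illustration; the list named pairs is a hypothetical selection of a few common entities.

```python
import html

# (entity name, entity number) pairs for a few common entities
pairs = [
    ("&lt;", "&#60;"),
    ("&amp;", "&#38;"),
    ("&ldquo;", "&#8220;"),
    ("&euro;", "&#8364;"),
    ("&copy;", "&#169;"),
]

# Decode both spellings of every entity
decoded = [(html.unescape(name), html.unescape(number)) for name, number in pairs]

for (name, number), (from_name, from_number) in zip(pairs, decoded):
    # Both spellings should print the same character
    print(name, number, "->", from_name, from_number)
```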
By default, the only characters that are escaped on output are bare ampersands and angle brackets. These are turned into "&amp;", "&lt;", and "&gt;".
All other entities are converted to Unicode characters and are written out as those characters.
Example
from bs4 import BeautifulSoup

# &ldquo; and &rdquo; are decoded to curly quotes while parsing
soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", 'html.parser')
print(str(soup))
Output
Hello “World!”
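The default escaping rule can be checked the same way. The following is a minimal sketch, assuming the html.parser backend; the markup string is made up for illustration. It shows that &lt; and &amp; are decoded while parsing and escaped again on output, whereas &copy; stays a plain Unicode character.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mixing reserved characters with a non-reserved entity
markup = "<p>5 &lt; 7 &amp; &copy; 2024</p>"
soup = BeautifulSoup(markup, "html.parser")

text = soup.p.get_text()   # the parsed text holds real characters
rendered = str(soup)       # default ("minimal") output formatting

print(text)
print(rendered)
```

Here the text contains a literal "<", "&" and "©", while the rendered markup escapes only the first two.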
If you then convert the document to a bytestring, the Unicode characters will be encoded as UTF-8. You won't get the HTML entities back −
Example
from bs4 import BeautifulSoup

soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", 'html.parser')
# encode() returns bytes, UTF-8 by default
print(soup.encode())
Output
b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'
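As a rough illustration of how the target encoding affects the bytes, the sketch below encodes the same document as UTF-8 and as ASCII. It assumes that encode() keeps its documented default of replacing unencodable characters with numeric character references; the variable names are chosen here for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", "html.parser")

# UTF-8 can represent the curly quotes directly (three bytes each)
utf8_bytes = soup.encode("utf-8")

# ASCII cannot, so the quotes are assumed to fall back to numeric references
ascii_bytes = soup.encode("ascii")

print(utf8_bytes)
print(utf8_bytes.decode("utf-8"))
print(ascii_bytes)
```

In both cases the named entities &ldquo; and &rdquo; from the input do not reappear.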
To change this behavior, provide a value for the formatter argument of the prettify() method (decode() and encode() accept the same argument). The formatter can take the following values −
formatter="minimal" − This is the default. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML
formatter="html" − Beautiful Soup will convert Unicode characters to HTML entities whenever possible.
formatter="html5" − It's similar to formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like "br"
formatter=None − Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML
Example
from bs4 import BeautifulSoup

# The input uses entities for the angle brackets and the accented letter
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, 'html.parser')

print("minimal: ")
print(soup.prettify(formatter="minimal"))
print("html: ")
print(soup.prettify(formatter="html"))
print("None: ")
print(soup.prettify(formatter=None))
Output
minimal:
<p>
 Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
</p>
html:
<p>
 Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
</p>
None:
<p>
 Il a dit <<Sacré bleu!>>
</p>
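The example above does not exercise formatter="html5". The following minimal sketch compares it with formatter="html" on a void tag. It assumes a Beautiful Soup release recent enough to know the "html5" formatter, and it uses decode() instead of prettify() to keep the output on one line.

```python
from bs4 import BeautifulSoup

# A single void element, parsed with the built-in parser
br = BeautifulSoup("<br>", "html.parser").br

as_html = br.decode(formatter="html")    # keeps the closing slash
as_html5 = br.decode(formatter="html5")  # omits the closing slash

print(as_html)
print(as_html5)
```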
In addition, the Beautiful Soup library provides formatter classes. You can pass an instance of any of these classes as the formatter argument to the prettify() method.
HTMLFormatter class − Used to customize the formatting rules for HTML documents.
XMLFormatter class − Used to customize the formatting rules for XML documents.
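As a sketch of how such a class might be used, the example below builds an HTMLFormatter with a custom substitution function. The function numeric_entities is hypothetical and written only for this illustration: it escapes the reserved characters and writes every non-ASCII character as a numeric entity. The entity_substitution keyword is assumed to be available, which should be the case in releases that ship the bs4.formatter module.

```python
import html

from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


def numeric_entities(text):
    """Escape &, <, > and write each non-ASCII character as &#NNN;."""
    escaped = html.escape(text, quote=False)
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in escaped)


# Hypothetical formatter that applies the function above to every string
formatter = HTMLFormatter(entity_substitution=numeric_entities)

soup = BeautifulSoup("<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>", "html.parser")
result = soup.p.decode(formatter=formatter)
print(result)
```

The same formatter object can be passed to prettify() or encode() in place of a formatter name.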