Beautiful Soup – Output Formatting




If the HTML string passed to the BeautifulSoup constructor contains HTML entities, they are converted to Unicode characters.

An HTML entity is a string that begins with an ampersand (&) and ends with a semicolon (;). Entities are used to display reserved characters, which would otherwise be interpreted as HTML code, as well as characters that are hard to type. Some examples of HTML entities are −













Character  Description         Entity name  Entity number
<          less than           &lt;         &#60;
>          greater than        &gt;         &#62;
&          ampersand           &amp;        &#38;
"          double quote        &quot;       &#34;
'          single quote        &apos;       &#39;
“          left double quote   &ldquo;      &#8220;
”          right double quote  &rdquo;      &#8221;
£          pound               &pound;      &#163;
¥          yen                 &yen;        &#165;
€          euro                &euro;       &#8364;
©          copyright           &copy;       &#169;

By default, the only characters that are escaped on output are bare ampersands and angle brackets. These are turned into “&amp;”, “&lt;”, and “&gt;”.

All other entities stay as the Unicode characters they were converted to.
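
The escaping rule can be checked with a minimal sketch. The input string below is made up for illustration, and the built-in html.parser is assumed. The parsed text should hold plain Unicode characters, while the serialized output should escape only the ampersand and the angle bracket:

```python
from bs4 import BeautifulSoup

# Illustrative input (not from the original text): a bare ampersand,
# an escaped angle bracket, and a named entity.
markup = "<p>Dewey & Howe: 1 &lt; 2 &copy;</p>"
soup = BeautifulSoup(markup, "html.parser")

text = soup.p.get_text()  # parsed text: entities are already Unicode
rendered = str(soup)      # serialized with the default "minimal" formatter

print(text)
print(rendered)
```

Comparing the two printed lines shows which characters were re-escaped on the way out.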

Example


from bs4 import BeautifulSoup

# &ldquo; and &rdquo; are the entities for curly double quotes
soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", 'html.parser')
print(str(soup))

Output


Hello “World!”

If you then convert the document to a bytestring, the Unicode characters will be encoded as UTF-8. You won't get the HTML entities back −

Example


from bs4 import BeautifulSoup

soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", 'html.parser')
# encode() returns bytes, UTF-8 by default
print(soup.encode())

Output


b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'
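
The target encoding also affects the result. The sketch below assumes ASCII as the target, an illustrative choice that is not part of the example above. Characters that the encoding cannot represent should come out as numeric character references rather than named entities:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", "html.parser")

utf8_bytes = soup.encode()          # default encoding is UTF-8
ascii_bytes = soup.encode("ascii")  # curly quotes do not exist in ASCII

print(utf8_bytes)
print(ascii_bytes)
```

The second value stays valid HTML, but the named entities from the input are still not restored.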

To change how strings are escaped on output, pass a value for the formatter argument of prettify(), encode(), or decode(). The formatter argument accepts the following values.

formatter="minimal" − This is the default. Strings are processed only enough to ensure that Beautiful Soup generates valid HTML/XML.

formatter="html" − Beautiful Soup converts Unicode characters to HTML entities whenever possible.

formatter="html5" − Similar to formatter="html", but Beautiful Soup omits the closing slash in HTML void tags like "br".

formatter=None − Beautiful Soup does not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.

Example


from bs4 import BeautifulSoup

# The input uses entities for the angle brackets and the accented e
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, 'html.parser')
print("minimal: ")
print(soup.prettify(formatter="minimal"))
print("html: ")
print(soup.prettify(formatter="html"))
print("None: ")
print(soup.prettify(formatter=None))

Output


minimal: 
<p>
 Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
</p>

html: 
<p>
 Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
</p>

None: 
<p>
 Il a dit <<Sacré bleu!>>
</p>
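
The example above does not exercise formatter="html5". Here is a minimal sketch of how it differs from "html". It assumes a lone br tag as input and uses decode(), which takes the same formatter argument as prettify():

```python
from bs4 import BeautifulSoup

# A void tag makes the difference between the two formatters visible.
br = BeautifulSoup("<br>", "html.parser").br

as_html = br.decode(formatter="html")    # keeps the closing slash
as_html5 = br.decode(formatter="html5")  # omits the closing slash

print(as_html)
print(as_html5)
```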

In addition, the Beautiful Soup library provides formatter classes in the bs4.formatter module. You can pass an instance of either class as the formatter argument of prettify().

HTMLFormatter class − Used to customize the formatting rules for HTML documents.

XMLFormatter class − Used to customize the formatting rules for XML documents.
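
Here is a minimal sketch of a custom formatter, assuming HTMLFormatter is imported from bs4.formatter and accepts an entity_substitution callable. The shout function is a hypothetical helper made up for this illustration. It replaces the usual entity escaping, so the output is no longer guaranteed to be valid HTML:

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

def shout(text):
    # Hypothetical substitution function: called on each string and
    # attribute value in place of the usual entity escaping.
    return text.upper()

soup = BeautifulSoup(
    "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>", "html.parser"
)
formatter = HTMLFormatter(entity_substitution=shout)

# The formatter object is passed where a string such as "html" would go.
result = soup.p.decode(formatter=formatter)
print(result)
```

The same formatter object can be passed to prettify() or encode() in place of one of the string values.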
