Beautiful Soup – stripped_strings Property

Method Description

The stripped_strings property of a Tag or BeautifulSoup object behaves like the strings property, except that leading and trailing whitespace is stripped from each string, and strings consisting only of whitespace (such as the line breaks between tags) are skipped. It returns a generator that yields the stripped strings of all the elements inside the object it is called on.

Syntax

    Tag.stripped_strings

Example 1

In the example below, the strings of all the elements in the document tree parsed into a BeautifulSoup object are displayed after stripping.

    from bs4 import BeautifulSoup

    markup = """
    <div id="Languages">
    <p>Java</p> <p>Python</p> <p>C++</p>
    </div>
    """
    soup = BeautifulSoup(markup, 'html.parser')
    print([string for string in soup.stripped_strings])

Output

    ['Java', 'Python', 'C++']

Compared to the output of the strings property, the line breaks and surrounding whitespace are removed.

Example 2

Here we extract the strings of each of the child elements under the <div> tag.

    tag = soup.div
    navstrs = tag.stripped_strings
    for navstr in navstrs:
        print(navstr)

Output

    Java
    Python
    C++
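To see what the stripping changes, the following minimal sketch collects both properties for the same markup. The variable names raw and clean are illustrative only.

```python
from bs4 import BeautifulSoup

markup = """
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

# .strings yields every text node, including whitespace-only ones
raw = list(soup.strings)

# .stripped_strings skips whitespace-only nodes and strips the rest
clean = list(soup.stripped_strings)

print(raw)
print(clean)
```

The raw list contains the newline and space strings that sit between the tags, while the clean list holds only the three language names.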
Beautiful Soup – Error Handling

While parsing an HTML/XML document with Beautiful Soup, you may encounter errors that come not from your script but from the structure of the markup being parsed.

By default, Beautiful Soup parses documents as HTML. It can also parse XML, and it handles ill-formed markup gracefully. To parse a document as XML, you need to have the lxml parser installed, and you pass "xml" (or "lxml-xml") as the second argument to the BeautifulSoup constructor:

    soup = BeautifulSoup(markup, "lxml-xml")

or

    soup = BeautifulSoup(markup, "xml")

One common XML parsing error is:

    AttributeError: 'NoneType' object has no attribute 'attrib'

This can happen when an element is missing or not defined in the document: find() returns None when nothing matches, so accessing an attribute on that result fails.

Apart from the parsing errors mentioned above, you may run into environmental issues: a script might work on one operating system but not on another, or in one virtual environment but not in another, or not outside the virtual environment. These issues usually arise because the two environments have different parser libraries available. It is therefore a good idea to know which parser is the default in your current working environment, or to pass the required parser library explicitly as the second argument to the BeautifulSoup constructor.

As HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. If you want to preserve mixed-case or uppercase tags and attributes, it is better to parse the document as XML.
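One way to make a script behave the same across environments is to name the parsers you accept and fall back in a fixed order. The sketch below assumes a helper called make_soup, which is a hypothetical name and not part of Beautiful Soup; it relies on the FeatureNotFound exception that the BeautifulSoup constructor raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not available in the current environment
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<p>Hello <b>World</b></p>")
print("parser in use:", used)
print(soup.b.string)
```

Since html.parser ships with Python, listing it last means the function always finds a parser, and the returned name tells you which one actually built the tree.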
UnicodeEncodeError

Let us look at the code segment below:

Example

    soup = BeautifulSoup(response, "html.parser")
    print(soup)

Output

    UnicodeEncodeError: 'charmap' codec can't encode character '\u011f'

This problem usually has one of two causes. First, you may be trying to print a Unicode character that your console does not know how to display. Second, you may be writing to a file and passing in a Unicode character that is not supported by your default encoding.

One way to resolve the problem is to encode the response text before making the soup, as follows:

    responseTxt = response.text.encode('UTF-8')

KeyError: [attr]

This error is caused by accessing tag['attr'] when the tag in question does not define the attr attribute. The most common errors are "KeyError: 'href'" and "KeyError: 'class'". Use tag.get('attr') if you are not sure attr is defined.

    for item in soup.find_all('a'):
        try:
            if (item['href'].startswith('/') or "tutorialspoint" in item['href']):
                (…)
        except KeyError:
            pass  # or some other fallback action

AttributeError

You may encounter an AttributeError such as the following:

    AttributeError: 'list' object has no attribute 'find_all'

This error mainly occurs because you expected find_all() to return a single tag or string. However, find_all() returns a list-like ResultSet of elements. All you need to do is iterate through the result and read the data from each element.

To keep such errors from stopping a run, you can catch them and skip the offending result, so that a malformed snippet is not inserted into the database:

    except (AttributeError, KeyError) as er:
        pass
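The KeyError and AttributeError cases above can often be avoided without try/except. The following sketch uses tag.get() for attributes that may be absent and checks for None before searching inside the result of find(). The sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<a href="/index.htm">Home</a> <a name="top">Anchor without href</a>'
soup = BeautifulSoup(html, "html.parser")

links = []
for item in soup.find_all("a"):
    href = item.get("href")      # returns None instead of raising KeyError
    if href is None:
        continue                 # skip anchors that define no href
    links.append(href)

missing = soup.find("table")     # find() returns None when nothing matches
rows = missing.find_all("tr") if missing is not None else []

print(links)
print(rows)
```

Checking the result of find() before chaining another call is what prevents the "'NoneType' object has no attribute ..." family of errors.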
Beautiful Soup – Trouble Shooting

If you run into problems while trying to parse an HTML/XML document, the most likely cause is the way the parser in use interprets the document. To help you locate and correct the problem, the Beautiful Soup API provides a diagnose() utility.

The diagnose() function is a diagnostic suite for isolating common problems. If you have difficulty understanding what Beautiful Soup is doing to a document, pass the document as an argument to diagnose(). It prints a report showing how the different parsers handle the document, and tells you if you are missing a parser.

The diagnose() function is defined in the bs4.diagnose module. Its output starts with a message as follows:

Example

    from bs4.diagnose import diagnose

    diagnose(markup)

Output

    Diagnostic running on Beautiful Soup 4.12.2
    Python version 3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
    Found lxml version 4.9.2.0
    Found html5lib version 1.1
    Trying to parse your markup with html.parser
    Here's what html.parser did with the markup:

If it does not find one of these parsers, a corresponding message appears instead:

    I noticed that html5lib is not installed. Installing it may help.

If the HTML document fed to diagnose() is perfectly formed, the trees produced by the different parsers will be essentially identical. If it is not properly formed, each parser interprets it differently. If you do not get the tree you anticipate, changing the parser might help.

Sometimes you may have chosen an HTML parser for an XML document. The HTML parser then parses the document incorrectly, adding HTML tags along the way. Looking at the diagnose() output, you will spot the mistake and be able to correct it.

If Beautiful Soup raises HTMLParser.HTMLParseError, try changing the parser. The parse errors HTMLParser.HTMLParseError: malformed start tag and HTMLParser.HTMLParseError: bad end tag are both generated by Python's built-in HTML parser library, and the solution is to install lxml or html5lib.

If you encounter SyntaxError: Invalid syntax (on the line ROOT_TAG_NAME = '[document]'), it is caused by running an old Python 2 version of Beautiful Soup under Python 3 without converting the code.

An ImportError with the message No module named HTMLParser is also caused by running an old Python 2 version of Beautiful Soup under Python 3, whereas ImportError: No module named html.parser is caused by running the Python 3 version of Beautiful Soup under Python 2.

If you get ImportError: No module named BeautifulSoup, more often than not it is because you are running Beautiful Soup 3 code on a system that does not have BS3 installed, or because you are writing Beautiful Soup 4 code without knowing that the package name has changed to bs4.

Finally, ImportError: No module named bs4 means that you are running Beautiful Soup 4 code on a system that does not have BS4 installed.
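Because diagnose() prints its report rather than returning it, you may want to capture the text, for instance to attach it to a bug report. The sketch below does this with the standard library's redirect_stdout; the sample markup is deliberately malformed and purely illustrative, and the exact report depends on your installed versions.

```python
import contextlib
import io

from bs4.diagnose import diagnose

markup = "<p>Unclosed paragraph<b>bold"

# diagnose() prints its report to standard output; capture it here
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    diagnose(markup)

report = buffer.getvalue()
print(report.splitlines()[0])    # first line names the Beautiful Soup version
```

The remaining lines of report show what each available parser made of the markup, which is usually enough to see where two parsers disagree.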
Beautiful Soup – Parsing XML
Beautiful Soup can also parse an XML document. You need to pass the features="xml" argument to the BeautifulSoup() constructor.

Assume that we have the following books.xml in the current working directory:

Example

    <?xml version="1.0" ?>
    <books>
       <book>
          <title>Python</title>
          <author>TutorialsPoint</author>
          <price>400</price>
       </book>
    </books>

The following code parses the given XML file:

    from bs4 import BeautifulSoup

    fp = open("books.xml")
    soup = BeautifulSoup(fp, features="xml")
    print(soup)
    print('type:', type(soup))

When the above code is executed, you should get the following result:

    <?xml version="1.0" encoding="utf-8"?>
    <books>
    <book>
    <title>Python</title>
    <author>TutorialsPoint</author>
    <price>400</price>
    </book>
    </books>
    type: <class 'bs4.BeautifulSoup'>

XML Parser Error

By default, Beautiful Soup parses documents as HTML. To parse a document as XML, you need to have the lxml parser installed, and you pass "xml" (or "lxml-xml") as the second argument to the BeautifulSoup constructor:

    soup = BeautifulSoup(markup, "lxml-xml")

or

    soup = BeautifulSoup(markup, "xml")

One common XML parsing error is:

    AttributeError: 'NoneType' object has no attribute 'attrib'

This can happen when an element is missing or not defined in the document: find() returns None when nothing matches, so accessing an attribute on that result fails.
Beautiful Soup – Convert Object to String

The Beautiful Soup API has three main types of objects: the soup object, the Tag object, and the NavigableString object. Let us find out how to convert each of these objects to a string. In Python, a string is a str object.

Assume that we have the following HTML document:

    html = """
    <p>Hello <b>World</b></p>
    """

Let us pass this string as the argument to the BeautifulSoup constructor. The soup object is then converted to a string with Python's built-in str() function.

The parsed tree of this HTML string depends on which parser you use. The built-in html.parser does not add the <html> and <body> tags.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    print(str(soup))

Output

    <p>Hello <b>World</b></p>

On the other hand, the html5lib parser constructs the tree after inserting the formal tags such as <html> and <body>:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html5lib')
    print(str(soup))

Output

    <html><head></head><body><p>Hello <b>World</b></p>
    </body></html>

The Tag object has a string property that returns a NavigableString object.

    tag = soup.find('b')
    obj = tag.string
    print(type(obj), obj)

Output

    <class 'bs4.element.NavigableString'> World

There is also a text property defined for the Tag object. It returns the text contained in the tag, stripping off all the inner tags and attributes.

If the HTML string is:

    html = """
    <p>Hello <div id="id">World</div></p>
    """

we can obtain the text property of the <p> tag as follows:

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('p')
    obj = tag.text
    print(type(obj), obj)

Output

    <class 'str'> Hello World

You can also use the get_text() method, which returns a string representing the text inside the tag. The text property is in fact a thin wrapper around get_text(); both get rid of the inner tags and attributes and return a string.

    obj = tag.get_text()
    print(type(obj), obj)

Output

    <class 'str'> Hello World
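The conversions described above can be placed side by side in one short sketch. The variable names are illustrative; the separator and strip arguments of get_text() are optional and are shown only to indicate how the joined text can be controlled.

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>World</b></p>"
soup = BeautifulSoup(html, "html.parser")

as_markup = str(soup.p)                   # markup of the tag, as a str
inner = str(soup.b.string)                # NavigableString cast to a plain str
plain = soup.p.get_text()                 # text only, inner tags removed
joined = soup.p.get_text(separator="|", strip=True)

print(as_markup)
print(inner, type(inner))
print(plain)
print(joined)
```

Use str() when you need the markup itself, and get_text() when you only need the readable text.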
Beautiful Soup – Comparing Objects

In Beautiful Soup, two NavigableString or Tag objects are equal if they represent the same HTML/XML markup.

In the example below, the two <b> tags are treated as equal even though they live in different parts of the object tree, because they both look like "<b>Java</b>".

Example

    from bs4 import BeautifulSoup

    markup = "<p>Learn <i>Python</i>, <b>Java</b>, advanced <i>Python</i> and advanced <b>Java</b>! from Tutorialspoint</p>"
    soup = BeautifulSoup(markup, "html.parser")
    b1 = soup.find('b')
    b2 = b1.find_next('b')
    print(b1 == b2)
    print(b1 is b2)

Output

    True
    False

In the following example, two NavigableString objects are compared.

Example

    from bs4 import BeautifulSoup

    markup = "<p>Learn <i>Python</i>, <b>Java</b>, advanced <i>Python</i> and advanced <b>Java</b>! from Tutorialspoint</p>"
    soup = BeautifulSoup(markup, "html.parser")
    i1 = soup.find('i')
    i2 = i1.find_next('i')
    print(i1.string == i2.string)
    print(i1.string is i2.string)

Output

    True
    False
Beautiful Soup – Output Formatting

If the HTML string given to the BeautifulSoup constructor contains any HTML entities, they are converted to Unicode characters.

An HTML entity is a string that begins with an ampersand ( & ) and ends with a semicolon ( ; ). Entities are used to display reserved characters (which would otherwise be interpreted as HTML code). Some examples of HTML entities are:

    Character   Description          Entity name   Entity number
    <           less than            &lt;          &#60;
    >           greater than         &gt;          &#62;
    &           ampersand            &amp;         &#38;
    "           double quote         &quot;        &#34;
    '           single quote         &apos;        &#39;
    “           left double quote    &ldquo;       &#8220;
    ”           right double quote   &rdquo;       &#8221;
    £           pound                &pound;       &#163;
    ¥           yen                  &yen;         &#165;
    €           euro                 &euro;        &#8364;
    ©           copyright            &copy;        &#169;

By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into "&amp;", "&lt;", and "&gt;". All other entities remain as the Unicode characters they were converted to.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", 'html.parser')
    print(str(soup))

Output

    Hello “World!”

If you then convert the document to a bytestring, the Unicode characters are encoded as UTF-8. You will not get the HTML entities back:

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("Hello &ldquo;World!&rdquo;", 'html.parser')
    print(soup.encode())

Output

    b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'

To change this behavior, provide a value for the formatter argument of the prettify() method. The possible values for the formatter are:

formatter="minimal": This is the default. Strings are only processed enough to ensure that Beautiful Soup generates valid HTML/XML.

formatter="html": Beautiful Soup converts Unicode characters to HTML entities whenever possible.

formatter="html5": Similar to formatter="html", but Beautiful Soup omits the closing slash in HTML void tags like "br".

formatter=None: Beautiful Soup does not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML.

Example

    from bs4 import BeautifulSoup

    french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
    soup = BeautifulSoup(french, 'html.parser')
    print("minimal: ")
    print(soup.prettify(formatter="minimal"))
    print("html: ")
    print(soup.prettify(formatter="html"))
    print("None: ")
    print(soup.prettify(formatter=None))

Output

    minimal:
    <p>
     Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
    </p>
    html:
    <p>
     Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
    </p>
    None:
    <p>
     Il a dit <<Sacré bleu!>>
    </p>

In addition, the Beautiful Soup library provides formatter classes. You can pass an object of any of these classes as the formatter argument to the prettify() method.

HTMLFormatter class: Used to customize the formatting rules for HTML documents.

XMLFormatter class: Used to customize the formatting rules for XML documents.
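The formatter values can also be compared without pretty-printing by passing them to decode(), which accepts the same formatter argument. The sketch below additionally builds an HTMLFormatter around a small function named uppercase, which is an illustrative helper and not something the library supplies.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

soup = BeautifulSoup("<p>Sacr&eacute; bleu &amp; co</p>", "html.parser")
p = soup.p

minimal = p.decode(formatter="minimal")   # escapes only & < >
as_html = p.decode(formatter="html")      # named entities where possible
untouched = p.decode(formatter=None)      # strings written out as they are


def uppercase(text):
    # Replaces the default entity substitution with plain upper-casing
    return text.upper()


custom = p.decode(formatter=HTMLFormatter(entity_substitution=uppercase))

for label, value in [("minimal", minimal), ("html", as_html),
                     ("None", untouched), ("custom", custom)]:
    print(label, "->", value)
```

Note that a custom substitution function takes over escaping entirely, so the bare ampersand is no longer turned into an entity in the custom result.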
Beautiful Soup – string Property

Method Description

In Beautiful Soup, the soup and Tag objects have a convenience property called string. It returns the single string within a PageElement, Soup or Tag. If the element has a single string child, the NavigableString corresponding to it is returned. If the element has exactly one child tag, the return value is the string property of that child tag. If the element has no children, or more than one child, the string property returns None.

Syntax

    Tag.string

Example 1

The following code has an HTML string with a <div> tag that encloses three <p> elements. We read the string property of the first <p> tag.

    from bs4 import BeautifulSoup, NavigableString

    markup = """
    <div id="Languages">
    <p>Java</p> <p>Python</p> <p>C++</p>
    </div>
    """
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.p
    navstr = tag.string
    print(navstr, type(navstr))
    nav_str = str(navstr)
    print(nav_str, type(nav_str))

Output

    Java <class 'bs4.element.NavigableString'>
    Java <class 'str'>

The string property returns a NavigableString. It can be cast to a regular Python string with the str() function.

Example 2

The string property of an element that has more than one child returns None. Check this with the <div> tag.

    tag = soup.div
    navstr = tag.string
    print(navstr)

Output

    None
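The rules for the string property can be checked one case at a time. The markup below is made up for this sketch so that each case appears once.

```python
from bs4 import BeautifulSoup

markup = ("<div><p><b>Java</b></p>"
          "<ul><li>Python</li><li>C++</li></ul>"
          "<span></span></div>")
soup = BeautifulSoup(markup, "html.parser")

single_string = soup.b.string      # <b> has one string child
single_tag = soup.p.string         # <p> has one child tag, so <b>'s string is used
many_children = soup.ul.string     # <ul> has two children
no_children = soup.span.string     # <span> is empty

print(single_string, single_tag, many_children, no_children)
```

When string is None because there are several children, stripped_strings or get_text() is usually the property to reach for instead.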
Beautiful Soup – NavigableString Class

One of the main objects in the Beautiful Soup API is the object of the NavigableString class. It represents the string or text between the opening and closing counterparts of most HTML tags. For example, if <b>Hello</b> is the markup to be parsed, Hello is the NavigableString.

The NavigableString class is subclassed from the PageElement class in the bs4 package, as well as from Python's built-in str class. Hence, it inherits PageElement methods such as find_*(), insert, append, wrap and unwrap, as well as str methods such as upper, lower, find, isalpha etc.

The constructor of this class takes a single argument, a str object.

Example

    from bs4 import NavigableString

    new_str = NavigableString('world')

You can now use this NavigableString object to perform all kinds of operations on the parsed tree, such as append, insert, find etc.

In the following example, we append the newly created NavigableString object to an existing Tag object.

Example

    from bs4 import BeautifulSoup, NavigableString

    markup = '<b>Hello</b>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    new_str = NavigableString('world')
    tag.append(new_str)
    print(soup)

Output

    <b>Helloworld</b>

Note that the NavigableString is a PageElement, hence it can be appended to the soup object as well. Check the difference if we do so, starting again from freshly parsed markup.

Example

    soup = BeautifulSoup(markup, 'html.parser')
    new_str = NavigableString('world')
    soup.append(new_str)
    print(soup)

Output

    <b>Hello</b>world

As we can see, the string appears after the <b> tag.

The BeautifulSoup object also offers a new_string() method, which creates a new NavigableString associated with that BeautifulSoup object.

Let us use the new_string() method to create a NavigableString object and add it to the PageElements.
Example

    from bs4 import BeautifulSoup, NavigableString

    markup = '<b>Hello</b>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    ns = soup.new_string(' World')
    tag.append(ns)
    print(tag)
    soup.append(ns)
    print(soup)

Output

    <b>Hello World</b>
    <b>Hello</b> World

We find an interesting behaviour here. The same NavigableString object is first appended to a tag inside the tree and then to the soup object itself. The tag shows the appended string when it is printed, but once the string has been appended to the soup object, the text World appears at the end of the document and no longer shows inside the tag. This is because an element can occupy only one position in the tree: appending it a second time moves it out of its previous location.
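The move can be observed directly through the parent attribute of the string. The sketch below repeats the steps above and then shows one way to get the text in both places, namely by creating a separate string for each location; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Hello</b>", "html.parser")
tag = soup.b

ns = soup.new_string(" World")
tag.append(ns)
first_parent = ns.parent           # the <b> tag

soup.append(ns)                    # moves ns; it is removed from <b> first
second_parent = ns.parent          # now the BeautifulSoup object

# To have the text in both places, create a separate string for each
tag.append(soup.new_string(" World"))

print(tag)
print(soup)
```

After the last append, the tag holds its own copy of the text while the original string remains at the end of the document.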
Beautiful Soup – find vs find_all

The Beautiful Soup library includes the find() and find_all() methods. Both are among the most frequently used methods when parsing HTML or XML documents. In a given document tree, you often need to locate a PageElement of a certain tag type, or one having certain attributes or a certain CSS class. These criteria are given as arguments to both find() and find_all(). The main difference between the two is that find() locates the very first element that satisfies the criteria, whereas find_all() searches for all the elements that satisfy them.

The find() method is defined with the following syntax:

Syntax

    find(name, attrs, recursive, string, **kwargs)

The name argument specifies a filter on the tag name. With attrs, a filter on tag attribute values can be set up. If recursive is True (the default), the search covers all descendants; if it is False, only direct children are considered. You can also pass keyword arguments, which are treated as filters on attribute values.

    soup.find(id='nm')
    soup.find(attrs={"name": "marks"})

The find_all() method takes all the arguments of the find() method, and in addition a limit argument. It is an integer that restricts the search to the specified number of occurrences of the given filter criteria. If it is not set, find_all() searches for the criteria among all the elements under the said PageElement.

    soup.find_all('input')
    lst = soup.find_all('li', limit=2)

If the limit argument of find_all() is set to 1, it behaves much like find(), except that the result is still wrapped in a list.

The return types of the two methods differ. The find() method returns the first Tag or NavigableString object found, or None if nothing matches. The find_all() method returns a ResultSet consisting of all the PageElements that satisfy the filter criteria.

Here is an example that demonstrates the difference between the find() and find_all() methods.
Example

    from bs4 import BeautifulSoup

    markup = open("index.html")
    soup = BeautifulSoup(markup, 'html.parser')
    ret1 = soup.find('input')
    ret2 = soup.find_all('input')
    print(ret1, 'Return type of find:', type(ret1))
    print(ret2)
    print('Return type of find_all:', type(ret2))

    # set limit=1
    ret3 = soup.find_all('input', limit=1)
    print('find:', ret1)
    print('find_all:', ret3)

Output

    <input id="nm" name="name" type="text"/> Return type of find: <class 'bs4.element.Tag'>
    [<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]
    Return type of find_all: <class 'bs4.element.ResultSet'>
    find: <input id="nm" name="name" type="text"/>
    find_all: [<input id="nm" name="name" type="text"/>]
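The example above reads index.html, which is not shown here. The following self-contained sketch assumes a small form with three input fields, chosen to resemble the output above, and also covers the case where nothing matches.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for index.html: a form with three text inputs
html = """
<form>
<input type="text" id="nm" name="name">
<input type="text" id="age" name="age">
<input type="text" id="marks" name="marks">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("input")                  # a single Tag
all_inputs = soup.find_all("input")         # a ResultSet of Tags
limited = soup.find_all("input", limit=1)   # still a ResultSet, with one item

missing_one = soup.find("select")           # None when nothing matches
missing_all = soup.find_all("select")       # empty ResultSet when nothing matches

print(first["id"], len(all_inputs), len(limited))
print(missing_one, list(missing_all))
```

The practical consequence is that a find() result must be checked for None before use, whereas a find_all() result can always be iterated safely.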