Beautiful Soup – previous_elements Property

Method Description

In the Beautiful Soup library, the previous_elements property returns a generator that yields the strings and tags parsed before the current element. It walks backwards through the parse tree, so the nearest preceding element comes first.

Syntax

    Element.previous_elements

Return value

The previous_elements property returns a generator.

Example 1

The previous_elements property returns the tags and NavigableStrings appearing before the <p id="id1"> tag in the document string below:

    html = """
    <p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('p', id='id1')
    pres = tag.previous_elements
    print("Previous elements:")
    for pre in pres:
        print(pre)

Output

    Previous elements:
    Python
    <p>Python</p>
    Excellent
    <b>Excellent</b>
    <p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>

Example 2

All the elements appearing before the <u> tag are listed below. The '\n' entries are the line breaks between the tags in the source string.

    from bs4 import BeautifulSoup

    html = """
    <p>
    <b>Excellent</b><i>Python</i>
    </p>
    <u>Tutorial</u>
    """
    soup = BeautifulSoup(html, 'html.parser')
    tag1 = soup.find('u')
    print("previous elements:")
    print(list(tag1.previous_elements))

Output

    previous elements:
    ['\n', '\n', 'Python', <i>Python</i>, 'Excellent', <b>Excellent</b>, '\n', <p>
    <b>Excellent</b><i>Python</i>
    </p>, '\n']

Example 3

The BeautifulSoup object itself doesn't have any previous elements, so the loop below prints nothing after the heading. The index.html file is the form document used in the find() chapter.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html5lib')

    pres = soup.previous_elements
    print("Previous elements:")
    for pre in pres:
        print(pre.name)

Output

    Previous elements:
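Because previous_elements yields strings as well as tags, a common follow-up is to keep only the tags. The snippet below is a minimal sketch of that idea; the sample markup and the variable names are illustrative and not part of the original examples.

```python
from bs4 import BeautifulSoup, Tag

# Illustrative markup, written on one line so no whitespace strings appear.
html = "<p><b>Excellent</b><i>Python</i></p><u>Tutorial</u>"
soup = BeautifulSoup(html, "html.parser")
start = soup.find("u")

# Walk backwards from <u> and keep only Tag objects, skipping the strings.
previous_tag_names = [
    element.name
    for element in start.previous_elements
    if isinstance(element, Tag)
]
print(previous_tag_names)
```

The nearest tag comes first, so the list starts with the <i> tag, then <b>, then the enclosing <p>.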
Author: user
Beautiful Soup – find Method
Method Description

The find() method in Beautiful Soup searches the descendants of the element it is called on and returns the first one that matches the given criteria.

Syntax

    Soup.find(name, attrs, recursive, string, **kwargs)

Parameters

name − A filter on tag name.
attrs − A dictionary of filters on attribute values.
recursive − If True (the default), a recursive search of all descendants is performed. Otherwise, only the direct children are considered.
string − A filter on the text of a string inside the tag.
kwargs − Filters on attribute values, given as keyword arguments.

Unlike find_all(), find() has no limit parameter, because it always stops at the first match.

Return value

The find() method returns a Tag object or a NavigableString object, or None if nothing matches.

Example 1

Let us use the following HTML script (saved as index.html):

    <html>
       <head>
          <title>TutorialsPoint</title>
       </head>
       <body>
          <form>
             <input type = "text" id = "nm" name = "name">
             <input type = "text" id = "age" name = "age">
             <input type = "text" id = "marks" name = "marks">
          </form>
       </body>
    </html>

The following Python code finds the element whose id is nm:

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    obj = soup.find(id='nm')
    print(obj)

Output

    <input id="nm" name="name" type="text"/>

Example 2

The find() method returns the first tag in the parsed document that has the given attributes.

    obj = soup.find(attrs={"name": "marks"})
    print(obj)

Output

    <input id="marks" name="marks" type="text"/>

Example 3

If find() can't find anything, it returns None.

    obj = soup.find('dummy')
    print(obj)

Output

    None
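Since find() returns None when nothing matches, code that reads an attribute from the result usually checks for None first. The following is a minimal sketch; the helper name input_name is hypothetical and the markup is a shortened, inline version of the form above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the form, kept inline so the sketch needs no file.
html = """
<form>
  <input type="text" id="nm" name="name">
  <input type="text" id="age" name="age">
</form>
"""
soup = BeautifulSoup(html, "html.parser")


def input_name(soup, element_id):
    """Hypothetical helper: name attribute of the <input> with this id, or None."""
    tag = soup.find("input", id=element_id)
    if tag is None:          # find() found nothing
        return None
    return tag.get("name")   # .get() also returns None if the attribute is absent


print(input_name(soup, "age"))    # an id that exists
print(input_name(soup, "marks"))  # an id that is missing from this snippet
```

Returning None from the helper keeps the "not found" case explicit instead of raising an AttributeError on the missing tag.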
Beautiful Soup – previous_sibling Property

Method Description

Elements that share the same parent are called siblings. The previous_sibling property of a PageElement returns the element that comes immediately before the current one under the same parent. That element may be a tag or a string, including a string that holds only whitespace. To search the preceding siblings with a filter, use the related find_previous_sibling() method.

Syntax

    element.previous_sibling

Return type

The previous_sibling property returns a Tag or a NavigableString object, or None if there is no previous sibling.

Example 1

In the following code, the HTML string consists of two adjacent tags inside a <p> tag. It shows the sibling that appears before the <i> tag.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Hello</b><i>Python</i></p>", 'html.parser')
    tag = soup.i
    sibling = tag.previous_sibling
    print(sibling)

Output

    <b>Hello</b>

Example 2

We are using the index.html file for parsing. The page contains an HTML form with three input elements. Which element is the previous sibling of the input element whose id is age? Because the inputs sit on separate lines, the immediate previous sibling is a newline string, so the code steps back twice:

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    tag = soup.find('input', {'id': 'age'})
    sib = tag.previous_sibling.previous_sibling
    print(sib)

Output

    <input id="nm" name="name" type="text"/>

Example 3

First we find the <p> tag containing the string 'Tutorial' and then find the tag just before it.

    html = """
    <p>Excellent</p><p>Python</p><p>Tutorial</p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('p', string='Tutorial')
    print(tag.previous_sibling)

Output

    <p>Python</p>
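When the markup contains line breaks, previous_sibling often lands on a whitespace string rather than a tag. The sketch below contrasts it with find_previous_sibling(), which skips strings and returns the nearest matching tag. The markup is a trimmed, illustrative version of the form used above.

```python
from bs4 import BeautifulSoup

# Each <input> is on its own line, so newline strings sit between them.
html = """<form>
<input id="nm" name="name">
<input id="age" name="age">
</form>"""
soup = BeautifulSoup(html, "html.parser")
age = soup.find("input", id="age")

nearest = age.previous_sibling                    # the newline string
nearest_tag = age.find_previous_sibling("input")  # the nearest <input> tag

print(repr(nearest))
print(nearest_tag["id"])
```

Using find_previous_sibling() avoids chaining .previous_sibling.previous_sibling, which silently breaks if the whitespace in the document changes.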
Beautiful Soup – next_siblings Property

Method Description

Elements that share the same parent are called siblings. The next_siblings property in Beautiful Soup returns a generator object used to iterate over all the subsequent tags and strings under the same parent.

Syntax

    element.next_siblings

Return type

The next_siblings property returns a generator of sibling PageElements.

Example 1

The HTML form in index.html contains three input elements. The following script uses the next_siblings property to collect the next siblings of the input element whose id is nm:

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')

    tag = soup.find('input', {'id': 'nm'})
    siblings = tag.next_siblings
    print(list(siblings))

Output

    ['\n', <input id="age" name="age" type="text"/>, '\n', <input id="marks" name="marks" type="text"/>, '\n']

Example 2

The following code parses a short HTML snippet and traverses the next siblings of the <b> tag.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>Excellent</b><i>Python</i><u>Tutorial</u></p>", 'html.parser')
    tag1 = soup.b
    print("next siblings:")
    for tag in tag1.next_siblings:
        print(tag)

Output

    next siblings:
    <i>Python</i>
    <u>Tutorial</u>

Example 3

The next example shows that the only tag among the next siblings of <head> is the <body> tag.

    html = """
    <html>
    <head>
    <title>Hello</title>
    </head>
    <body>
    <p>Excellent</p><p>Python</p><p>Tutorial</p>
    </body>
    </html>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tags = soup.head.next_siblings
    print("next siblings:")
    for tag in tags:
        print(tag)

Output

    next siblings:


    <body>
    <p>Excellent</p><p>Python</p><p>Tutorial</p>
    </body>

The blank lines in the output appear because the line breaks between the tags are siblings too, and the generator yields them as strings.
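To iterate over sibling tags only, the whitespace strings can be filtered out with an isinstance check. This is a small sketch with illustrative markup; the list name following_items is an assumption for the example.

```python
from bs4 import BeautifulSoup, Tag

# Illustrative list with line breaks between the items.
html = """<ul>
<li id="first">Java</li>
<li>Python</li>
<li>C++</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("li", id="first")

# Keep only Tag siblings; the newline strings between items are skipped.
following_items = [
    sibling.get_text()
    for sibling in first.next_siblings
    if isinstance(sibling, Tag)
]
print(following_items)  # ['Python', 'C++']
```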
Beautiful Soup – descendants Property

Method Description

With the descendants property of a PageElement object in the Beautiful Soup API, you can traverse everything nested under it: its children, their children, and so on. The property returns a generator object that yields these elements depth-first, in document order. A depth-first traversal visits a node and then all of that node's own descendants before moving on to the node's next sibling.

Syntax

    tag.descendants

Return value

The descendants property returns a generator object.

Example 1

In the code below, we have an HTML document with nested unordered list tags. We iterate over all the elements of the parsed document in depth-first order.

    html = """
    <ul id="outer">
    <li class="mainmenu">Accounts</li>
    <ul>
    <li class="submenu">Anand</li>
    <li class="submenu">Mahesh</li>
    </ul>
    <li class="mainmenu">HR</li>
    <ul>
    <li class="submenu">Anil</li>
    <li class="submenu">Milind</li>
    </ul>
    </ul>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tags = soup.descendants
    for desc in tags:
        print(desc)

Output (shown without the blank lines produced by the whitespace-only strings)

    <ul id="outer">
    <li class="mainmenu">Accounts</li>
    <ul>
    <li class="submenu">Anand</li>
    <li class="submenu">Mahesh</li>
    </ul>
    <li class="mainmenu">HR</li>
    <ul>
    <li class="submenu">Anil</li>
    <li class="submenu">Milind</li>
    </ul>
    </ul>
    <li class="mainmenu">Accounts</li>
    Accounts
    <ul>
    <li class="submenu">Anand</li>
    <li class="submenu">Mahesh</li>
    </ul>
    <li class="submenu">Anand</li>
    Anand
    <li class="submenu">Mahesh</li>
    Mahesh
    <li class="mainmenu">HR</li>
    HR
    <ul>
    <li class="submenu">Anil</li>
    <li class="submenu">Milind</li>
    </ul>
    <li class="submenu">Anil</li>
    Anil
    <li class="submenu">Milind</li>
    Milind

Example 2

In the following example, we list the descendants of the <head> tag.

    html = """
    <html><head><title>TutorialsPoint</title></head>
    <body>
    <p>Hello World</p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.head
    for element in tag.descendants:
        print(element)

Output

    <title>TutorialsPoint</title>
    TutorialsPoint
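A quick way to check the traversal order is to compare children, which yields direct children only, with descendants on a small tree. The sketch below uses illustrative markup. The nested <li> tags show up before the later <p> sibling, which is what a depth-first walk in document order produces.

```python
from bs4 import BeautifulSoup, Tag

# <div> has two direct children: <ul> (with two <li>) and <p>.
html = "<div><ul><li>A</li><li>B</li></ul><p>C</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Direct children only.
child_names = [child.name for child in div.children]

# Every nested tag, in the order the generator yields them.
descendant_names = [
    node.name for node in div.descendants if isinstance(node, Tag)
]

print(child_names)       # ['ul', 'p']
print(descendant_names)  # ['ul', 'li', 'li', 'p']
```

A breadth-first walk would list 'p' before the two 'li' entries, so the second list confirms the depth-first order.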
Beautiful Soup – strings Property

Method Description

For any PageElement that contains more than one string, the text of each can be fetched with the strings property. Unlike the string property, strings handles the case where the element has multiple children. The strings property returns a generator object. It yields, in document order, every NavigableString found anywhere inside the element.

Syntax

    Tag.strings

Example 1

You can read the strings property of the soup as well as of a tag object. In the following example, the soup object's strings property is checked.

    from bs4 import BeautifulSoup, NavigableString

    markup = """
    <div id="Languages">
    <p>Java</p> <p>Python</p> <p>C++</p>
    </div>
    """
    soup = BeautifulSoup(markup, 'html.parser')
    print([string for string in soup.strings])

Output

    ['\n', '\n', 'Java', ' ', 'Python', ' ', 'C++', '\n', '\n']

Note the line breaks and white spaces in the list. We can remove them with the stripped_strings property.

Example 2

We now obtain the generator object returned by the strings property of the <div> tag. With a loop, we print the strings.

    tag = soup.div
    navstrs = tag.strings
    for navstr in navstrs:
        print(navstr)

Output

    Java

    Python

    C++

Note that the line breaks and whitespace strings appear in the output as blank lines. They can be removed with the stripped_strings property.
Beautiful Soup – stripped_strings Property

Method Description

The stripped_strings property of a Tag or BeautifulSoup object works like the strings property, except that leading and trailing whitespace is stripped from each string, and strings consisting only of whitespace are skipped. It returns a generator of the text strings found inside the object.

Syntax

    Tag.stripped_strings

Example 1

In the example below, the strings of all the elements in the document tree parsed into a BeautifulSoup object are displayed after stripping.

    from bs4 import BeautifulSoup, NavigableString

    markup = """
    <div id="Languages">
    <p>Java</p> <p>Python</p> <p>C++</p>
    </div>
    """
    soup = BeautifulSoup(markup, 'html.parser')
    print([string for string in soup.stripped_strings])

Output

    ['Java', 'Python', 'C++']

Compared to the output of the strings property, you can see that the line breaks and whitespace are removed.

Example 2

Here we extract the strings of each of the child elements under the <div> tag.

    tag = soup.div
    navstrs = tag.stripped_strings
    for navstr in navstrs:
        print(navstr)

Output

    Java
    Python
    C++
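The difference between strings and stripped_strings is easiest to see side by side. The sketch below reuses the markup from the examples and also shows get_text() with a separator and strip=True, which joins the stripped strings into one str. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = """
<div id="Languages">
<p>Java</p> <p>Python</p> <p>C++</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

raw = list(div.strings)             # includes the whitespace between tags
clean = list(div.stripped_strings)  # whitespace-only strings are skipped
joined = div.get_text(", ", strip=True)  # stripped strings joined by ", "

print(raw)
print(clean)   # ['Java', 'Python', 'C++']
print(joined)  # Java, Python, C++
```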
Beautiful Soup – Parsing XML
Beautiful Soup can also parse an XML document. You need to pass the features="xml" argument to the BeautifulSoup() constructor. Assume that we have the following books.xml in the current working directory:

Example

    <?xml version="1.0" ?>
    <books>
       <book>
          <title>Python</title>
          <author>TutorialsPoint</author>
          <price>400</price>
       </book>
    </books>

The following code parses the given XML file:

    from bs4 import BeautifulSoup

    with open("books.xml") as fp:
        soup = BeautifulSoup(fp, features="xml")

    print(soup)
    print('type:', type(soup))

When the above code is executed, you should get the following result:

    <?xml version="1.0" encoding="utf-8"?>
    <books>
    <book>
    <title>Python</title>
    <author>TutorialsPoint</author>
    <price>400</price>
    </book>
    </books>
    type: <class 'bs4.BeautifulSoup'>

XML parser Error

By default, Beautiful Soup parses documents as HTML. It is also easy to use for XML and copes well with ill-formed markup. To parse a document as XML, you need to have the lxml parser installed, and you pass "xml" (or the equivalent "lxml-xml") as the second argument to the BeautifulSoup constructor:

    soup = BeautifulSoup(markup, "lxml-xml")

or

    soup = BeautifulSoup(markup, "xml")

One common XML parsing error is:

    AttributeError: 'NoneType' object has no attribute 'attrib'

This can happen when an element is missing or not defined, so that a lookup such as find() returns None and the code then accesses an attribute on that result.
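One way to avoid the NoneType error is to check the result of find() before using it. The sketch below assumes the lxml package is installed (it is required for features="xml"). The helper name child_text and the isbn lookup are hypothetical, and the XML is an inline copy of books.xml.

```python
from bs4 import BeautifulSoup

# Inline copy of books.xml so the sketch does not depend on a file.
xml = """<?xml version="1.0" ?>
<books>
  <book>
    <title>Python</title>
    <author>TutorialsPoint</author>
    <price>400</price>
  </book>
</books>"""
soup = BeautifulSoup(xml, features="xml")  # requires lxml


def child_text(parent, name, default=None):
    """Hypothetical helper: text of the first <name> under parent, or default."""
    node = parent.find(name)
    return node.get_text() if node is not None else default


book = soup.find("book")
print(child_text(book, "price"))                # element exists
print(child_text(book, "isbn", default="n/a"))  # element is missing
```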
Beautiful Soup – Convert Object to String

The Beautiful Soup API has three main types of objects: the soup object, the Tag object, and the NavigableString object. Let us find out how we can convert each of these objects to a string. In Python, a string is a str object.

Assume that we have the following HTML document:

    html = """
    <p>Hello <b>World</b></p>
    """

Let us pass this string as the argument to the BeautifulSoup constructor. The soup object is then converted to a string with Python's built-in str() function.

The parsed tree of this HTML string depends on which parser you use. The built-in html.parser doesn't add the <html> and <body> tags.

Example

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    print(str(soup))

Output

    <p>Hello <b>World</b></p>

On the other hand, the html5lib parser constructs the tree after inserting the formal tags such as <html> and <body>.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html5lib')
    print(str(soup))

Output

    <html><head></head><body><p>Hello <b>World</b></p>
    </body></html>

The Tag object has a string property that returns a NavigableString object.

    tag = soup.find('b')
    obj = tag.string
    print(type(obj), obj)

Output

    <class 'bs4.element.NavigableString'> World

There is also a text property defined for the Tag object. It returns the text contained in the tag, stripping off all the inner tags and attributes.

If the HTML string is:

    html = """
    <p>Hello <div id="id">World</div></p>
    """

we can obtain the text property of the <p> tag:

    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('p')
    obj = tag.text
    print(type(obj), obj)

Output

    <class 'str'> Hello World

You can also use the get_text() method, which returns a string representing the text inside the tag. The text property is in fact a shortcut for calling get_text() with no arguments, so both drop the inner tags and attributes and return a str.

    obj = tag.get_text()
    print(type(obj), obj)

Output

    <class 'str'> Hello World
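The conversions described above can be summarised in one short sketch: str() on a tag gives its markup, get_text() gives only the text, and str() on a NavigableString gives a plain str. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>World</b></p>", "html.parser")
p = soup.p

markup = str(p)             # the tag serialized back to markup
text = p.get_text()         # text only, inner tags removed
plain = str(soup.b.string)  # NavigableString converted to a plain str

print(markup)       # <p>Hello <b>World</b></p>
print(text)         # Hello World
print(type(plain))  # <class 'str'>
```

Converting a NavigableString with str() is useful when the string should outlive the parse tree, since the NavigableString itself keeps a reference to the tree it came from.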
Beautiful Soup – Comparing Objects

In Beautiful Soup, two NavigableString or Tag objects are equal if they represent the same HTML/XML markup.

In the example below, the two <b> tags are treated as equal, even though they live in different parts of the object tree, because they both look like "<b>Java</b>". They are still two distinct objects, so the identity test with is returns False.

Example

    from bs4 import BeautifulSoup

    markup = "<p>Learn <i>Python</i>, <b>Java</b>, advanced <i>Python</i> and advanced <b>Java</b>! from Tutorialspoint</p>"
    soup = BeautifulSoup(markup, "html.parser")
    b1 = soup.find('b')
    b2 = b1.find_next('b')
    print(b1 == b2)
    print(b1 is b2)

Output

    True
    False

In the following example, two NavigableString objects are compared.

Example

    from bs4 import BeautifulSoup

    markup = "<p>Learn <i>Python</i>, <b>Java</b>, advanced <i>Python</i> and advanced <b>Java</b>! from Tutorialspoint</p>"
    soup = BeautifulSoup(markup, "html.parser")
    i1 = soup.find('i')
    i2 = i1.find_next('i')
    print(i1.string == i2.string)
    print(i1.string is i2.string)

Output

    True
    False
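Equality takes attributes into account as well as the tag name and contents. The sketch below uses illustrative markup with three <b> tags, two of which share the same class. Only the two with identical markup compare equal.

```python
from bs4 import BeautifulSoup

# Three <b> tags with the same text; the second has a different class.
markup = '<p><b class="x">Java</b><b class="y">Java</b><b class="x">Java</b></p>'
soup = BeautifulSoup(markup, "html.parser")
first, second, third = soup.find_all("b")

same_markup = first == third       # same name, attributes and contents
different_attrs = first == second  # class differs, so not equal
same_object = first is third       # distinct objects in the tree

print(same_markup, different_attrs, same_object)  # True False False
```

Use is when the question is whether two variables point at the same node in the tree, and == when only the markup matters.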