beautiful Soup Archives - Page 8 of 11 - Donotsad where can learn any thing work project and make money

Aug 09

Beautiful Soup – insert_before Method

Method Description

The insert_before() method in Beautiful Soup inserts one or more tags or strings immediately before the current element in the parse tree. Each inserted element becomes a preceding sibling of the current one.

Syntax

    insert_before(*args)

Parameters

args − One or more elements; each may be a tag or a string.

Return Value

insert_before() modifies the parse tree in place; it is called for its side effect rather than for a return value.

Example 1

The following example inserts the text "Here is an " before the <b>Excellent</b> tag in the given HTML markup string.

    from bs4 import BeautifulSoup

    markup = '<b>Excellent</b> Python Tutorial <u>from TutorialsPoint</u>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    tag.insert_before("Here is an ")
    print(soup.prettify())

Output

    Here is an
    <b>
     Excellent
    </b>
    Python Tutorial
    <u>
     from TutorialsPoint
    </u>

Example 2

You can also insert a tag before another tag. Here a new <b> tag is created with new_tag() and placed before the existing one.

    from bs4 import BeautifulSoup

    markup = '<p>Excellent <b>Tutorial</b> from TutorialsPoint</p>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    tag1 = soup.new_tag('b')
    tag1.string = "Python "
    tag.insert_before(tag1)
    print(soup.prettify())

Output

    <p>
     Excellent
     <b>
      Python
     </b>
     <b>
      Tutorial
     </b>
     from TutorialsPoint
    </p>

Example 3

The following code passes more than one string to be inserted before the <b> tag. The strings are inserted in the order given.

    from bs4 import BeautifulSoup

    markup = '<p>There are <b>Tutorials</b> <u>from TutorialsPoint</u></p>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    tag.insert_before("many ", "excellent ")
    print(soup.prettify())

Output

    <p>
     There are
     many
     excellent
     <b>
      Tutorials
     </b>
     <u>
      from TutorialsPoint
     </u>
    </p>
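To make the "immediate predecessor" behaviour concrete, here is a minimal sketch that inserts a new tag and a string in one call and then inspects the result. The markup and the variable names (bold, emphasis) are made up for illustration and do not come from the tutorial.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the tutorial.
soup = BeautifulSoup("<p>Beautiful <b>Soup</b></p>", "html.parser")
bold = soup.b

# Build a new <i> tag, then insert it and a separating space before <b>.
emphasis = soup.new_tag("i")
emphasis.string = "tasty"
bold.insert_before(emphasis, " ")

result = str(soup)
print(result)
# The last inserted element is now the sibling directly before <b>.
print(repr(bold.previous_sibling))
```

Passing several arguments keeps them in the order given, so the space ends up between the new <i> tag and the existing <b> tag.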

Aug 09

Beautiful Soup – Discussion

In this tutorial, we show how to perform web scraping in Python using Beautiful Soup 4 to extract data from HTML, XML and other markup languages. We scrape web pages from several different websites (including IMDB). We cover Beautiful Soup 4 and the basic Python tools for efficiently and clearly navigating, searching and parsing HTML web pages. We have tried to cover almost all the functionality of Beautiful Soup 4 in this tutorial. You can combine several of the features introduced here into one larger program that captures multiple pieces of meaningful data from a website and passes them to another sub-program as input.

Aug 09

Beautiful Soup – next_elements Property

Method Description

In the Beautiful Soup library, the next_elements property returns a generator that yields the strings and tags following the current element in the parse tree, in document order.

Syntax

    Element.next_elements

Return Value

The next_elements property returns a generator.

Example 1

The next_elements property returns the tags and NavigableStrings appearing after the <b> tag in the document string below −

    from bs4 import BeautifulSoup

    html = """
    <p><b>Excellent</b><p>Python</p><p id="id1">Tutorial</p></p>
    """
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('b')
    print("Next elements:")
    for element in tag.next_elements:
        print(element)

Output

    Next elements:
    Excellent
    <p>Python</p>
    Python
    <p id="id1">Tutorial</p>
    Tutorial

Example 2

All the elements appearing after the <p> tag are listed below −

    from bs4 import BeautifulSoup

    html = """
    <p>
    <b>Excellent</b><i>Python</i>
    </p>
    <u>Tutorial</u>
    """
    soup = BeautifulSoup(html, 'html.parser')
    tag1 = soup.find('p')
    print("Next elements:")
    print(list(tag1.next_elements))

Output

    Next elements:
    ['\n', <b>Excellent</b>, 'Excellent', <i>Python</i>, 'Python', '\n', '\n', <u>Tutorial</u>, 'Tutorial', '\n']

Example 3

The elements that follow the first <input> tag in the HTML form of index.html are listed below −

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html5lib')
    tag = soup.find('input')
    print("Next elements:")
    for element in tag.next_elements:
        print(element)

Output

    Next elements:
    <input id="age" name="age" type="text"/>
    <input id="marks" name="marks" type="text"/>
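Because next_elements yields both tags and strings, it is often useful to separate the two. The following is a small sketch under the assumption of a made-up one-line document; the names following, tag_names and texts are illustrative only.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Illustrative one-line document (no whitespace between tags).
html = "<p><b>Excellent</b><i>Python</i></p><u>Tutorial</u>"
soup = BeautifulSoup(html, "html.parser")

start = soup.b
# Materialise the generator so it can be traversed more than once.
following = list(start.next_elements)

# Split the stream into tag names and plain strings.
tag_names = [el.name for el in following if isinstance(el, Tag)]
texts = [str(el) for el in following if isinstance(el, NavigableString)]

print(tag_names)
print(texts)
```

Note that the traversal starts with the content of the <b> tag itself, since its child string comes after the opening tag in document order.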

Aug 09

Beautiful Soup – decompose Method

Method Description

The decompose() method destroys the current element along with its children: the element is removed from the tree, and it and everything beneath it are wiped out. You can check whether an element has been decomposed through the `decomposed` property, which is True if the element has been destroyed and False otherwise.

Syntax

    decompose()

Parameters

No parameters are defined for this method.

Return Type

The method doesn't return any object.

Example 1

When we call the decompose() method on the BeautifulSoup object itself, the entire content is destroyed.

    from bs4 import BeautifulSoup

    html = """
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html, "html.parser")
    soup.decompose()
    print("decomposed:", soup.decomposed)
    print("document:", soup)

Output

    decomposed: True
    document: Traceback (most recent call last):
      ...
    TypeError: can only concatenate str (not "NoneType") to str

Since the soup object has been decomposed, the decomposed property returns True. Trying to print the destroyed object, however, raises the TypeError shown above in the version used here.

Example 2

The code below uses the decompose() method to remove all occurrences of the <p> tag in the HTML string.

    from bs4 import BeautifulSoup

    html = """
    <html>
    <body>
    <p>The quick, brown fox jumps over a lazy dog.</p>
    <p>DJs flock by when MTV ax quiz prog.</p>
    <p>Junk MTV quiz graced by fox whelps.</p>
    <p>Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all('p'):
        p.decompose()
    print("document:", soup)

Output

The rest of the HTML document, after removing all <p> tags, is printed (blank lines omitted).

    document:
    <html>
    <body>
    </body>
    </html>

Example 3

Here, we find the <body> tag in the HTML document tree and decompose the previous tag, which happens to be the <title> tag. The resultant document tree omits the <title> tag.

    from bs4 import BeautifulSoup

    html = """
    <html>
    <head>
    <title>TutorialsPoint</title>
    </head>
    <body>
    Hello World
    </body>
    </html>
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.body
    tag.find_previous().decompose()
    print("document:", soup)

Output

    document:
    <html>
    <head>
    </head>
    <body>
    Hello World
    </body>
    </html>
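A common use of decompose() is stripping unwanted elements before further processing. The sketch below assumes a made-up fragment containing a <script> tag; it removes that tag and then checks the decomposed property on the removed element. It assumes a Beautiful Soup release recent enough to provide that property.

```python
from bs4 import BeautifulSoup

# Illustrative fragment, not taken from the tutorial.
html = "<div><p>Keep me</p><script>track()</script><p>Keep me too</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Keep references so the elements can be inspected after removal.
scripts = soup.find_all("script")
for script in scripts:
    script.decompose()

remaining = str(soup)
print(remaining)
print(all(script.decomposed for script in scripts))
```

A plain for loop is used rather than a list comprehension because the call is made only for its side effect.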

Aug 09

Beautiful Soup – insert_after Method

Method Description

The insert_after() method in Beautiful Soup inserts one or more tags or strings immediately after the current element in the parse tree. Each inserted element becomes a following sibling of the current one.

Syntax

    insert_after(*args)

Parameters

args − One or more elements; each may be a tag or a string.

Return Value

insert_after() modifies the parse tree in place; it is called for its side effect rather than for a return value.

Example 1

The following code inserts the string "Python " after the first <b> tag.

    from bs4 import BeautifulSoup

    markup = '<p>An <b>Excellent</b> Tutorial <u>from TutorialsPoint</u></p>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    tag.insert_after("Python ")
    print(soup.prettify())

Output

    <p>
     An
     <b>
      Excellent
     </b>
     Python
     Tutorial
     <u>
      from TutorialsPoint
     </u>
    </p>

Example 2

You can also insert a tag after another tag. Take a look at this example.

    from bs4 import BeautifulSoup

    markup = '<p>Excellent <b>Tutorial</b> from TutorialsPoint</p>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.b
    tag1 = soup.new_tag('b')
    tag1.string = "on Python "
    tag.insert_after(tag1)
    print(soup.prettify())

Output

    <p>
     Excellent
     <b>
      Tutorial
     </b>
     <b>
      on Python
     </b>
     from TutorialsPoint
    </p>

Example 3

Multiple tags or strings can be inserted after a given tag. Here they are inserted after the <p> tag, so they end up outside it.

    from bs4 import BeautifulSoup

    markup = '<p>Excellent <b>Tutorials</b> from TutorialsPoint</p>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.p
    tag1 = soup.new_tag('i')
    tag1.string = 'and Java'
    tag.insert_after("on Python", tag1)
    print(soup.prettify())

Output

    <p>
     Excellent
     <b>
      Tutorials
     </b>
     from TutorialsPoint
    </p>
    on Python
    <i>
     and Java
    </i>
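As a compact check of the "immediate successor" behaviour, this sketch inserts two strings after a tag in a single call and inspects the siblings afterwards. The markup and variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the tutorial.
soup = BeautifulSoup("<p><b>Beautiful</b></p>", "html.parser")
bold = soup.b

# Insert a space and a word after <b>; the argument order is preserved.
bold.insert_after(" ", "Soup")

result = str(soup)
print(result)
# The first inserted element is now the sibling directly after <b>.
print(repr(bold.next_sibling))
```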

Aug 09

Beautiful Soup – find_parent Method

Method Description

The find_parent() method in the BeautifulSoup package finds the closest parent of this PageElement that matches the given criteria.

Syntax

    find_parent(name, attrs, **kwargs)

Parameters

name − A filter on tag name.

attrs − A dictionary of filters on attribute values.

kwargs − Filters on attribute values, given as keyword arguments.

Return Type

The find_parent() method returns a Tag object, or None if no parent matches.

Example 1

We shall use the following HTML in this example −

    html = """
    <html>
    <body>
    <h2>Departmentwise Employees</h2>
    <ul id="dept">
    <li>Accounts</li>
    <ul id="acc">
    <li>Anand</li>
    <li>Mahesh</li>
    </ul>
    <li>HR</li>
    <ol id="HR">
    <li>Rani</li>
    <li>Ankita</li>
    </ol>
    </ul>
    </body>
    </html>
    """

In the following example, we find the name of the tag that is the parent of the string 'HR'.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    obj = soup.find(string='HR')
    print(obj.find_parent().name)

Output

    li

Example 2

The <body> tag is always enclosed within the top-level <html> tag. In the following example, we confirm this with the find_parent() method −

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    obj = soup.find('body')
    print(obj.find_parent().name)

Output

    html
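The examples above call find_parent() without arguments. The sketch below shows how the name and attribute filters narrow the search to a specific ancestor. It uses a trimmed-down version of the department list; the variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the department list, for illustration.
html = """
<ul id="dept">
  <li>Accounts</li>
  <ul id="acc">
    <li>Anand</li>
    <li>Mahesh</li>
  </ul>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
employee = soup.find(string="Anand")

# Closest enclosing <ul>, skipping the immediate <li> parent.
nearest_list = employee.find_parent("ul")
# Closest enclosing <ul> that also has id="dept".
outer_list = employee.find_parent("ul", id="dept")
# No <table> ancestor exists, so the result is None.
missing = employee.find_parent("table")

print(nearest_list["id"], outer_list["id"], missing)
```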

Aug 09

Beautiful Soup – find_all_previous Method

Method Description

The find_all_previous() method in Beautiful Soup looks backwards in the document from this PageElement and finds all the PageElements that match the given criteria and appear before the current element. It returns a ResultSet of the PageElements that come before the current tag in the document. Like all other find methods, this method has the following syntax −

Syntax

    find_all_previous(name, attrs, string, limit, **kwargs)

Parameters

name − A filter on tag name.

attrs − A dictionary of filters on attribute values.

string − A filter for a NavigableString with specific text.

limit − Stop looking after finding this many results.

kwargs − Filters on attribute values, given as keyword arguments.

Return Value

The find_all_previous() method returns a ResultSet of Tag or NavigableString objects. If the limit parameter is 1, the ResultSet holds at most one element, the same one that the find_previous() method would return.

Example 1

In this example, the name property of each tag that appears before the first input tag is displayed.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    tag = soup.find('input')
    for t in tag.find_all_previous():
        print(t.name)

Output

    form
    h1
    body
    title
    head
    html

Example 2

In the HTML document under consideration (index.html), there are three input elements. With the following code, we print the names of all tags that precede the <input> tag whose name attribute is "marks". To differentiate between the two input tags before it, we also print the attrs property. Note that the other tags don't have any attributes.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    tag = soup.find('input', {'name': 'marks'})
    pretags = tag.find_all_previous()
    for pretag in pretags:
        print(pretag.name, pretag.attrs)

Output

    input {'type': 'text', 'id': 'age', 'name': 'age'}
    input {'type': 'text', 'id': 'nm', 'name': 'name'}
    form {}
    h1 {}
    body {}
    title {}
    head {}
    html {}

Example 3

The BeautifulSoup object stores the entire document's tree. It doesn't have any previous element, as the example below shows −

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    tags = soup.find_all_previous()
    print(tags)

Output

    []
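The examples above read index.html from disk and call find_all_previous() without filters. The following self-contained sketch inlines an equivalent document as a string (an assumption made so that it runs without the file) and shows the name and limit parameters in use.

```python
from bs4 import BeautifulSoup

# Inline stand-in for index.html so the sketch runs without a file.
html = """<html><head><title>TutorialsPoint</title></head>
<body><h1>TutorialsPoint</h1>
<form>
<input type="text" id="nm" name="name">
<input type="text" id="age" name="age">
<input type="text" id="marks" name="marks">
</form></body></html>"""
soup = BeautifulSoup(html, "html.parser")
marks = soup.find("input", {"name": "marks"})

# Only the <input> tags that come before the "marks" field, nearest first.
earlier_inputs = marks.find_all_previous("input")
ids = [tag["id"] for tag in earlier_inputs]

# limit=1 yields a one-element ResultSet holding the nearest match.
nearest = marks.find_all_previous("input", limit=1)

print(ids)
print(nearest[0] is marks.find_previous("input"))
```

Results are ordered from the nearest preceding element backwards, which is why "age" comes before "nm".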

Aug 09

Beautiful Soup – find_next Method

Method Description

The find_next() method in Beautiful Soup finds the first PageElement that matches the given criteria and appears later in the document. It returns the first tag or NavigableString that comes after the current tag in the document. Like all other find methods, this method has the following syntax −

Syntax

    find_next(name, attrs, string, **kwargs)

Parameters

name − A filter on tag name.

attrs − A dictionary of filters on attribute values.

string − A filter for a NavigableString with specific text.

kwargs − Filters on attribute values, given as keyword arguments.

Return Value

The find_next() method returns a Tag or a NavigableString.

Example 1

A web page index.html with the following markup is used for this example −

    <html>
    <head>
    <title>TutorialsPoint</title>
    </head>
    <body>
    <h1>TutorialsPoint</h1>
    <form>
    <input type="text" id="nm" name="name">
    <input type="text" id="age" name="age">
    <input type="text" id="marks" name="marks">
    </form>
    </body>
    </html>

We first locate the <h1> tag and then the tag that follows it, which is the <form> tag.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    tag = soup.h1
    print(tag.find_next())

Output

    <form>
    <input id="nm" name="name" type="text"/>
    <input id="age" name="age" type="text"/>
    <input id="marks" name="marks" type="text"/>
    </form>

Example 2

In this example, we first locate the <input> tag whose name attribute is "age" and obtain the next tag.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    tag = soup.find('input', {'name': 'age'})
    print(tag.find_next())

Output

    <input id="marks" name="marks" type="text"/>

Example 3

The tag next to the <head> tag happens to be the <title> tag.

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    tag = soup.head
    print(tag.find_next())

Output

    <title>TutorialsPoint</title>
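The examples above call find_next() without filters. This sketch passes a tag name so that intermediate tags are skipped, and shows what happens when nothing further matches. The document is inlined as a string (an assumption, so the code runs without index.html).

```python
from bs4 import BeautifulSoup

# Inline stand-in for index.html so the sketch runs without a file.
html = """<html><head><title>TutorialsPoint</title></head>
<body><h1>TutorialsPoint</h1>
<form>
<input type="text" id="nm" name="name">
<input type="text" id="age" name="age">
<input type="text" id="marks" name="marks">
</form></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# From <h1>, skip over <form> and land on the first <input>.
first_input = soup.h1.find_next("input")

# From the "age" field, the next <input> is the "marks" field.
age = soup.find("input", {"name": "age"})
after_age = age.find_next("input")

# Nothing matches after the last <input>, so the result is None.
after_last = after_age.find_next("input")

print(first_input["name"], after_age["id"], after_last)
```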

Aug 09

Beautiful Soup – decode Method

Method Description

The decode() method in Beautiful Soup returns a Unicode string representation of the parse tree as an HTML or XML document. Its function is the opposite of that of the encode() method: you call encode() to get a bytestring, and decode() to get Unicode. Let us study the decode() method with an example.

Syntax

    decode(pretty_print, encoding, formatter, errors)

Parameters

pretty_print − If this is True, indentation will be used to make the document more readable.

encoding − The encoding of the final document. If this is None, the document will be a Unicode string.

formatter − A Formatter object, or a string naming one of the standard formatters.

errors − The error handling scheme to use for decoding errors. Values are 'strict', 'ignore' and 'replace'.

Parameter names differ between Beautiful Soup releases, so check the signature of decode() in your installed version before passing these by keyword.

Return Value

The decode() method returns a Unicode string.

Example

The following code encodes the document to UTF-8 bytes and decodes those bytes back to text. Note that enc is a bytes object, so enc.decode() here is Python's own bytes.decode() method; calling soup.decode() directly gives the same Unicode string.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("Hello “World!”", 'html.parser')
    enc = soup.encode('utf-8')
    print(enc)
    dec = enc.decode()
    print(dec)

Output

    b'Hello \xe2\x80\x9cWorld!\xe2\x80\x9d'
    Hello “World!”
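To show the relationship between encode() and decode() on the soup object itself, here is a minimal sketch that calls both with default arguments and compares the results. The markup and variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Illustrative markup containing non-ASCII (curly) quotation marks.
soup = BeautifulSoup("<p>Hello “World!”</p>", "html.parser")

# encode() serialises the tree to bytes in the requested encoding.
as_bytes = soup.encode("utf-8")
# decode() serialises the same tree to a Unicode string.
as_text = soup.decode()

print(type(as_bytes).__name__, type(as_text).__name__)
# Decoding the bytes with the same codec gives back the same text.
print(as_bytes.decode("utf-8") == as_text)
```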

Aug 09

Beautiful Soup – contents Property

Method Description

The contents property is available on the BeautifulSoup object as well as on Tag objects. It returns a list of everything contained directly inside the object: all the immediate child elements and text nodes (i.e. NavigableString objects).

Syntax

    Tag.contents

Return Value

The contents property returns a list of the child elements and strings in the Tag or BeautifulSoup object.

Example 1

Contents of a tag object −

    from bs4 import BeautifulSoup

    markup = """
    <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>
    """
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.div
    print(tag.contents)

Output

    ['\n', <p>Java</p>, '\n', <p>Python</p>, '\n', <p>C++</p>, '\n']

Example 2

Contents of the entire document −

    from bs4 import BeautifulSoup

    markup = """
    <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>
    """
    soup = BeautifulSoup(markup, 'html.parser')
    print(soup.contents)

Output

    ['\n', <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>, '\n']

Example 3

Note that a NavigableString object doesn't have a contents property. Trying to access it raises an AttributeError.

    from bs4 import BeautifulSoup

    markup = """
    <div id="Languages">
    <p>Java</p>
    <p>Python</p>
    <p>C++</p>
    </div>
    """
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.p
    s = tag.contents[0]
    print(s.contents)

Output

    Traceback (most recent call last):
      ...
    AttributeError: 'NavigableString' object has no attribute 'contents'
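Since contents mixes tags with whitespace strings, a frequent next step is to keep only the tag children. The sketch below does this with an isinstance check; the variable names are illustrative, and the markup is a one-string version of the language list.

```python
from bs4 import BeautifulSoup, Tag

# One-string version of the language list used in this section.
markup = '<div id="Languages">\n<p>Java</p>\n<p>Python</p>\n<p>C++</p>\n</div>'
soup = BeautifulSoup(markup, "html.parser")

children = soup.div.contents
# Keep only Tag children, dropping the newline strings between them.
tags_only = [child for child in children if isinstance(child, Tag)]
names = [tag.get_text() for tag in tags_only]

print(len(children), len(tags_only))
print(names)
```

The same check avoids the AttributeError from Example 3, because contents is only accessed on objects known to be tags.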