Beautiful Soup – Parsing a Section of a Document

Let's say you want to use Beautiful Soup to look at a document's <a> tags only. Normally you would parse the whole tree and call the find_all() method with the required tag as the argument:

```python
soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all("a")
```

But that is time consuming, and it takes up more memory than necessary. Instead, you can create an object of the SoupStrainer class and use it as the value of the parse_only argument to the BeautifulSoup constructor. A SoupStrainer tells BeautifulSoup which parts to extract, so the parse tree consists of only those elements. If you can narrow the required information down to a specific portion of the HTML, this will speed up your searches.

```python
product = SoupStrainer("div", {"id": "products_list"})
soup = BeautifulSoup(html, "html.parser", parse_only=product)
```

The above lines of code will parse only the <div> element whose id is products_list, ignoring the rest of the document. Similarly, we can use other SoupStrainer objects to parse specific information from an HTML document. Below are some examples −

Example

```python
from bs4 import BeautifulSoup, SoupStrainer

# Only "a" tags
only_a_tags = SoupStrainer("a")

# Will parse only the elements with the below mentioned "ids"
parse_only = SoupStrainer(id=["first", "third", "my_unique_id"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

# Parse only strings whose length is less than 10
def is_short_string(string):
   return len(string) < 10

only_short_strings = SoupStrainer(string=is_short_string)
```

The SoupStrainer class takes the same arguments as a typical method from Searching the tree: name, attrs, string, and **kwargs. Note that this feature won't work if you're using the html5lib parser, because in that case the whole document will be parsed, no matter what. Hence, you should use either the built-in html.parser or the lxml parser.

You can also pass a SoupStrainer into any of the methods covered in Searching the tree.
```python
from bs4 import BeautifulSoup, SoupStrainer

a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser")
soup.find_all(a_tags)
```
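As a hands-on illustration (the HTML snippet and the products_list id below are made up for the demo), the following sketch shows that a tree built with a SoupStrainer simply never contains the filtered-out elements:

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = """
<html><body>
<p>Intro text</p>
<a href="/one">One</a>
<a href="/two">Two</a>
<div id="products_list"><span>Widget</span></div>
</body></html>
"""

# Build a tree containing only the <a> elements
only_a = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a)

links = [tag["href"] for tag in soup.find_all("a")]
print(links)            # ['/one', '/two']
print(soup.find("p"))   # None - the <p> tag was never added to the tree
```

Because the <p> and <div> elements were discarded at parse time, searching for them returns nothing, and the tree itself is smaller in memory.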

Beautiful Soup – Get Text Inside Tag

There are two types of tags in HTML. Many tags come in pairs of opening and closing counterparts. The top level <html> tag, with its corresponding closing </html> tag, is the main example. Others are <body> and </body>, <p> and </p>, <h1> and </h1>, and many more. Other tags are self-closing, such as <img> and <br>. Self-closing tags don't enclose text the way paired tags do (such as <b>Hello</b>). In this chapter, we shall look at how to get the text inside such tags with the help of the Beautiful Soup library.

There is more than one method/property available in Beautiful Soup with which we can fetch the text associated with a tag object −

1. text property − Gets all child strings of a PageElement, concatenated using a separator if specified.
2. string property − Convenience property to get the single string within a tag.
3. strings property − Yields string parts from all the child objects under the current PageElement.
4. stripped_strings property − Same as the strings property, with line breaks and extra whitespace removed.
5. get_text() method − Returns all child strings of this PageElement, concatenated using a separator if specified.

Consider the following HTML document −

```html
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src="logo.jpg">
   </div>
</div>
```

If we retrieve the stripped_strings property of each tag in the parsed document tree, we will find that the two div tags and the p tag contain two NavigableString objects, Hello and World. The <b> tag embeds the World string, while <img> doesn't have a text part.
The following example fetches the text from each of the tags in the given HTML document −

Example

```python
html = """
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src="logo.jpg">
   </div>
</div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all():
   print("Tag: {} attributes: {}".format(tag.name, tag.attrs))
   for txt in tag.stripped_strings:
      print(txt)
   print()
```

Output

Tag: div attributes: {'id': 'outer'}
Hello
World

Tag: div attributes: {'id': 'inner'}
Hello
World

Tag: p attributes: {}
Hello
World

Tag: b attributes: {}
World

Tag: img attributes: {'src': 'logo.jpg'}
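To see how these properties differ in practice, here is a minimal sketch (using a tiny made-up fragment) comparing the text, string, and get_text() accessors:

```python
from bs4 import BeautifulSoup

html = "<p>Hello<b>World</b></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.text)                     # HelloWorld  - all child strings joined
print(p.string)                   # None        - <p> has more than one child
print(p.b.string)                 # World       - <b> has exactly one string
print(p.get_text(separator=" "))  # Hello World - custom separator
```

The string property is the strictest: it returns a value only when the tag contains exactly one string child, which is why it yields None for the <p> tag here.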

Beautiful Soup – Extract Title Tag

The <title> tag provides a text caption for the page, shown in the browser's title bar. It is not a part of the main content of the web page. The title tag is always present inside the <head> tag. We can extract the contents of the title tag with Beautiful Soup: we parse the HTML tree and obtain the title Tag object.

Example

```python
html = """
<html>
   <head>
      <title>Python Libraries</title>
   </head>
   <body>
      <p>Hello World</p>
   </body>
</html>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html5lib")
title = soup.title
print(title)
```

Output

<title>Python Libraries</title>

In HTML, the title attribute can be used with any tag. The title attribute gives additional information about an element; it typically works as tooltip text when the mouse hovers over the element. We can extract the text of the title attribute of each tag with the following code snippet −

Example

```python
html = """
<html>
   <body>
      <p title="parsing HTML and XML">Beautiful Soup</p>
      <p title="HTTP library">requests</p>
      <p title="URL handling">urllib</p>
   </body>
</html>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html5lib")
tags = soup.find_all()
for tag in tags:
   if tag.has_attr("title"):
      print(tag.attrs["title"])
```

Output

parsing HTML and XML
HTTP library
URL handling
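If only the caption text is needed, rather than the whole tag, the string property of the title Tag object gives it directly. A minimal sketch −

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Python Libraries</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .string returns just the caption text inside <title>
print(soup.title.string)   # Python Libraries
print(soup.title.name)     # title
```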

Beautiful Soup – diagnose Method

Method Description

The diagnose() method in Beautiful Soup is a diagnostic suite for isolating common problems. If you're facing difficulty in understanding what Beautiful Soup is doing to a document, pass the document as an argument to the diagnose() function. A report shows you how the different parsers handle the document, and tells you if you're missing a parser.

Syntax

diagnose(data)

Parameters

data − the document string.

Return Value

The diagnose() method prints the result of parsing the given document with each of the available parsers.

Example

Let us take this simple document for our exercise −

```html
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a>
<i>Tutorial</i><p>
```

The following code runs the diagnostics on the above HTML script −

```python
markup = """
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a>
<i>Tutorial</i><p>
"""
from bs4.diagnose import diagnose
diagnose(markup)
```

The diagnose() output starts with a message showing which parsers are available −

Diagnostic running on Beautiful Soup 4.12.2
Python version 3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
Found lxml version 4.9.2.0
Found html5lib version 1.1

If the document to be diagnosed is well-formed HTML, the results from all the parsers are just about similar. However, in our example there are many errors. To begin with, the built-in html.parser is taken up. The report will be as follows −

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<h1>
 Hello World
 <b>
  Welcome
 </b>
 <p>
  <b>
   Beautiful Soup
   <i>
    Tutorial
   </i>
   <p>
   </p>
  </b>
 </p>
</h1>

You can see that Python's built-in parser doesn't insert the <html> and <body> tags. The unclosed <h1> tag is provided with a matching </h1> at the end. Both the html5lib and lxml parsers complete the document by wrapping it in <html>, <head> and <body> tags.
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
 </head>
 <body>
  <h1>
   Hello World
   <b>
    Welcome
   </b>
   <p>
    <b>
     Beautiful Soup
     <i>
      Tutorial
     </i>
    </b>
   </p>
   <p>
    <b>
    </b>
   </p>
  </h1>
 </body>
</html>

With the lxml parser, note where the closing </h1> is inserted. Also, the incomplete <b> tag is rectified, and the dangling </a> is removed.

Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <body>
  <h1>
   Hello World
   <b>
    Welcome
   </b>
  </h1>
  <p>
   <b>
    Beautiful Soup
    <i>
     Tutorial
    </i>
   </b>
  </p>
  <p>
  </p>
 </body>
</html>

The diagnose() method also parses the document as an XML document, which is probably superfluous in our case.

Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<h1>
 Hello World
 <b>
  Welcome
 </b>
 <P>
  <b>
   Beautiful Soup
  </b>
  <i>
   Tutorial
  </i>
  <p/>
 </P>
</h1>

Let us give the diagnose() method an XML document instead of an HTML document.

```xml
<?xml version="1.0" ?>
<books>
   <book>
      <title>Python</title>
      <author>TutorialsPoint</author>
      <price>400</price>
   </book>
</books>
```

Now if we run the diagnostics, even though the input is XML, the HTML parsers are applied.

Trying to parse your markup with html.parser

Warning (from warnings module):
  File "...\bs4\builder\__init__.py", line 545
    warnings.warn(
XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

With html.parser, a warning message is displayed.
With html5lib, the first line, which contains the XML version information, is commented out, and the rest of the document is parsed as if it were an HTML document.

Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<!--?xml version="1.0" ?-->
<html>
 <head>
 </head>
 <body>
  <books>
   <book>
    <title>
     Python
    </title>
    <author>
     TutorialsPoint
    </author>
    <price>
     400
    </price>
   </book>
  </books>
 </body>
</html>

The lxml HTML parser doesn't insert the comment, but it also parses the document as HTML.

Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" ?>
<html>
 <body>
  <books>
   <book>
    <title>
     Python
    </title>
    <author>
     TutorialsPoint
    </author>
    <price>
     400
    </price>
   </book>
  </books>
 </body>
</html>

The lxml-xml parser parses the document as XML.

Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" ?>
<books>
 <book>
  <title>
   Python
  </title>
  <author>
   TutorialsPoint
  </author>
  <price>
   400
  </price>
 </book>
</books>

The diagnostics report may prove useful in finding errors in HTML/XML documents.

Beautiful Soup – Find all Children of an Element

The structure of tags in an HTML script is hierarchical. The elements are nested one inside the other. For example, the top level <html> tag includes the <head> and <body> tags, and each may have other tags in it. A containing element is called the parent. The elements nested inside the parent are its children. With the help of Beautiful Soup, we can find all the children of a parent element. In this chapter, we shall find out how to obtain the children of an HTML element.

There are two provisions in the BeautifulSoup class to fetch the children elements −

The .children property
The findChildren() method

The examples in this chapter use the following HTML script (index.html) −

```html
<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul id="dept">
         <li>Accounts</li>
         <ul id="acc">
            <li>Anand</li>
            <li>Mahesh</li>
         </ul>
         <li>HR</li>
         <ul id="HR">
            <li>Rani</li>
            <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>
```

Using the .children property

The .children property of a Tag object returns a generator of the tag's direct children (use .descendants if you need the nested elements recursively). The following Python code gives a list of all the children of the top level <ul> tag. We first obtain the Tag object corresponding to the <ul> tag, and then read its .children property.

Example

```python
from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")

tag = soup.ul
print(list(tag.children))
```

Output

['\n', <li>Accounts</li>, '\n', <ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>, '\n', <li>HR</li>, '\n', <ul id="HR">
<li>Rani</li>
<li>Ankita</li>
</ul>, '\n']

Since the .children property returns an iterator, we can use a for loop to traverse the children.

```python
for child in tag.children:
   print(child)
```

Output

<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul id="HR">
<li>Rani</li>
<li>Ankita</li>
</ul>

Using the findChildren() method

The findChildren() method offers a more comprehensive alternative.
It returns all the elements below a given tag, descending recursively. In the index.html document, we have two nested unordered lists. The top level <ul> element has id="dept", and the two enclosed lists have id="acc" and id="HR" respectively. In the following example, we first obtain the Tag object pointing to the top level <ul> element and extract the list of children under it.

```python
from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

tag = soup.find("ul", {"id": "dept"})
children = tag.findChildren()
for child in children:
   print(child)
```

Note that the result set includes the children under an element in a recursive fashion. Hence, in the following output, you'll find the entire inner list, followed by the individual elements in it.

<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>Anand</li>
<li>Mahesh</li>
<li>HR</li>
<ul id="HR">
<li>Rani</li>
<li>Ankita</li>
</ul>
<li>Rani</li>
<li>Ankita</li>

Let us extract the children under an inner <ul> element with id="acc". Here is the code −

Example

```python
from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

tag = soup.find("ul", {"id": "acc"})
children = tag.findChildren()
for child in children:
   print(child)
```

When the above program is run, you'll obtain the <li> elements under the <ul> with id "acc".

Output

<li>Anand</li>
<li>Mahesh</li>

Thus, BeautifulSoup makes it very easy to parse the children elements under any top level HTML element.
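The difference between the two provisions can be summarized in a small self-contained sketch (the nested-list markup below is a trimmed version of index.html, inlined so the snippet runs on its own):

```python
from bs4 import BeautifulSoup, Tag

html = """
<ul id="dept">
   <li>Accounts</li>
   <ul id="acc"><li>Anand</li><li>Mahesh</li></ul>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
dept = soup.find("ul", {"id": "dept"})

# .children yields direct children only (including whitespace strings,
# which we skip with an isinstance check)
direct = [c.name for c in dept.children if isinstance(c, Tag)]
print(direct)     # ['li', 'ul']

# findChildren() descends recursively into the nested list
nested = [t.name for t in dept.findChildren()]
print(nested)     # ['li', 'ul', 'li', 'li']
```

The inner <li> elements appear only in the findChildren() result, confirming that .children stays at one level while findChildren() walks the whole subtree.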

Beautiful Soup – Find Element using CSS Selectors

In the Beautiful Soup library, the select() method is an important tool for scraping an HTML/XML document. Like find() and the other find_*() methods, the select() method also helps in locating an element that satisfies given criteria. However, while the find_*() methods search for PageElements by tag name and attributes, the select() method searches the document tree for the given CSS selector.

Beautiful Soup also has a select_one() method. The difference between select() and select_one() is that select() returns a ResultSet of all the elements matched by the CSS selector, whereas select_one() returns the first occurrence of an element satisfying the CSS selector based selection criteria.

Prior to Beautiful Soup version 4.7, the select() method supported only the common CSS selectors. With version 4.7, Beautiful Soup was integrated with the Soup Sieve CSS selector library. As a result, many more selectors can now be used. In version 4.12, a .css property was added in addition to the existing convenience methods select() and select_one().

The parameters for the select() method are as follows −

select(selector, limit, **kwargs)

selector − A string containing a CSS selector.
limit − After finding this number of results, stop looking.
kwargs − Keyword arguments to be passed.

If the limit parameter is set to 1, it becomes equivalent to the select_one() method. While the select() method returns a ResultSet of Tag objects, the select_one() method returns a single Tag object.

Soup Sieve Library

Soup Sieve is a CSS selector library. It has been integrated with Beautiful Soup 4, so it is installed along with the Beautiful Soup package. It provides the ability to select, match, and filter the document tree tags using modern CSS selectors.
Soup Sieve currently implements most of the CSS selectors from the CSS level 1 specification up to CSS level 4, except for a few that are not yet implemented. The Soup Sieve library supports different types of CSS selectors. The basic CSS selectors are −

Type selector

Matching elements is done by node name. For example −

tags = soup.select("div")

Example

```python
from bs4 import BeautifulSoup

markup = """
<div id="Languages">
   <p>Java</p>
   <p>Python</p>
   <p>C++</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")
tags = soup.select("div")
print(tags)
```

Output

[<div id="Languages">
<p>Java</p>
<p>Python</p>
<p>C++</p>
</div>]

Universal selector (*)

It matches elements of any type. Example −

tags = soup.select("*")

ID selector

It matches an element based on its id attribute. The symbol # denotes the ID selector. Example −

tags = soup.select("#nm")

Example

```python
from bs4 import BeautifulSoup

html = """
<form>
   <input type="text" id="nm" name="name">
   <input type="text" id="age" name="age">
   <input type="text" id="marks" name="marks">
</form>
"""
soup = BeautifulSoup(html, "html.parser")
obj = soup.select("#nm")
print(obj)
```

Output

[<input id="nm" name="name" type="text"/>]

Class selector

It matches an element based on the values contained in its class attribute. The . symbol prefixed to the class name is the CSS class selector. Example −

tags = soup.select(".submenu")

Example

```python
from bs4 import BeautifulSoup

markup = """
<div id="Languages">
   <p class="lang">Java</p>
   <p class="lang">Python</p>
   <p>C++</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")
tags = soup.select(".lang")
print(tags)
```

Output

[<p class="lang">Java</p>, <p class="lang">Python</p>]

Attribute Selectors

The attribute selector matches an element based on its attributes.
soup.select("[attr]")

Example

```python
from bs4 import BeautifulSoup

html = """
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
"""
soup = BeautifulSoup(html, "html5lib")
print(soup.select("[href]"))
```

Output

[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>, <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>]

Pseudo Classes

The CSS specification defines a number of pseudo classes. A pseudo-class is a keyword added to a selector to define a special state of the selected elements. It adds an effect to the existing elements. For example, :link selects a link (every <a> and <area> element with an href attribute) that has not yet been visited. The pseudo-class selectors nth-of-type and nth-child are very widely used.

:nth-of-type()

The selector :nth-of-type() matches elements of a given type based on their position among a group of siblings. The keywords even and odd select elements whose position among siblings of the same type is even or odd, respectively. In the following example, the second element of <p> type is selected.

Example

```python
from bs4 import BeautifulSoup

html = """
<p id="0"></p>
<p id="1"></p>
<span id="2"></span>
<span id="3"></span>
"""
soup = BeautifulSoup(html, "html5lib")
print(soup.select("p:nth-of-type(2)"))
```

Output

[<p id="1"></p>]

:nth-child()

This selector matches elements based on their position in a group of siblings. The keywords even and odd select elements whose position is even or odd amongst a group of siblings, respectively.
Usage

:nth-child(even)
:nth-child(odd)
:nth-child(2)

Example

```python
from bs4 import BeautifulSoup

markup = """
<div id="Languages">
   <p>Java</p>
   <p>Python</p>
   <p>C++</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")
tag = soup.div
child = tag.select_one(":nth-child(2)")
print(child)
```

Output

<p>Python</p>
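The following self-contained sketch (with made-up list markup) illustrates the limit parameter of select() and the behaviour of select_one(), including its None return value when nothing matches:

```python
from bs4 import BeautifulSoup

html = """
<ul>
   <li class="item">One</li>
   <li class="item">Two</li>
   <li class="item">Three</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns every matching element; limit stops the search early
print([li.string for li in soup.select("li.item")])           # ['One', 'Two', 'Three']
print([li.string for li in soup.select("li.item", limit=2)])  # ['One', 'Two']

# select_one() returns only the first match, or None if there is none
print(soup.select_one("li.item").string)   # One
print(soup.select_one("li.absent"))        # None
```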

Beautiful Soup – Find all Comments

Inserting comments in computer code is considered good programming practice. Comments are helpful for understanding the logic of the program, and they also serve as documentation. You can put comments in an HTML as well as an XML script, just as in a program written in C, Java, Python, etc. The BeautifulSoup API can help to identify all the comments in an HTML document.

In HTML and XML, the comment text is written between <!-- and --> markers.

<!-- Comment Text -->

The BeautifulSoup package, whose internal name is bs4, defines Comment as an important object. The Comment object is a special type of NavigableString object. Hence, the string property of any tag whose content is found between <!-- and --> is recognized as a Comment.

Example

```python
from bs4 import BeautifulSoup

markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
print(comment, type(comment))
```

Output

This is a comment text in HTML <class 'bs4.element.Comment'>

To search for all the occurrences of comments in an HTML document, we shall use the find_all() method. Without any argument, find_all() returns all the elements in the parsed HTML document. You can pass the keyword argument string to the find_all() method; we shall assign to it a function named iscomment().

comments = soup.find_all(string=iscomment)

The iscomment() function verifies whether the text in a tag is a Comment object or not, with the help of the isinstance() function.

```python
def iscomment(elem):
   return isinstance(elem, Comment)
```

The comments variable shall store all the comment text occurrences in the given HTML document.
We shall use the following index.html file in the example code −

```html
<html>
   <head>
      <!-- Title of document -->
      <title>TutorialsPoint</title>
   </head>
   <body>
      <!-- Page heading -->
      <h2>Departmentwise Employees</h2>
      <!-- top level list-->
      <ul id="dept">
         <li>Accounts</li>
         <ul id="acc">
            <!-- first inner list -->
            <li>Anand</li>
            <li>Mahesh</li>
         </ul>
         <li>HR</li>
         <ul id="HR">
            <!-- second inner list -->
            <li>Rani</li>
            <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>
```

The following Python program scrapes the above HTML document and finds all the comments in it.

Example

```python
from bs4 import BeautifulSoup, Comment

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

def iscomment(elem):
   return isinstance(elem, Comment)

comments = soup.find_all(string=iscomment)
print(comments)
```

Output

[' Title of document ', ' Page heading ', ' top level list', ' first inner list ', ' second inner list ']

The above output shows a list of all the comments. We can also use a for loop over the collection of comments.

Example

```python
i = 0
for comment in comments:
   i += 1
   print(i, ".", comment)
```

Output

1 . Title of document
2 . Page heading
3 . top level list
4 . first inner list
5 . second inner list

In this chapter, we learned how to extract all the comment strings in an HTML document.
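A named predicate is not mandatory; a lambda passed as the string argument works just as well. A minimal sketch with an inline document (so no index.html file is needed):

```python
from bs4 import BeautifulSoup, Comment

html = "<b><!--first comment--></b><p>text</p><!--second comment-->"
soup = BeautifulSoup(html, "html.parser")

# A lambda can serve as the predicate instead of a named function
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(comments)   # ['first comment', 'second comment']
```

Note that ordinary NavigableString objects such as the "text" inside <p> are filtered out, because isinstance() is checking specifically for the Comment subclass.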

Beautiful Soup – Find Elements by ID

In an HTML document, usually each element is assigned a unique ID. This enables the value of an element to be extracted by front-end code such as a JavaScript function. With BeautifulSoup, you can find the contents of a given element by its ID. This can be achieved with the find() and find_all() methods, as well as with select() and select_one().

Using the find() method

The find() method of the BeautifulSoup object searches for the first element that satisfies the given criteria. Let us use the following HTML script (as index.html) for the purpose −

```html
<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <form>
         <input type="text" id="nm" name="name">
         <input type="text" id="age" name="age">
         <input type="text" id="marks" name="marks">
      </form>
   </body>
</html>
```

The following Python code finds the element with the id nm −

Example

```python
from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
obj = soup.find(id="nm")
print(obj)
```

Output

<input id="nm" name="name" type="text"/>

Using find_all()

The find_all() method also accepts a filter argument. It returns a list of all the elements with the given id. An HTML document usually contains just one element with a particular id; hence, using find() instead of find_all() is preferable when searching for a given id.

Example

```python
from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
obj = soup.find_all(id="nm")
print(obj)
```

Output

[<input id="nm" name="name" type="text"/>]

Note that the find_all() method returns a list. The find_all() method also has a limit parameter. Setting limit=1 makes find_all() equivalent to find() −

obj = soup.find_all(id="nm", limit=1)

Using the select() method

The select() method in the BeautifulSoup class accepts a CSS selector as an argument. The # symbol is the CSS selector for an id; it, followed by the value of the required id, is passed to the select() method.
It works like the find_all() method.

Example

```python
from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
obj = soup.select("#nm")
print(obj)
```

Output

[<input id="nm" name="name" type="text"/>]

Using select_one()

Like the find_all() method, the select() method returns a list. There is also a select_one() method that returns the first tag matching the given selector.

Example

```python
from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
obj = soup.select_one("#nm")
print(obj)
```

Output

<input id="nm" name="name" type="text"/>
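The approaches can be compared side by side in one self-contained snippet (the form markup is inlined here instead of being read from index.html):

```python
from bs4 import BeautifulSoup

html = """
<form>
   <input type="text" id="nm" name="name">
   <input type="text" id="age" name="age">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

# find() with the id keyword returns the first matching Tag
print(soup.find(id="nm")["name"])        # name

# select_one() does the same job via a CSS id selector
print(soup.select_one("#age")["name"])   # age

# Both return None when no element carries the given id
print(soup.find(id="missing"))           # None
```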

Beautiful Soup – Scraping Paragraphs from HTML

One of the most frequently appearing tags in an HTML document is the <p> tag, which marks a paragraph of text. With Beautiful Soup, you can easily extract paragraphs from the parsed document tree. In this chapter, we shall discuss the following ways of scraping paragraphs with the help of the BeautifulSoup library −

Scraping an HTML paragraph with the <p> tag
Scraping an HTML paragraph with the find_all() method
Scraping an HTML paragraph with the select() method

We shall use the following HTML document for these exercises −

```html
<html>
   <head>
      <title>BeautifulSoup - Scraping Paragraph</title>
   </head>
   <body>
      <p id="para1">The quick, brown fox jumps over a lazy dog.</p>
      <h2>Hello</h2>
      <p>DJs flock by when MTV ax quiz prog.</p>
      <p>Junk MTV quiz graced by fox whelps.</p>
      <p>Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>
```

Scraping by <p> tag

The easiest way to search a parse tree is to search for a tag by its name. Hence, the expression soup.p points to the first <p> tag in the parsed document.

para = soup.p

To fetch all the subsequent <p> tags, you can run a loop until the soup object is exhausted of all the <p> tags. The following program displays the prettified output of all the paragraph tags.

Example

```python
from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

para = soup.p
print(para.prettify())
while True:
   p = para.find_next("p")
   if p is None:
      break
   print(p.prettify())
   para = p
```

Output

<p id="para1">
 The quick, brown fox jumps over a lazy dog.
</p>

<p>
 DJs flock by when MTV ax quiz prog.
</p>

<p>
 Junk MTV quiz graced by fox whelps.
</p>

<p>
 Bawds jog, flick quartz, vex nymphs.
</p>

Using the find_all() method

The find_all() method is more comprehensive. You can pass various types of filters such as a tag, attributes or a string to this method. In this case, we want to fetch the contents of the <p> tags. In the following code, the find_all() method returns a list of all the <p> elements.
Example

```python
from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

paras = soup.find_all("p")
for para in paras:
   print(para.prettify())
```

Output

<p id="para1">
 The quick, brown fox jumps over a lazy dog.
</p>

<p>
 DJs flock by when MTV ax quiz prog.
</p>

<p>
 Junk MTV quiz graced by fox whelps.
</p>

<p>
 Bawds jog, flick quartz, vex nymphs.
</p>

We can use another approach to find all the <p> tags. To begin with, obtain a list of all the tags using find_all() and check whether the Tag.name of each equals "p".

Example

```python
from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

tags = soup.find_all()
paras = [tag.contents for tag in tags if tag.name == "p"]
print(paras)
```

The find_all() method also has an attrs parameter. It is useful when you want to extract a <p> tag with specific attributes. For example, in the given document, the first <p> element has id="para1". To fetch it, we need to modify the call as −

paras = soup.find_all("p", attrs={"id": "para1"})

Using the select() method

The select() method is essentially used to obtain data using a CSS selector. However, you can also pass a plain tag name to it. Here, we pass the <p> tag to the select() method. The select_one() method is also available; it fetches the first occurrence of the <p> tag.

Example

```python
from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

paras = soup.select("p")
print(paras)
```

Output

[<p id="para1">The quick, brown fox jumps over a lazy dog.</p>, <p>DJs flock by when MTV ax quiz prog.</p>, <p>Junk MTV quiz graced by fox whelps.</p>, <p>Bawds jog, flick quartz, vex nymphs.</p>]

To filter out <p> tags with a certain id, use a for loop as follows −

Example

```python
from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

tags = soup.select("p")
for tag in tags:
   if tag.has_attr("id") and tag["id"] == "para1":
      print(tag.contents)
```

Output

['The quick, brown fox jumps over a lazy dog.']
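Often, what is actually wanted is just the text of each paragraph rather than the Tag objects. A short sketch, with a subset of the sample markup inlined so it runs without index.html −

```python
from bs4 import BeautifulSoup

html = """
<p id="para1">The quick, brown fox jumps over a lazy dog.</p>
<h2>Hello</h2>
<p>DJs flock by when MTV ax quiz prog.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text() gives just the readable text of each paragraph
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)

# A CSS selector can combine the tag name and the id in one expression
print(soup.select_one("p#para1").get_text())
```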

Beautiful Soup – Get all HTML Tags

Tags in HTML are like keywords in a traditional programming language such as Python or Java. Tags have a predefined behaviour, according to which their content is rendered by the browser. With Beautiful Soup, it is possible to collect all the tags in a given HTML document.

The simplest way to obtain a list of tags is to parse the web page into a soup object and call the find_all() method without any argument. It returns a ResultSet, giving us all the tags. Let us extract the list of all the tags in Google's homepage.

Example

```python
from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
tags = soup.find_all()
print([tag.name for tag in tags])
```

Output

['html', 'head', 'meta', 'meta', 'title', 'script', 'style', 'style', 'script', 'body', 'script', 'div', 'div', 'nobr', 'b', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'u', 'div', 'nobr', 'span', 'span', 'span', 'a', 'a', 'a', 'div', 'div', 'center', 'br', 'div', 'img', 'br', 'br', 'form', 'table', 'tr', 'td', 'td', 'input', 'input', 'input', 'input', 'input', 'div', 'input', 'br', 'span', 'span', 'input', 'span', 'span', 'input', 'script', 'input', 'td', 'a', 'input', 'script', 'div', 'div', 'br', 'div', 'style', 'div', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'span', 'div', 'div', 'a', 'a', 'a', 'a', 'p', 'a', 'a', 'script', 'script', 'script']

Naturally, you may get a list in which a certain tag appears more than once. To obtain a list of unique tags (avoiding duplication), construct a set from the list of tag objects.
Change the print statement in the above code to −

Example

print({tag.name for tag in tags})

Output

{'body', 'head', 'p', 'a', 'meta', 'tr', 'nobr', 'script', 'br', 'img', 'b', 'form', 'center', 'span', 'div', 'input', 'u', 'title', 'style', 'td', 'table', 'html'}

To obtain the tags that have some text associated with them, check the string property and print it if it is not None −

```python
tags = soup.find_all()
for tag in tags:
   if tag.string is not None:
      print(tag.name, tag.string)
```

There may also be some singleton tags without text but with one or more attributes, as in the <img> tag; a similar loop checking the attrs property can list those.

In the following code, the HTML string is not a complete HTML document, in the sense that the <html> and <body> tags are not given. But the html5lib and lxml parsers add these tags on their own while parsing the document tree. Hence, when we extract the tag list, the additional tags will also be seen.

Example

```python
html = """
<h1 style="color:blue;text-align:center;">This is a heading</h1>
<p style="color:red;">This is a paragraph.</p>
<p>This is another paragraph</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html5lib")
tags = soup.find_all()
print({tag.name for tag in tags})
```

Output

{'head', 'html', 'p', 'h1', 'body'}
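Besides a set of unique names, a frequency count of the tags is often useful; the standard library Counter handles that directly. A small sketch with inline markup −

```python
from bs4 import BeautifulSoup
from collections import Counter

html = """
<h1>This is a heading</h1>
<p>This is a paragraph.</p>
<p>This is another paragraph</p>
"""
soup = BeautifulSoup(html, "html.parser")

names = [tag.name for tag in soup.find_all()]
print(set(names))       # unique tag names
print(Counter(names))   # how often each tag occurs
```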