Beautiful Soup – Scraping List from HTML

Web pages often present important data in the form of ordered or unordered lists. With Beautiful Soup, we can easily extract the HTML list elements, bring the data into Python objects and store it in databases for further analysis. In this chapter, we shall use the find_all() and select() methods to scrape list data from an HTML document.

The easiest way to search a parse tree is to search for a tag by its name. soup.<tag> fetches the contents of the given tag. HTML provides the <ol> and <ul> tags to compose ordered and unordered lists. Like any other tag, we can fetch the contents of these tags. We shall use the following HTML document −

<html>
<body>
   <h2>Departmentwise Employees</h2>
   <ul id="dept">
      <li>Accounts</li>
      <ul id="acc">
         <li>Anand</li>
         <li>Mahesh</li>
      </ul>
      <li>HR</li>
      <ol id="HR">
         <li>Rani</li>
         <li>Ankita</li>
      </ol>
   </ul>
</body>
</html>

Scraping lists by Tag

In the above HTML document, we have a top-level <ul> list, inside which there is another <ul> tag and an <ol> tag. We first parse the document into a soup object and retrieve the contents of the first <ul> in the soup.ul Tag object.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
lst = soup.ul
print (lst)

Output

<ul id="dept">
<li>Accounts</li>
<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>
</ul>

Change the value of lst to point to the <ol> element to get the inner ordered list.

lst = soup.ol

Output

<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>

Using the select() method

The select() method is essentially used to obtain data with CSS selectors. However, you can also pass a plain tag name to it. Here, we pass the ol tag to the select() method. The select_one() method is also available; it fetches only the first occurrence of the given selector (see the sketch at the end of this chapter).

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
lst = soup.select("ol")
print (lst)

Output

[<ol id="HR">
<li>Rani</li>
<li>Ankita</li>
</ol>]

Using the find_all() method

The find() and find_all() methods are more comprehensive. You can pass various types of filters such as a tag name, attributes or a string to these methods. In this case, we want to fetch the contents of a list tag. In the following code, the find_all() method returns a list of all elements in the <ul> tags.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
lst = soup.find_all("ul")
print (lst)

We can refine the search filter by including the attrs argument. In our HTML document, the <ul> and <ol> tags have their respective id attributes specified. So, let us fetch the contents of the <ul> element having id="acc".

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
lst = soup.find_all("ul", {"id": "acc"})
print (lst)

Output

[<ul id="acc">
<li>Anand</li>
<li>Mahesh</li>
</ul>]

Here is another example. We collect all the <li> elements whose inner text starts with "A". The find_all() method takes a keyword argument string; its value can be a function that receives the text of each tag, and the tag is included in the result only if that function returns True.

Example

from bs4 import BeautifulSoup

def startingwith(ch):
   return ch.startswith("A")

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
lst = soup.find_all("li", string=startingwith)
print (lst)

Output

[<li>Accounts</li>, <li>Anand</li>, <li>Ankita</li>]
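As a minimal sketch of the select_one() method mentioned above (assuming the same index.html used in this chapter), the following code fetches only the first matching <li> element instead of a list of all matches −

from bs4 import BeautifulSoup

# select_one() returns a single Tag (or None), not a ResultSet.
fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
first_item = soup.select_one("li")   # expected to be <li>Accounts</li> for the document above
print (first_item)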
Beautiful Soup – Scrape HTML Content

The process of extracting data from websites is called web scraping. A web page may have URLs, email addresses, images or any other content, which can be stored in a file or a database. Searching a website manually is a cumbersome process. There are different web scraping tools that automate the process.

Web scraping is sometimes restricted by a website's robots.txt file. Some popular sites provide APIs to access their data in a structured way. Unethical web scraping may result in your IP getting blocked.

Python is widely used for web scraping. The Python standard library has the urllib package, which can be used to extract data from HTML pages. Since the urllib package is bundled with the standard library, it need not be installed.

The urllib package is an HTTP client for the Python programming language. The urllib.request module is useful when we want to open and read URLs. Other modules in the urllib package are −

urllib.error defines the exceptions and errors raised by urllib.request.
urllib.parse is used for parsing URLs.
urllib.robotparser is used for parsing robots.txt files.

Use the urlopen() function in the urllib.request module to read the content of a web page from a website.

import urllib.request

response = urllib.request.urlopen("http://python.org/")
html = response.read()

You can also use the requests library for this purpose. You need to install it before using it.

pip3 install requests

In the code below, the homepage of https://www.tutorialspoint.com is fetched −

from bs4 import BeautifulSoup
import requests

url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)

The content obtained by either of the above two methods is then parsed with Beautiful Soup.
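As a minimal sketch of that final parsing step (reusing the requests example above), the downloaded content can be handed to the BeautifulSoup constructor and queried −

from bs4 import BeautifulSoup
import requests

url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)

# Parse the downloaded bytes with the built-in html.parser.
soup = BeautifulSoup(req.content, "html.parser")
print (soup.title)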
Beautiful Soup – Navigating by Tags

Tags are among the most important elements in any HTML document, and a tag may contain other tags or strings (the tag's children). Beautiful Soup provides different ways to navigate and iterate over a tag's children.

The easiest way to search a parse tree is to search for a tag by its name.

soup.head

soup.head returns the contents put inside the <head> .. </head> element of an HTML page. Consider the following HTML page to be scraped −

<html>
   <head>
      <title>TutorialsPoint</title>
      <script>
         document.write("Welcome to TutorialsPoint");
      </script>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>

The following code extracts the contents of the <head> element.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")
print(soup.head)

Output

<head>
<title>TutorialsPoint</title>
<script>
document.write("Welcome to TutorialsPoint");
</script>
</head>

soup.body

Similarly, to return the contents of the body part of the HTML page, use soup.body.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")
print (soup.body)

Output

<body>
<h1>Tutorialspoint Online Library</h1>
<p><b>It's all Free</b></p>
</body>

You can also extract a specific tag (like the first <h1> tag) inside the <body> tag.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")
print(soup.body.h1)

Output

<h1>Tutorialspoint Online Library</h1>

soup.p

Our HTML file contains a <p> tag. We can extract the contents of this tag.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")
print(soup.p)

Output

<p><b>It's all Free</b></p>

Tag.contents

A Tag object may have one or more PageElements. The Tag object's contents property returns a list of all the elements included in it. Let us find the elements in the <head> tag of our index.html file.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")
tag = soup.head
print (tag.contents)

Output

['\n', <title>TutorialsPoint</title>, '\n', <script>
document.write("Welcome to TutorialsPoint");
</script>, '\n']

Tag.children

The structure of tags in an HTML script is hierarchical. The elements are nested one inside the other. For example, the top-level <html> tag includes the <head> and <body> tags, and each may have other tags in it. The Tag object has a children property that returns a list iterator containing the enclosed PageElements.

To demonstrate the children property, we shall use the following HTML script (index.html). In the <body> section, there is a top-level <ul> list element with two more <ul> elements nested inside it. In other words, the body tag has a top-level list, and each of its list items has another list under it.

<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul>
         <li>Accounts</li>
         <ul>
            <li>Anand</li>
            <li>Mahesh</li>
         </ul>
         <li>HR</li>
         <ul>
            <li>Rani</li>
            <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>

The following Python code gives a list of all the children elements of the top-level <ul> tag.
Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")
tag = soup.ul
print (list(tag.children))

Output

['\n', <li>Accounts</li>, '\n', <ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>, '\n', <li>HR</li>, '\n', <ul>
<li>Rani</li>
<li>Ankita</li>
</ul>, '\n']

Since the .children property returns a list_iterator, we can use a for loop to traverse the hierarchy.

Example

for child in tag.children:
   print (child)

Output

<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>

Tag.find_all()

This method returns a result set containing all the tags that match the tag name provided as the argument. Let us consider the following HTML page (index.html) for this −

<html>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
      <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
      <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
      <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
      <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
      <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
   </body>
</html>

The following code lists all the elements with the <a> tag.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")
result = soup.find_all("a")
print (result)

Output

[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>, <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>, <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>, <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>, <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>]
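Since find_all() returns a ResultSet that can be iterated like a list, the following is a minimal sketch (assuming the same index.html as above) that pulls the href attribute and the link text out of each matched <a> tag −

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")

# Each item in the ResultSet is a Tag; get() reads an attribute and .string gives the inner text.
for link in soup.find_all("a"):
   print (link.get("href"), link.string)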
Beautiful Soup – Find Elements by Attribute

Both the find() and find_all() methods are meant to find one or all the tags in the document that match the arguments passed to them. You can pass the attrs parameter to these methods. The value of attrs must be a dictionary with one or more tag attributes and their values.

For the purpose of checking the behaviour of these methods, we shall use the following HTML document (index.html) −

<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <form>
         <input type="text" id="nm" name="name">
         <input type="text" id="age" name="age">
         <input type="text" id="marks" name="marks">
      </form>
   </body>
</html>

Using find_all()

The following program returns a list of all the tags having the type="text" attribute.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
obj = soup.find_all(attrs={"type": "text"})
print (obj)

Output

[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Using find()

The find() method returns the first tag in the parsed document that has the given attributes (see the sketch at the end of this chapter).

obj = soup.find(attrs={"name": "marks"})

Using select()

The select() method can be called with a CSS attribute selector; the attribute to be matched is put inside square brackets. It returns a list of all the tags that have the given attribute. In the following code, the select() method returns all the tags with a type attribute.

Example

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")
obj = soup.select("[type]")
print (obj)

Output

[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]

Using select_one()

The select_one() method is similar, except that it returns the first tag satisfying the given filter.

obj = soup.select_one('[name="marks"]')

Output

<input id="marks" name="marks" type="text"/>
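The find() call shown above is only a one-liner. The following is a minimal, self-contained sketch (assuming the same index.html) showing it in full; the result is expected to be the single <input> element whose name attribute equals "marks" −

from bs4 import BeautifulSoup

fp = open("index.html")
soup = BeautifulSoup(fp, "html.parser")

# find() stops at the first match and returns a single Tag (or None if nothing matches).
obj = soup.find(attrs={"name": "marks"})
print (obj)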
Beautiful Soup – Souping the Page

It is time to test our Beautiful Soup package on an HTML page (we take the web page https://www.tutorialspoint.com/index.htm; you can choose any other web page you want) and extract some information from it.

In the code below, we try to extract the title of the webpage −

Example

from bs4 import BeautifulSoup
import requests

url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
print(soup.title)

Output

<title>Online Courses and eBooks Library</title>

One common task is to extract all the URLs within a webpage. For that we just need to add the following lines of code −

for link in soup.find_all('a'):
   print(link.get('href'))

Output

Shown below is the partial output of the above loop −

https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/business/index.asp
https://www.tutorialspoint.com/market/teach_with_us.jsp
https://www.facebook.com/tutorialspointindia
https://www.instagram.com/tutorialspoint_/
https://twitter.com/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/categories/development
https://www.tutorialspoint.com/categories/it_and_software
https://www.tutorialspoint.com/categories/data_science_and_ai_ml
https://www.tutorialspoint.com/categories/cyber_security
https://www.tutorialspoint.com/categories/marketing
https://www.tutorialspoint.com/categories/office_productivity
https://www.tutorialspoint.com/categories/business
https://www.tutorialspoint.com/categories/lifestyle
https://www.tutorialspoint.com/latest/prime-packs
https://www.tutorialspoint.com/market/index.asp
https://www.tutorialspoint.com/latest/ebooks
…
…

To parse a web page stored locally in the current working directory, obtain a file object pointing to the HTML file and use it as an argument to the BeautifulSoup() constructor.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")
print(soup)

Output

<html>
<head>
<title>Hello World</title>
</head>
<body>
<h1 style="text-align:center;">Hello World</h1>
</body>
</html>

You can also use a string that contains an HTML script as the constructor's argument, as follows −

from bs4 import BeautifulSoup

html = '''
<html>
   <head>
      <title>Hello World</title>
   </head>
   <body>
      <h1 style="text-align:center;">Hello World</h1>
   </body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup)

Beautiful Soup uses the best available parser to parse the document. It will use an HTML parser unless you tell it otherwise.
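As a hedged illustration of choosing a parser explicitly, the sketch below passes a different parser name to the constructor. It assumes the third-party lxml package is installed (pip3 install lxml); otherwise Beautiful Soup reports that the requested parser is missing.

from bs4 import BeautifulSoup

html = "<html><body><h1>Hello World</h1></body></html>"

# Explicitly request the lxml parser instead of letting Beautiful Soup pick one.
soup = BeautifulSoup(html, "lxml")
print(soup.h1)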
Beautiful Soup – Overview

In today's world, we have tons of unstructured data/information (mostly web data) available freely. Sometimes the freely available data is easy to read and sometimes not. No matter how your data is available, web scraping is a very useful tool to transform unstructured data into structured data that is easier to read and analyze. In other words, web scraping is a way to collect, organize and analyze this enormous amount of data. So let us first understand what web scraping is.

Introduction to Beautiful Soup

Beautiful Soup is a Python library named after a Lewis Carroll poem of the same name in "Alice's Adventures in Wonderland". As the name suggests, Beautiful Soup parses messy web data and helps to organize and format it by fixing bad HTML and presenting it to us in easily traversable structures. In short, Beautiful Soup is a Python package which allows us to pull data out of HTML and XML documents.

HTML tree Structure

Before we look into the functionality provided by Beautiful Soup, let us first understand the HTML tree structure. The root element in the document tree is html. Every other element can have parents, children and siblings, and this is determined by its position in the tree structure. To move among HTML elements, attributes and text, you have to move among the nodes of the tree structure.

Let us suppose the webpage translates to an HTML document as follows −

<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>

This simply means that, for the above HTML document, we have an HTML tree structure in which html is the root, head and body are its children, and the title, h1 and p elements (with the text they enclose) are nested further down.
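As a minimal sketch (assuming the HTML above is available as a string), the tree structure can be explored with Beautiful Soup by moving from a node to its name, its children and its parent −

from bs4 import BeautifulSoup

html = """
<html>
   <head>
      <title>TutorialsPoint</title>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# The root of the tree and its immediate child tags.
print (soup.html.name)                                                  # html
print ([child.name for child in soup.html.find_all(recursive=False)])   # ['head', 'body']

# Moving down to a leaf and back up to its parent.
print (soup.body.p.b.string)        # It's all Free
print (soup.title.parent.name)      # head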
Beautiful Soup – Remove all Styles

This chapter explains how to remove all styles from an HTML document. Cascading Style Sheets (CSS) are used to control the appearance of different aspects of an HTML document. This includes rendering text with a specific font, color, alignment, spacing, etc. CSS is applied to HTML tags in different ways.

One way is to define the styles in a CSS file and include it in the HTML script with a <link> tag in the <head> section of the document. For example −

Example

<html>
   <head>
      <link rel="stylesheet" href="style.css">
   </head>
   <body>
      . . .
      . . .
   </body>
</html>

The different tags in the body part of the HTML script will use the definitions in the style.css file.

Another approach is to define the style configuration inside the <head> part of the HTML document itself. Tags in the body part will be rendered by using the definitions provided internally. Example of internal styling −

<html>
   <head>
      <style>
         p {
            text-align: center;
            color: red;
         }
      </style>
   </head>
   <body>
      <p>para1.</p>
      <p id="para1">para2</p>
      <p>para3</p>
   </body>
</html>

In either case, to remove the styles programmatically, simply remove the head tag from the soup object (see the sketch at the end of this chapter for a variant that keeps the rest of the head intact).

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
soup.head.extract()

A third approach is to define the styles inline, by including the style attribute in the tag itself. The style attribute may contain one or more style property definitions such as color, size, etc. For example −

<body>
   <h1 style="color:blue;text-align:center;">This is a heading</h1>
   <p style="color:red;">This is a paragraph.</p>
</body>

To remove such inline styles from an HTML document, you need to check whether the attrs dictionary of a tag object has a style key defined in it, and if so, delete it.

tags = soup.find_all()
for tag in tags:
   if tag.has_attr('style'):
      del tag.attrs['style']
print (soup)

The following code removes the inline styles as well as the head tag itself, so that the resultant HTML tree will not have any styles left.

html = '''
<html>
   <head>
      <link rel="stylesheet" href="style.css">
   </head>
   <body>
      <h1 style="color:blue;text-align:center;">This is a heading</h1>
      <p style="color:red;">This is a paragraph.</p>
   </body>
</html>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
soup.head.extract()
tags = soup.find_all()
for tag in tags:
   if tag.has_attr('style'):
      del tag.attrs['style']
print (soup.prettify())

Output

<html>
 <body>
  <h1>
   This is a heading
  </h1>
  <p>
   This is a paragraph.
  </p>
 </body>
</html>
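Removing the whole <head> also discards elements such as <title>. As a hedged alternative sketch (not part of the original example; the HTML below is made up for illustration), only the <link> and <style> tags can be removed with decompose(), which deletes a tag and its contents from the tree −

from bs4 import BeautifulSoup

html = '''
<html>
   <head>
      <title>Styled page</title>
      <link rel="stylesheet" href="style.css">
      <style>p { color: red; }</style>
   </head>
   <body>
      <p style="color:red;">This is a paragraph.</p>
   </body>
</html>
'''

soup = BeautifulSoup(html, "html.parser")

# Remove only the style-related tags, keeping the rest of the <head> (e.g. <title>) intact.
for tag in soup.find_all(["link", "style"]):
   tag.decompose()

# Strip inline style attributes as before.
for tag in soup.find_all():
   if tag.has_attr("style"):
      del tag.attrs["style"]

print (soup.prettify())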
Beautiful Soup – Inspect Data Source

In order to scrape a web page with BeautifulSoup and Python, your first step for any web scraping project should be to explore the website that you want to scrape. So, first visit the website to understand the site structure before you start extracting the information that is relevant for you.

Let us visit TutorialsPoint's Python Tutorial home page. Open https://www.tutorialspoint.com/python3/index.htm in your browser.

Developer tools can help you understand the structure of a website. All modern browsers come with developer tools installed. If you are using the Chrome browser, open the Developer Tools from the top-right menu button (⋮) by selecting More Tools → Developer Tools.

With Developer tools, you can explore the site's document object model (DOM) to better understand your source. Select the Elements tab in the developer tools. You'll see a structure with clickable HTML elements.

The Tutorial page shows the table of contents in the left sidebar. Right click on any chapter and choose the Inspect option. In the Elements tab, locate the tag that corresponds to the TOC list.

Right click on that HTML element, copy it, and paste it into any editor. The HTML script of the <ul>..</ul> element is now obtained −

<ul class="toc chapters">
   <li class="heading">Python 3 Basic Tutorial</li>
   <li class="current-chapter"><a href="/python3/index.htm">Python 3 – Home</a></li>
   <li><a href="/python3/python3_whatisnew.htm">What is New in Python 3</a></li>
   <li><a href="/python3/python_overview.htm">Python 3 – Overview</a></li>
   <li><a href="/python3/python_environment.htm">Python 3 – Environment Setup</a></li>
   <li><a href="/python3/python_basic_syntax.htm">Python 3 – Basic Syntax</a></li>
   <li><a href="/python3/python_variable_types.htm">Python 3 – Variable Types</a></li>
   <li><a href="/python3/python_basic_operators.htm">Python 3 – Basic Operators</a></li>
   <li><a href="/python3/python_decision_making.htm">Python 3 – Decision Making</a></li>
   <li><a href="/python3/python_loops.htm">Python 3 – Loops</a></li>
   <li><a href="/python3/python_numbers.htm">Python 3 – Numbers</a></li>
   <li><a href="/python3/python_strings.htm">Python 3 – Strings</a></li>
   <li><a href="/python3/python_lists.htm">Python 3 – Lists</a></li>
   <li><a href="/python3/python_tuples.htm">Python 3 – Tuples</a></li>
   <li><a href="/python3/python_dictionary.htm">Python 3 – Dictionary</a></li>
   <li><a href="/python3/python_date_time.htm">Python 3 – Date & Time</a></li>
   <li><a href="/python3/python_functions.htm">Python 3 – Functions</a></li>
   <li><a href="/python3/python_modules.htm">Python 3 – Modules</a></li>
   <li><a href="/python3/python_files_io.htm">Python 3 – Files I/O</a></li>
   <li><a href="/python3/python_exceptions.htm">Python 3 – Exceptions</a></li>
</ul>

We can now load this script into a BeautifulSoup object to parse the document tree, as shown in the sketch below.
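This is a minimal sketch assuming the copied <ul> markup is saved in a string named toc_html (only the first few entries are reproduced here for brevity). It parses the list and extracts each chapter title with its relative link −

from bs4 import BeautifulSoup

# Assumption: toc_html holds the <ul class="toc chapters"> ... </ul> markup copied above.
toc_html = '''
<ul class="toc chapters">
   <li class="heading">Python 3 Basic Tutorial</li>
   <li><a href="/python3/python3_whatisnew.htm">What is New in Python 3</a></li>
   <li><a href="/python3/python_overview.htm">Python 3 – Overview</a></li>
</ul>
'''

soup = BeautifulSoup(toc_html, "html.parser")

# Each chapter entry is an <a> tag inside an <li>; print its text and relative URL.
for link in soup.find_all("a"):
   print (link.string, "->", link.get("href"))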
Beautiful Soup – Modifying the Tree

One of the powerful features of the Beautiful Soup library is the ability to manipulate the parsed HTML or XML document and modify its contents. The library has different functions to perform the following operations −

Add contents or a new tag to an existing tag of the document
Insert contents before or after an existing tag or string
Clear the contents of an already existing tag
Modify the contents of a tag element (a sketch of this step appears at the end of the chapter)

Add content

You can add to the content of an existing tag by using the append() method on a Tag object. It works like the append() method of Python's list object. In the following example, the HTML script has a <p> tag. With append(), additional text is appended to it.

Example

from bs4 import BeautifulSoup

markup = '<p>Hello</p>'
soup = BeautifulSoup(markup, 'html.parser')
print (soup)

tag = soup.p
tag.append(" World")
print (soup)

Output

<p>Hello</p>
<p>Hello World</p>

With the append() method, you can add a new tag at the end of an existing tag. First create a new Tag object with the new_tag() method and then pass it to the append() method.

Example

from bs4 import BeautifulSoup, Tag

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

tag1 = soup.new_tag('i')
tag1.string = 'World'
tag.append(tag1)
print (soup.prettify())

Output

<b>
 Hello
 <i>
  World
 </i>
</b>

If you have to add a string to the document, you can append a NavigableString object.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

new_string = NavigableString(" World")
tag.append(new_string)
print (soup.prettify())

Output

<b>
 Hello
 World
</b>

From Beautiful Soup version 4.7 onwards, the extend() method has been added to the Tag class. It adds all the elements in a list to the tag.

Example

from bs4 import BeautifulSoup

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

vals = ['World.', 'Welcome to ', 'TutorialsPoint']
tag.extend(vals)
print (soup.prettify())

Output

<b>
 Hello
 World.
 Welcome to
 TutorialsPoint
</b>

Insert Contents

Instead of adding a new element at the end, you can use the insert() method to add an element at a given position in the list of children of a Tag element. The insert() method in Beautiful Soup behaves like insert() on a Python list object. In the following example, a new string is added to the <b> tag at position 1. The resultant parsed document shows the result.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

tag.insert(1, "Tutorial ")
print (soup.prettify())

Output

<b>
 Excellent
 Tutorial
</b>
<u>
 from TutorialsPoint
</u>

Beautiful Soup also has insert_before() and insert_after() methods. Their respective purpose is to insert a tag or a string before or after a given Tag object. The following code shows that the string "Python Tutorial" is added after the <b> tag.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

tag.insert_after("Python Tutorial")
print (soup.prettify())

Output

<b>
 Excellent
</b>
Python Tutorial
<u>
 from TutorialsPoint
</u>

On the other hand, the insert_before() method is used below to add the text "Here is an " before the <b> tag.
tag.insert_before("Here is an ")
print (soup.prettify())

Output

Here is an
<b>
 Excellent
</b>
Python Tutorial
<u>
 from TutorialsPoint
</u>

Clear the Contents

Beautiful Soup provides more than one way to remove the contents of an element from the document tree. Each of these methods has its unique features.

The clear() method is the most straightforward. It simply removes the contents of the specified Tag element. The following example shows its usage.

Example

from bs4 import BeautifulSoup, NavigableString

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.find('u')

tag.clear()
print (soup.prettify())

Output

<b>
 Excellent
</b>
<u>
</u>

It can be seen that the clear() method removes the contents, keeping the tag intact. For the following example, we parse the HTML document given below and call the clear() method on all its tags.

<html>
   <body>
      <p> The quick, brown fox jumps over a lazy dog.</p>
      <p> DJs flock by when MTV ax quiz prog.</p>
      <p> Junk MTV quiz graced by fox whelps.</p>
      <p> Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>

Here is the Python code using the clear() method.

Example

from bs4 import BeautifulSoup

fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')

tags = soup.find_all()
for tag in tags:
   tag.clear()
print (soup.prettify())

Output

<html>
</html>

The extract() method removes either a tag or a string from the document tree, and returns the object that was removed.

Example

from bs4 import BeautifulSoup

fp = open('index.html')
soup = BeautifulSoup(fp, 'html.parser')

tags = soup.find_all()
for tag in tags:
   obj = tag.extract()
   print ("Extracted:", obj)

print (soup)

Output

Extracted: <html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Extracted: <body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
Extracted: <p> The quick, brown fox jumps over a lazy dog.</p>
Extracted: <p> DJs flock by when MTV ax quiz prog.</p>
Extracted: <p> Junk MTV quiz graced by fox whelps.</p>
Extracted: <p> Bawds jog, flick quartz, vex nymphs.</p>
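The operations listed at the beginning of this chapter also mention modifying the contents of a tag element, but that step is not shown in the preserved text. The following is a hedged sketch of one common way to do it, using the .string property and the replace_with() method −

from bs4 import BeautifulSoup

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')

# Assigning to .string replaces whatever the tag previously contained.
soup.b.string = "Outstanding "

# replace_with() swaps one element (a tag or a string) for another.
soup.u.string.replace_with("from Tutorials Point")

print (soup)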
Beautiful Soup – Web Scraping

Scraping is simply a process of extracting, copying and screening data from various sources. When we scrape or extract data or feeds from the web (such as from web pages or websites), it is termed web scraping. So, web scraping (which is also known as web data extraction or web harvesting) is the extraction of data from the web. In short, web scraping provides a way for developers to collect and analyze data from the internet.

Why Web Scraping?

Web scraping provides one of the great tools to automate most of the things a human does while browsing. Web scraping is used in enterprises in a variety of ways −

Data for Research

A smart analyst (like a researcher or journalist) uses a web scraper instead of manually collecting and cleaning data from websites.

Products, Prices & Popularity Comparison

Currently there are a couple of services which use web scrapers to collect data from numerous online sites and use it to compare product popularity and prices.

SEO Monitoring

There are numerous SEO tools such as Ahrefs, Seobility, SEMrush, etc., which are used for competitive analysis and for pulling data from your client's websites.

Search Engines

There are some big IT companies whose business solely depends on web scraping.

Sales and Marketing

The data gathered through web scraping can be used by marketers to analyze different niches and competitors, or by sales specialists for selling content marketing or social media promotion services.

Why Python for Web Scraping?

Python is one of the most popular languages for web scraping, as it can handle most web crawling related tasks very easily. Below are some of the points on why to choose Python for web scraping −

Ease of Use

Most developers agree that Python is very easy to code in. We don't have to use any curly braces "{ }" or semicolons ";" anywhere, which makes the code more readable and easy to use while developing web scrapers.

Huge Library Support

Python provides a huge set of libraries for different requirements, so it is appropriate for web scraping as well as for data visualization, machine learning, etc.

Easily Explicable Syntax

Python is a very readable programming language, as its syntax is easy to understand. Python is very expressive, and code indentation helps the users differentiate different blocks or scopes in the code.

Dynamically-typed Language

Python is a dynamically-typed language, which means the data assigned to a variable tells what type of variable it is. This saves a lot of time and makes work faster.

Huge Community

The Python community is huge, and it helps you whenever you get stuck while writing code.