Beautiful Soup – Modifying the Tree

One of the powerful features of the Beautiful Soup library is the ability to manipulate the parsed HTML or XML document and modify its contents.

Beautiful Soup provides methods to perform the following operations:

- Add contents or a new tag to an existing tag of the document
- Insert contents before or after an existing tag or string
- Clear the contents of an existing tag
- Modify the contents of a tag element

Add Content

You can add to the content of an existing tag by using the append() method on a Tag object. It works like the append() method of Python's list object.

In the following example, the HTML markup has a <p> tag. With append(), additional text is added at the end of the tag.

Example

    from bs4 import BeautifulSoup

    markup = "<p>Hello</p>"
    soup = BeautifulSoup(markup, "html.parser")
    print(soup)

    tag = soup.p
    tag.append(" World")
    print(soup)

Output

    <p>Hello</p>
    <p>Hello World</p>

With the append() method, you can also add a new tag at the end of an existing tag. First create a new Tag object with the new_tag() method, and then pass it to append().

Example

    from bs4 import BeautifulSoup

    markup = "<b>Hello</b>"
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.b

    # Create an <i> tag, give it some text, and append it to <b>
    tag1 = soup.new_tag("i")
    tag1.string = "World"
    tag.append(tag1)
    print(soup.prettify())

Output

    <b>
     Hello
     <i>
      World
     </i>
    </b>

If you have to add a string to the document, you can append a NavigableString object.

Example

    from bs4 import BeautifulSoup, NavigableString

    markup = "<b>Hello</b>"
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.b

    new_string = NavigableString(" World")
    tag.append(new_string)
    print(soup.prettify())

Output

    <b>
     Hello
     World
    </b>

From Beautiful Soup version 4.7 onwards, the Tag class also has an extend() method. It adds all the elements of a list to the tag, in order.
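
The list analogy can be made concrete by inspecting a tag's .contents attribute, which holds its children as a list. The following is a minimal sketch rather than part of the original examples: the markup and the variable names after_append and after_extend are chosen here purely for illustration, and the extend() call assumes Beautiful Soup 4.7 or later.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not taken from a real page)
soup = BeautifulSoup("<b>Hello</b>", "html.parser")
tag = soup.b

# append() adds a single child at the end, like list.append()
tag.append(" World")
after_append = [str(child) for child in tag.contents]

# extend() adds every item of a list in order, like list.extend()
tag.extend(["!", " Welcome"])
after_extend = [str(child) for child in tag.contents]

print(after_append)  # ['Hello', ' World']
print(after_extend)  # ['Hello', ' World', '!', ' Welcome']
print(tag)           # <b>Hello World! Welcome</b>
```

Each appended string stays a separate child of the tag, which is why prettify() prints such strings on separate lines.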

Example

    from bs4 import BeautifulSoup

    markup = "<b>Hello</b>"
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.b

    vals = ["World.", "Welcome to ", "TutorialsPoint"]
    tag.extend(vals)
    print(soup.prettify())

Output

    <b>
     Hello
     World.
     Welcome to
     TutorialsPoint
    </b>

Insert Contents

Instead of adding a new element at the end, you can use the insert() method to add an element at a given position in the list of children of a Tag element. The insert() method in Beautiful Soup behaves similarly to insert() on a Python list object.

In the following example, a new string is added to the <b> tag at position 1, that is, after its existing first child. The output shows the modified document.

Example

    from bs4 import BeautifulSoup

    markup = "<b>Excellent </b><u>from TutorialsPoint</u>"
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.b

    tag.insert(1, "Tutorial ")
    print(soup.prettify())

Output

    <b>
     Excellent
     Tutorial
    </b>
    <u>
     from TutorialsPoint
    </u>

Beautiful Soup also has insert_before() and insert_after() methods. They insert a tag or a string immediately before or after a given Tag object, as a sibling rather than as a child. The following code adds the string "Python Tutorial" after the <b> tag.

Example

    from bs4 import BeautifulSoup

    markup = "<b>Excellent </b><u>from TutorialsPoint</u>"
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.b

    tag.insert_after("Python Tutorial")
    print(soup.prettify())

Output

    <b>
     Excellent
    </b>
    Python Tutorial
    <u>
     from TutorialsPoint
    </u>

Continuing with the same soup object, the insert_before() method is used below to add the text "Here is an " before the <b> tag.

    tag.insert_before("Here is an ")
    print(soup.prettify())

Output

    Here is an
    <b>
     Excellent
    </b>
    Python Tutorial
    <u>
     from TutorialsPoint
    </u>

Clear the Contents

Beautiful Soup provides more than one way to remove the contents of an element from the document tree. Each of these methods behaves differently.

The clear() method is the most straightforward. It simply removes the contents of the specified Tag element.
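
As a quick sketch of what "removes the contents" means, the snippet below counts a tag's children through its .contents list before and after calling clear(). The markup, the class attribute, and the variable names are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption): a <p> with a string and an <i> child
soup = BeautifulSoup('<p class="note">Hello <i>World</i></p>', "html.parser")
tag = soup.p

children_before = len(tag.contents)  # the string "Hello " and the <i> tag
tag.clear()
children_after = len(tag.contents)   # nothing is left inside <p>

print(children_before, children_after)  # 2 0
print(soup)                             # <p class="note"></p>
```

The tag itself and its attributes stay in the tree; only its children are removed.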
The following example shows its usage.

Example

    from bs4 import BeautifulSoup

    markup = "<b>Excellent </b><u>from TutorialsPoint</u>"
    soup = BeautifulSoup(markup, "html.parser")

    tag = soup.find("u")
    tag.clear()
    print(soup.prettify())

Output

    <b>
     Excellent
    </b>
    <u>
    </u>

As the output shows, the clear() method removes the contents and keeps the tag itself intact.

In the next example, we parse the following HTML document, saved as index.html, and call the clear() method on all tags.

    <html>
    <body>
    <p> The quick, brown fox jumps over a lazy dog.</p>
    <p> DJs flock by when MTV ax quiz prog.</p>
    <p> Junk MTV quiz graced by fox whelps.</p>
    <p> Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>

Here is the Python code using the clear() method. Since find_all() returns the <html> tag first, clearing it already removes everything nested inside it.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")

    # find_all() with no arguments returns every tag in the document
    tags = soup.find_all()
    for tag in tags:
        tag.clear()

    print(soup.prettify())

Output

    <html>
    </html>

The extract() method removes either a tag or a string from the document tree, and returns the object that was removed. In the following example, find_all() collects every tag before the loop starts, so each nested tag is still extracted and printed even after its parent has been removed.

Example

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")

    tags = soup.find_all()
    for tag in tags:
        # extract() detaches the tag from the tree and returns it
        obj = tag.extract()
        print("Extracted:", obj)

    print(soup)

Output

    Extracted: <html>
    <body>
    <p> The quick, brown fox jumps over a lazy dog.</p>
    <p> DJs flock by when MTV ax quiz prog.</p>
    <p> Junk MTV quiz graced by fox whelps.</p>
    <p> Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    </html>
    Extracted: <body>
    <p> The quick, brown fox jumps over a lazy dog.</p>
    <p> DJs flock by when MTV ax quiz prog.</p>
    <p> Junk MTV quiz graced by fox whelps.</p>
    <p> Bawds jog, flick quartz, vex nymphs.</p>
    </body>
    Extracted: <p> The quick, brown fox jumps over a lazy dog.</p>
    Extracted: <p> DJs flock by when MTV ax quiz prog.</p>
    Extracted: <p> Junk MTV quiz graced by fox whelps.</p>
    Extracted: <p> Bawds jog, flick quartz, vex