Beautiful Soup – Modifying the Tree




One of the powerful features of the Beautiful Soup library is the ability to manipulate a parsed HTML or XML document and modify its contents.

The Beautiful Soup library has different methods to perform the following operations −

  • Add contents or a new tag to an existing tag of the document

  • Insert contents before or after an existing tag or string

  • Clear the contents of an already existing tag

  • Modify the contents of a tag element

Add content

You can add to the content of an existing tag by using the append() method on a Tag object. It works like the append() method of Python's list object.

In the following example, the HTML markup has a <p> tag. With append(), additional text is added at the end of its contents.

Example


from bs4 import BeautifulSoup

markup = '<p>Hello</p>'
soup = BeautifulSoup(markup, 'html.parser')
print(soup)
tag = soup.p

# add a string after the existing contents of <p>
tag.append(" World")
print(soup)

Output


<p>Hello</p>
<p>Hello World</p>
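The comparison with a Python list can be made visible through the tag's contents attribute. The following is a minimal sketch; the variable names children and text are chosen here for illustration only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")
tag = soup.p

# append() adds one new child node after the existing children
tag.append(" World")

children = list(tag.contents)   # the two string nodes are kept separate
text = tag.get_text()           # get_text() joins them into one string

print(children)
print(text)
```

The appended string is stored as a second child of the tag and is not merged into the existing string, which is why prettify() prints each string on its own line.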

With the append() method, you can also add a new tag at the end of an existing tag. First create a new Tag object with the new_tag() method of the BeautifulSoup object, and then pass it to the append() method.

Example


from bs4 import BeautifulSoup

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b
tag1 = soup.new_tag('i')   # create a new, empty <i> tag
tag1.string = 'World'
tag.append(tag1)           # <i> becomes the last child of <b>
print(soup.prettify())

Output


<b>
   Hello
   <i>
      World
   </i>
</b>

If you have to add a string to the document, you can append a NavigableString object.

Example


from bs4 import BeautifulSoup, NavigableString

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b
new_string = NavigableString(" World")
tag.append(new_string)
print(soup.prettify())

Output


<b>
   Hello
   World
</b>

From Beautiful Soup version 4.7 onwards, the Tag class also has an extend() method. It adds all the elements of a list to the tag, in order.

Example


from bs4 import BeautifulSoup

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.b
vals = ['World.', 'Welcome to ', 'TutorialsPoint']
tag.extend(vals)   # each list item becomes a separate child of <b>
print(soup.prettify())

Output


<b>
   Hello
   World.
   Welcome to
   TutorialsPoint
</b>
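The list passed to extend() does not have to contain only strings. As a small sketch, assuming Beautiful Soup 4.7 or later, the list below mixes a string with a tag created by new_tag(); the names italic and result are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Hello</b>", "html.parser")
tag = soup.b

# build a new <i> tag to add alongside a plain string
italic = soup.new_tag("i")
italic.string = "World"

# extend() appends every item of the list, in order
tag.extend([", ", italic])

result = str(soup)
print(result)
```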

Insert Contents

Instead of adding a new element at the end, you can use the insert() method to add an element at a given position in the list of children of a Tag element. The insert() method in Beautiful Soup behaves similarly to insert() on a Python list object.

In the following example, a new string is added to the <b> tag at position 1. The resultant parsed document shows the result.

Example


from bs4 import BeautifulSoup

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

# position 1 is right after the existing string "Excellent "
tag.insert(1, "Tutorial ")
print(soup.prettify())

Output


<b>
   Excellent
   Tutorial
</b>
<u>
   from TutorialsPoint
</u>
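The position argument counts the children of the tag, whether they are strings or tags. The sketch below uses hypothetical markup with two <b> children to show one insertion between them and one at position 0.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><b>three</b></p>", "html.parser")
p = soup.p

# a new tag placed between the two existing <b> children (index 1)
new_tag = soup.new_tag("b")
new_tag.string = "two"
p.insert(1, new_tag)

# a string placed before every other child (index 0)
p.insert(0, "count: ")

result = str(soup)
print(result)
```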

Beautiful Soup also has insert_before() and insert_after() methods. Their respective purpose is to insert a tag or a string before or after a given Tag object. The following code shows that a string “Python Tutorial” is added after the <b> tag.

Example


from bs4 import BeautifulSoup

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b

# the string becomes the next sibling of <b>
tag.insert_after("Python Tutorial")
print(soup.prettify())

Output


<b>
   Excellent
</b>
Python Tutorial
<u>
   from TutorialsPoint
</u>

The insert_before() method, on the other hand, is used below to add the text “Here is an ” before the <b> tag.


tag.insert_before("Here is an ")
print (soup.prettify())

Output


Here is an
<b>
   Excellent
</b>
Python Tutorial
<u>
   from TutorialsPoint
</u>
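Both methods accept tags as well as strings, and each call places the new node immediately next to the element it is called on. The following sketch uses illustrative markup and variable names to show that ordering.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Excellent</b></p>", "html.parser")
bold = soup.b

label = soup.new_tag("i")
label.string = "Python"

bold.insert_before("An ")   # string directly before <b>
bold.insert_after(label)    # <i> directly after <b>
bold.insert_after(" ")      # lands between <b> and the <i> added above

result = str(soup)
print(result)
```

Because the second insert_after() call also targets the position right after <b>, the space ends up in front of the <i> tag that was inserted earlier.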

Clear the Contents

Beautiful Soup provides more than one way to remove the contents of an element from the document tree. Each of these methods behaves differently.

The clear() method is the most straightforward. It simply removes the contents of a specified Tag element. The following example shows its usage.

Example


from bs4 import BeautifulSoup

markup = '<b>Excellent </b><u>from TutorialsPoint</u>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.find('u')

tag.clear()   # removes the children of <u>, not the tag itself
print(soup.prettify())

Output


<b>
   Excellent
</b>
<u>
</u>

It can be seen that the clear() method removes the contents, keeping the tag intact.
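Keeping the tag intact also means keeping its attributes. A minimal sketch, using a hypothetical <div> with an id attribute:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="box"><p>Old</p> text</div>', "html.parser")
div = soup.div

# clear() removes every child: nested tags and strings alike
div.clear()

result = str(soup)
print(result)
print(div["id"])   # the attribute is still there
```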

In the next example, we parse the following HTML document and call the clear() method on all tags.


<html>
   <body>
      <p> The quick, brown fox jumps over a lazy dog.</p>
      <p> DJs flock by when MTV ax quiz prog.</p>
      <p> Junk MTV quiz graced by fox whelps.</p>
      <p> Bawds jog, flick quartz, vex nymphs.</p>
   </body>
</html>

Here is the Python code using the clear() method:

Example


from bs4 import BeautifulSoup

# index.html holds the HTML document shown above
with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')

tags = soup.find_all()
for tag in tags:
   tag.clear()
print(soup.prettify())

Output


<html>
</html>

The extract() method removes either a tag or a string from the document tree, and returns the object that was removed.

Example


from bs4 import BeautifulSoup

with open('index.html') as fp:
   soup = BeautifulSoup(fp, 'html.parser')

tags = soup.find_all()
for tag in tags:
   obj = tag.extract()   # extract() returns the removed element
   print("Extracted:", obj)

print(soup)

Output


Extracted: <html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
Extracted: <body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
Extracted: <p> The quick, brown fox jumps over a lazy dog.</p>
Extracted: <p> DJs flock by when MTV ax quiz prog.</p>
Extracted: <p> Junk MTV quiz graced by fox whelps.</p>
Extracted: <p> Bawds jog, flick quartz, vex nymphs.</p>
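Because extract() returns the removed object, that object can be placed somewhere else in the tree. The sketch below, with illustrative list markup, moves the first <li> to the end of its list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>first</li><li>second</li></ul>", "html.parser")

# remove the first <li>; the returned tag no longer has a parent
first = soup.li.extract()
detached = first.parent is None

# re-attach the extracted tag at the end of the list
soup.ul.append(first)

result = str(soup)
print(detached)
print(result)
```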

You can extract either a tag or a string. The following example shows a tag being extracted.

Example


from bs4 import BeautifulSoup

html = '''
   <ol id="HR">
   <li>Rani</li>
   <li>Ankita</li>
   </ol>
'''

soup = BeautifulSoup(html, 'html.parser')
obj = soup.find('ol')
obj.find_next().extract()   # find_next() returns the first <li> inside <ol>
print(soup)

Output


<ol id="HR">
   <li>Ankita</li>
</ol>

Change the extract() statement to remove only the inner text of the first <li> element.

Example


obj.find_next().string.extract()

Output


<ol id="HR">
   <li></li>
   <li>Ankita</li>
</ol>

There is another method decompose() that removes a tag from the tree, then completely destroys it and its contents −

Example


from bs4 import BeautifulSoup

html = '''
   <ol id="HR">
      <li>Rani</li>
      <li>Ankita</li>
   </ol>
'''

soup = BeautifulSoup(html, 'html.parser')
tag2 = soup.find('li')   # the first <li> element
tag2.decompose()
print(soup)
print(tag2.decomposed)

Output


<ol id="HR">

<li>Ankita</li>
</ol>
True

The decomposed property returns True or False – whether an element has been decomposed or not.
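To contrast decompose() with extract(), the following sketch removes one tag with each method and then checks the decomposed property. It assumes a Beautiful Soup release that provides this property; the markup and variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Keep <b>bold</b> <i>italic</i></p>", "html.parser")
bold = soup.b
italic = soup.i

removed = bold.extract()   # taken out of the tree, still usable
italic.decompose()         # taken out of the tree and destroyed

print(removed)             # the extracted tag keeps its contents
print(removed.decomposed)
print(italic.decomposed)
```

An extracted tag can be inspected or inserted again, whereas a decomposed tag should not be used any further.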

Modify the Contents

We shall look at the replace_with() method that allows contents of a tag to be replaced.

Just like a Python string, which is immutable, a NavigableString can't be modified in place. However, you can use replace_with() to replace the inner string of a tag with another one.

Example


from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')

tag = soup.h2
tag.string.replace_with("OnLine Tutorials Library")
print(tag.string)

Output


OnLine Tutorials Library
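replace_with() can also be called on a tag, in which case the whole element is swapped out and the replaced element is returned. A minimal sketch with illustrative markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>World</b></p>", "html.parser")

# build the replacement element
em = soup.new_tag("em")
em.string = "Python"

# swap <b> for <em>; the old <b> tag is returned
old = soup.b.replace_with(em)

result = str(soup)
print(result)
print(old)
```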

Here is another example to show the use of replace_with(). Two parsed documents can be combined if you pass a BeautifulSoup object as an argument to a method such as replace_with().

Example


from bs4 import BeautifulSoup

# the "xml" and "lxml" parsers require the lxml package to be installed
obj1 = BeautifulSoup("<book><title>Python</title></book>", features="xml")
obj2 = BeautifulSoup("<b>Beautiful Soup parser</b>", "lxml")

obj2.find('b').replace_with(obj1)
print(obj2)

Output


<html><body><book><title>Python</title></book></body></html>

The wrap() method wraps an element in the tag you specify. It returns the new wrapper.

Example


from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello Python</p>", 'html.parser')
tag = soup.p
newtag = soup.new_tag('b')
tag.string.wrap(newtag)   # wrap the string inside <p> in a new <b> tag

print(soup)

Output


<p><b>Hello Python</b></p>
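wrap() works on tags as well as strings. Since it returns the wrapper, the wrapper can be modified right away. The sketch below wraps a <p> tag in a <div>; the class name "box" is an arbitrary illustrative value.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello Python</p>", "html.parser")

# wrap the whole <p> element in a new <div>
wrapper = soup.p.wrap(soup.new_tag("div"))

# the returned wrapper is the new <div>, now part of the tree
wrapper["class"] = "box"

result = str(soup)
print(result)
```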

On the other hand, the unwrap() method replaces a tag with whatever's inside that tag. It's good for stripping out markup.

Example


from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>Python</b></p>", 'html.parser')
tag = soup.p
tag.b.unwrap()   # remove the <b> tag but keep its text

print(soup)

Output


<p>Hello Python</p>
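To strip several kinds of inline markup at once, unwrap() can be combined with find_all(). This is a small sketch with illustrative markup; find_all() returns a list-like result, so the tree can be modified while looping over it.

```python
from bs4 import BeautifulSoup

markup = "<p><b>Bold</b> and <i>italic</i> and <b>more</b></p>"
soup = BeautifulSoup(markup, "html.parser")

# remove every <b> and <i> tag but keep the text inside them
for tag in soup.find_all(["b", "i"]):
    tag.unwrap()

result = str(soup)
print(result)
```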
