Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
The Beautiful Soup package is not part of Python's standard library, so it must be installed separately. Before installing the latest version, let us create a virtual environment, as Python recommends.
A virtual environment gives a specific project its own isolated working copy of Python, so that packages installed for the project do not affect the system-wide setup.
We shall use the venv module from Python's standard library to create the virtual environment. pip is included by default in Python 3.4 and later.
Use the following command to create a virtual environment on Windows:
C:\Users\user>python -m venv myenv
On Ubuntu Linux, update the APT package index and, if required, install venv before creating the virtual environment:
mvl@GNVBGL3:~ $ sudo apt update && sudo apt upgrade -y
mvl@GNVBGL3:~ $ sudo apt install python3-venv
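Then create the virtual environment as your normal user. Running this command with sudo would leave the environment owned by root, so pip could not install into it later without sudo.
mvl@GNVBGL3:~ $ python3 -m venv myenv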
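The same kind of environment can also be created from Python itself, which may be handy in setup scripts. The sketch below uses the standard library's venv.create function; the helper name create_env is an assumption made for illustration, and the result is roughly, not exactly, what the command line produces (for example, the defaults for symlinking differ).

```python
import venv
from pathlib import Path


def create_env(env_dir: str, with_pip: bool = True) -> Path:
    """Create a virtual environment in env_dir and return its absolute path.

    create_env is a hypothetical helper name, not part of any library.
    """
    # venv.create is the programmatic counterpart of `python -m venv`.
    # with_pip=True asks it to bootstrap pip into the new environment,
    # which the command line does by default.
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir).resolve()
```

Calling create_env("myenv") from the current directory would create the myenv folder used in the rest of this chapter.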
You need to activate the virtual environment. On Windows, use the commands:
C:\Users\user>cd myenv
C:\Users\user\myenv>Scripts\activate
(myenv) C:\Users\user\myenv>
On Ubuntu Linux, use the following commands to activate the virtual environment:
mvl@GNVBGL3:~$ cd myenv
mvl@GNVBGL3:~/myenv$ source bin/activate
(myenv) mvl@GNVBGL3:~/myenv$
The name of the virtual environment appears in parentheses at the start of the prompt. Now that it is activated, we can install Beautiful Soup in it.
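Besides looking at the prompt, you can confirm from inside Python that the interpreter belongs to a virtual environment. A minimal sketch, assuming an environment created with venv; the function name in_virtualenv is made up for this example.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


print(in_virtualenv())
```

If this prints False, activate the environment again before running the install command that follows.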
(myenv) mvl@GNVBGL3:~/myenv$ pip3 install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 143.0/143.0 KB 325.2 kB/s eta 0:00:00
Collecting soupsieve>1.2
  Downloading soupsieve-2.4.1-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.2 soupsieve-2.4.1
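Note that 4.12.2 is the version of beautifulsoup4 installed in this example. Newer releases may be available, and the minimum supported Python version depends on the release, so check the project's page on PyPI before installing on an older Python.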
If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.
(myenv) mvl@GNVBGL3:~/myenv$ python setup.py install
To check that Beautiful Soup is properly installed, enter the following commands in the Python interpreter −
>>> import bs4
>>> bs4.__version__
'4.12.2'
If the installation was not successful, the import statement raises ModuleNotFoundError.
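The same check can be scripted without importing the package at all. The sketch below uses the standard library's importlib.metadata (available from Python 3.8); installed_version is a hypothetical helper name. Note that it takes the distribution name used with pip (beautifulsoup4), not the import name (bs4).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        # version() looks up the metadata that pip recorded at install time.
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version("beautifulsoup4"))
```

A result of None means the package is not installed in the interpreter that ran the script, which often indicates that the virtual environment was not active.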
You will also need to install the requests library, an HTTP library for Python.
pip3 install requests
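To illustrate how the two libraries fit together, here is a minimal sketch, assuming both packages are installed: requests downloads the page and Beautiful Soup parses it. The function names title_from_html and fetch_title are invented for this example, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def title_from_html(markup: str):
    """Return the text of the first <title> element, or None if there is none."""
    soup = BeautifulSoup(markup, "html.parser")
    # soup.title is None when the document has no <title> element.
    return soup.title.string if soup.title is not None else None


def fetch_title(url: str):
    """Download a page with requests and return its title."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return title_from_html(response.text)
```

fetch_title needs network access and a URL of your choosing; title_from_html can be tried on any HTML string.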
Installing a Parser
By default, Beautiful Soup uses the HTML parser included in Python's standard library. It also supports several third-party Python parsers, such as the lxml parser and the html5lib parser.
To install the lxml or html5lib parser, use the commands:
pip3 install lxml
pip3 install html5lib
These parsers have their advantages and disadvantages as shown below −
Parser: Python's html.parser
Usage − BeautifulSoup(markup, "html.parser")
Advantages
- Batteries included
- Decent speed
- Lenient (As of Python 3.2)
Disadvantages
- Not as fast as lxml
- Less lenient than html5lib
Parser: lxml's HTML parser
Usage − BeautifulSoup(markup, "lxml")
Advantages
- Very fast
- Lenient
Disadvantages
- External C dependency
Parser: lxml's XML parser
Usage − BeautifulSoup(markup, "lxml-xml")
Or BeautifulSoup(markup, "xml")
Advantages
- Very fast
- The only currently supported XML parser
Disadvantages
- External C dependency
Parser: html5lib
Usage − BeautifulSoup(markup, "html5lib")
Advantages
- Extremely lenient
- Parses pages the same way a web browser does
- Creates valid HTML5
Disadvantages
- Very slow
- External Python dependency
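Since lxml and html5lib are optional, a script may want to use one of them when it is installed and fall back to html.parser otherwise. The following is one possible sketch using only the standard library; pick_parser is a hypothetical helper, and the preference order (lxml first, then html5lib) is an assumption that favours speed.

```python
from importlib.util import find_spec


def pick_parser(preferred=("lxml", "html5lib")) -> str:
    """Return the name of the first installed parser, else "html.parser"."""
    for name in preferred:
        # find_spec returns None when a top-level module cannot be found.
        if find_spec(name) is not None:
            return name
    # html.parser ships with Python, so it is always available.
    return "html.parser"


print(pick_parser())
```

The returned name can be passed straight to the constructor, as in BeautifulSoup(markup, pick_parser()). Keep in mind that different parsers can build different trees from the same malformed markup, so pinning one parser explicitly gives more reproducible results.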