jsoup – Loading File

The following example shows how to load an HTML file from disk and read data from the parsed document.

Syntax

    File input = new File(path);
    Document document = Jsoup.parse(input, "UTF-8");

Where

document − Document object that represents the HTML DOM.
Jsoup − main class used to parse the given HTML file.
input − File object that points to the HTML file to load.
"UTF-8" − character set used to decode the file.

Description

The parse(File in, String charsetName) method reads the file, decodes it with the given character set, and parses its content into a new Document.

Example

Create the following Java program using any editor of your choice in, say, C:\jsoup.

JsoupTester.java

    import java.io.File;
    import java.io.IOException;
    import java.net.URISyntaxException;
    import java.net.URL;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupTester {
       public static void main(String[] args) throws IOException, URISyntaxException {
          // Locate test.htm on the classpath and convert its URL to a File.
          URL path = ClassLoader.getSystemResource("test.htm");
          File input = new File(path.toURI());

          // Parse the file as UTF-8 and print the page title.
          Document document = Jsoup.parse(input, "UTF-8");
          System.out.println(document.title());
       }
    }

test.htm

Create the following test.htm file in the C:\jsoup folder.

    <html>
       <head>
          <title>Sample Title</title>
       </head>
       <body>
          <p>Sample Content</p>
       </body>
    </html>

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Sample Title
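The file-based parse call can also be tried without placing a file on the classpath. The following is a minimal sketch, assuming a temporary file is an acceptable stand-in for test.htm; the class name LoadFromPathSketch and the helper titleViaTempFile are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical class name, used only for this sketch.
final class LoadFromPathSketch {

    // Writes the HTML to a temporary file, parses the file with jsoup,
    // and returns the document title.
    static String titleViaTempFile(String html) {
        try {
            Path tmp = Files.createTempFile("jsoup-sample", ".htm");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                // Same call as in the tutorial: parse(File, charsetName).
                Document document = Jsoup.parse(tmp.toFile(), "UTF-8");
                return document.title();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Sample Title</title></head>"
                + "<body><p>Sample Content</p></body></html>";
        System.out.println(titleViaTempFile(html));
    }
}
```

Wrapping the IOException keeps the helper easy to call from other code; a real program may prefer to let the checked exception propagate, as the tutorial example does.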

jsoup – Environment Setup

Step 1: Verify Java Installation in Your Machine

First, open the console and run the java command for the operating system you are working on.

OS        Task                    Command
Windows   Open Command Console    c:\> java -version
Linux     Open Command Terminal   $ java -version
Mac       Open Terminal           machine:~ joseph$ java -version

The output should be similar on Windows, Linux, and Mac:

    java 11.0.11 2021-04-20 LTS
    Java(TM) SE Runtime Environment 18.9 (build 11.0.11+9-LTS-194)
    Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.11+9-LTS-194, mixed mode)

If you do not have Java installed on your system, download the Java Software Development Kit (SDK) from www.oracle.com/technetwork/java/javase/downloads/index.html. This tutorial assumes Java 11.0.11 as the installed version.

Step 2: Set JAVA Environment

Set the JAVA_HOME environment variable to point to the base directory where Java is installed on your machine. For example:

OS        Setting
Windows   Set the environment variable JAVA_HOME to C:\Program Files\Java\jdk11.0.11
Linux     export JAVA_HOME=/usr/local/java-current
Mac       export JAVA_HOME=/Library/Java/Home

Append the Java compiler location to the system path.

OS        Setting
Windows   Append the string C:\Program Files\Java\jdk11.0.11\bin to the end of the system variable Path.
Linux     export PATH=$PATH:$JAVA_HOME/bin/
Mac       Not required

Verify the Java installation using the command java -version as explained above.

Step 3: Download jsoup Archive

Download the latest version of the jsoup jar file from the Maven Repository. At the time of writing this tutorial, we downloaded jsoup-1.14.3.jar and copied it into the C:\jsoup folder.

OS        Archive name
Windows   jsoup-1.14.3.jar
Linux     jsoup-1.14.3.jar
Mac       jsoup-1.14.3.jar

Step 4: Set jsoup Environment

Set the JSOUP_HOME environment variable to point to the base directory where the jsoup jar is stored on your machine. Let's assume jsoup-1.14.3.jar is stored in the JSOUP folder.

Sr.No   OS        Description
1       Windows   Set the environment variable JSOUP_HOME to C:\JSOUP
2       Linux     export JSOUP_HOME=/usr/local/JSOUP
3       Mac       export JSOUP_HOME=/Library/JSOUP

Step 5: Set CLASSPATH Variable

Set the CLASSPATH environment variable to point to the jsoup jar location.

Sr.No   OS        Description
1       Windows   Set the environment variable CLASSPATH to %CLASSPATH%;%JSOUP_HOME%\jsoup-1.14.3.jar;.;
2       Linux     export CLASSPATH=$CLASSPATH:$JSOUP_HOME/jsoup-1.14.3.jar:.
3       Mac       export CLASSPATH=$CLASSPATH:$JSOUP_HOME/jsoup-1.14.3.jar:.
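Once the CLASSPATH is set, a small program can confirm that the jsoup jar is actually visible to the compiler and the JVM. This is only a smoke-test sketch; the class name ClasspathCheck and the sample title text are assumptions made for illustration.

```java
import org.jsoup.Jsoup;

// Hypothetical class name, used only for this smoke test.
final class ClasspathCheck {

    // Parses a tiny in-memory document and returns its title.
    static String parsedTitle() {
        String html = "<html><head><title>jsoup is on the classpath</title></head>"
                + "<body></body></html>";
        return Jsoup.parse(html).title();
    }

    public static void main(String[] args) {
        System.out.println(parsedTitle());
    }
}
```

If the jar is not on the classpath, javac typically stops with an error saying that the package org.jsoup does not exist, which points back to Step 5.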

jsoup – Sanitize HTML

The following example shows how to sanitize HTML to prevent cross-site scripting (XSS) attacks.

Syntax

    String safeHtml = Jsoup.clean(html, Safelist.basic());

Where

Jsoup − main class used to parse the given HTML String.
html − initial HTML String.
safeHtml − cleaned HTML.
Safelist − object that provides default configurations to safeguard HTML.
clean() − cleans the HTML using the given Safelist.

Description

The Jsoup class sanitizes HTML using a Safelist configuration. Tags and attributes that the Safelist does not allow, such as the onclick handler in the example below, are removed from the output.

Example

Create the following Java program using any editor of your choice in, say, C:\jsoup.

JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.safety.Safelist;

    public class JsoupTester {
       public static void main(String[] args) {
          // Untrusted HTML containing an inline event handler.
          String html = "<p><a href='http://example.com/'"
             + " onclick='checkData()'>Link</a></p>";
          System.out.println("Initial HTML: " + html);

          // Keep only what Safelist.basic() allows.
          String safeHtml = Jsoup.clean(html, Safelist.basic());
          System.out.println("Cleaned HTML: " + safeHtml);
       }
    }

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Initial HTML: <p><a href='http://example.com/' onclick='checkData()'>Link</a></p>
    Cleaned HTML: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
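How much markup survives cleaning depends on the Safelist that is passed in. The sketch below compares Safelist.basic() with the stricter Safelist.none() on the same input and uses Jsoup.isValid() to check whether the input was already acceptable. The class name SafelistComparison and its helper methods are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical class name, used only for this comparison.
final class SafelistComparison {

    static final String HTML =
            "<p><a href='http://example.com/' onclick='checkData()'>Link</a></p>";

    // Keeps basic text formatting and links, drops the onclick handler.
    static String cleanBasic() {
        return Jsoup.clean(HTML, Safelist.basic());
    }

    // Allows no tags at all, so only the text content is kept.
    static String cleanNone() {
        return Jsoup.clean(HTML, Safelist.none());
    }

    // Reports whether the input already satisfies Safelist.basic().
    static boolean isAlreadySafe() {
        return Jsoup.isValid(HTML, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println("basic(): " + cleanBasic());
        System.out.println("none():  " + cleanNone());
        System.out.println("valid before cleaning: " + isAlreadySafe());
    }
}
```

Choosing the strictest Safelist that still fits the use case is usually the safer default; a comment field, for example, may need nothing beyond plain text.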

jsoup – Discussion

jsoup is a Java-based library for working with HTML content. It provides a convenient API to extract and manipulate data, using the best of DOM, CSS, and jQuery-like methods. It implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do. This reference takes you through the simple and practical methods available in the jsoup library.

jsoup – Set HTML

The following example shows how to set, prepend, or append HTML in a DOM element after parsing an HTML String into a Document object.

Syntax

    Document document = Jsoup.parse(html);
    Element div = document.getElementById("sampleDiv");
    div.html("<p>This is a sample content.</p>");
    div.prepend("<p>Initial Text</p>");
    div.append("<p>End Text</p>");

Where

document − Document object that represents the HTML DOM.
Jsoup − main class used to parse the given HTML String.
html − HTML String.
div − Element object that represents the div element with the id "sampleDiv".
div.html() − html(content) method replaces the element's inner HTML with the given value.
div.prepend() − prepend(content) method adds the content at the start of the element's inner HTML.
div.append() − append(content) method adds the content at the end of the element's inner HTML.

Description

An Element object represents a DOM element and provides methods to set, prepend, or append HTML inside that element. The element itself, including its tag and attributes, stays in place.

Example

Create the following Java program using any editor of your choice in, say, C:\jsoup.

JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupTester {
       public static void main(String[] args) {
          String html = "<html><head><title>Sample Title</title></head>"
             + "<body>"
             + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"
             + "</body></html>";
          Document document = Jsoup.parse(html);

          Element div = document.getElementById("sampleDiv");
          System.out.println("Outer HTML Before Modification :\n" + div.outerHtml());

          // Replace everything inside the div.
          div.html("<p>This is a sample content.</p>");
          System.out.println("Outer HTML After Modification :\n" + div.outerHtml());

          // Insert a new first child.
          div.prepend("<p>Initial Text</p>");
          System.out.println("After Prepend :\n" + div.outerHtml());

          // Insert a new last child.
          div.append("<p>End Text</p>");
          System.out.println("After Append :\n" + div.outerHtml());
       }
    }

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Outer HTML Before Modification :
    <div id="sampleDiv">
     <a id="googleA" href="www.google.com">Google</a>
    </div>
    Outer HTML After Modification :
    <div id="sampleDiv">
     <p>This is a sample content.</p>
    </div>
    After Prepend :
    <div id="sampleDiv">
     <p>Initial Text</p>
     <p>This is a sample content.</p>
    </div>
    After Append :
    <div id="sampleDiv">
     <p>Initial Text</p>
     <p>This is a sample content.</p>
     <p>End Text</p>
    </div>
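The same sequence of calls can be checked structurally instead of by reading printed markup. The sketch below inspects the div's id and child elements after html(), prepend(), and append() to show that only the inside of the element changes. The class name InnerHtmlSketch and the helper modifiedDiv are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name, used only for this sketch.
final class InnerHtmlSketch {

    // Applies html(), prepend(), and append() to the sample div and returns it.
    static Element modifiedDiv() {
        Document document = Jsoup.parse(
                "<div id='sampleDiv'><a href='www.google.com'>Google</a></div>");
        Element div = document.getElementById("sampleDiv");
        div.html("<p>This is a sample content.</p>"); // replaces the <a> child
        div.prepend("<p>Initial Text</p>");           // becomes the first child
        div.append("<p>End Text</p>");                // becomes the last child
        return div;
    }

    public static void main(String[] args) {
        Element div = modifiedDiv();
        System.out.println("id: " + div.id());
        System.out.println("children: " + div.children().size());
        System.out.println("first: " + div.child(0).text());
        System.out.println("last: " + div.child(2).text());
    }
}
```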

jsoup – Extract Text

The following example shows how to get the text of an element after parsing an HTML String into a Document object.

Syntax

    Document document = Jsoup.parse(html);
    Element link = document.select("a").first();
    System.out.println("Text: " + link.text());

Where

document − Document object that represents the HTML DOM.
Jsoup − main class used to parse the given HTML String.
html − HTML String.
link − Element object that represents the anchor tag.
link.text() − text() method retrieves the element's text.

Description

An Element object represents a DOM element and provides methods to get the text of that element.

Example

Create the following Java program using any editor of your choice in, say, C:\jsoup.

JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupTester {
       public static void main(String[] args) {
          String html = "<html><head><title>Sample Title</title></head>"
             + "<body>"
             + "<p>Sample Content</p>"
             + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
             + "<h3><a>Sample</a></h3>"
             + "</div>"
             + "</body></html>";
          Document document = Jsoup.parse(html);

          // First anchor in document order (the one with an href).
          Element link = document.select("a").first();
          System.out.println("Text: " + link.text());
       }
    }

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Text: Google
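The text() method combines the text of an element with the text of all its descendants. When only the element's own text is wanted, jsoup also offers ownText(). The sketch below contrasts the two on a small paragraph; the class name TextVersusOwnText and the sample markup are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical class name, used only for this comparison.
final class TextVersusOwnText {

    // Returns the <p> element of a small sample document.
    static Element paragraph() {
        return Jsoup.parse("<p>Hello <b>there</b> now!</p>").selectFirst("p");
    }

    public static void main(String[] args) {
        Element p = paragraph();
        // Text of the element and all of its descendants.
        System.out.println("text():    " + p.text());    // Hello there now!
        // Text owned directly by the element, skipping the <b> child.
        System.out.println("ownText(): " + p.ownText()); // Hello now!
    }
}
```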

jsoup – Set Text Content

The following example shows how to set, prepend, or append text in a DOM element after parsing an HTML String into a Document object.

Syntax

    Document document = Jsoup.parse(html);
    Element div = document.getElementById("sampleDiv");
    div.text("This is a sample content.");
    div.prepend("Initial Text.");
    div.append("End Text.");

Where

document − Document object that represents the HTML DOM.
Jsoup − main class used to parse the given HTML String.
html − HTML String.
div − Element object that represents the div element with the id "sampleDiv".
div.text() − text(content) method replaces the element's content with the given text.
div.prepend() − prepend(content) method adds the content at the start of the element's content.
div.append() − append(content) method adds the content at the end of the element's content.

Description

An Element object represents a DOM element and provides methods to set, prepend, or append text inside that element.

Example

Create the following Java program using any editor of your choice in, say, C:\jsoup.

JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupTester {
       public static void main(String[] args) {
          String html = "<html><head><title>Sample Title</title></head>"
             + "<body>"
             + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"
             + "</body></html>";
          Document document = Jsoup.parse(html);

          Element div = document.getElementById("sampleDiv");
          System.out.println("Outer HTML Before Modification :\n" + div.outerHtml());

          // Replace the div's content with plain text.
          div.text("This is a sample content.");
          System.out.println("Outer HTML After Modification :\n" + div.outerHtml());

          // Add text before the existing content.
          div.prepend("Initial Text.");
          System.out.println("After Prepend :\n" + div.outerHtml());

          // Add text after the existing content.
          div.append("End Text.");
          System.out.println("After Append :\n" + div.outerHtml());
       }
    }

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Outer HTML Before Modification :
    <div id="sampleDiv">
     <a id="googleA" href="www.google.com">Google</a>
    </div>
    Outer HTML After Modification :
    <div id="sampleDiv">
     This is a sample content.
    </div>
    After Prepend :
    <div id="sampleDiv">
     Initial Text.This is a sample content.
    </div>
    After Append :
    <div id="sampleDiv">
     Initial Text.This is a sample content.End Text.
    </div>
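One practical difference between text(content) and html(content) is that text() treats its argument as plain text, so any markup in it is escaped rather than parsed. The sketch below illustrates this with a string that looks like HTML; the class name TextIsEscaped and the helper divWithText are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical class name, used only for this sketch.
final class TextIsEscaped {

    // Sets the given content as plain text on an empty div and returns the div.
    static Element divWithText(String content) {
        Element div = Jsoup.parse("<div id='sampleDiv'></div>")
                .getElementById("sampleDiv");
        div.text(content);
        return div;
    }

    public static void main(String[] args) {
        Element div = divWithText("<b>bold</b>");
        // No <b> element is created; the markup is stored as literal text.
        System.out.println("child elements: " + div.children().size());
        System.out.println("text(): " + div.text());
        System.out.println("html(): " + div.html());
    }
}
```

This makes text() a reasonable default for inserting user-supplied strings, while html(), prepend(), and append() are better reserved for markup that is already trusted.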

jsoup – Using DOM Methods

The following example shows how to use DOM-like methods after parsing an HTML String into a Document object.

Syntax

    Document document = Jsoup.parse(html);
    Element sampleDiv = document.getElementById("sampleDiv");
    Elements links = sampleDiv.getElementsByTag("a");

Where

document − Document object that represents the HTML DOM.
Jsoup − main class used to parse the given HTML String.
html − HTML String.
sampleDiv − Element object that represents the element identified by the id "sampleDiv".
links − Elements object that represents the elements identified by the tag "a".

Description

The parse(String html) method parses the input HTML into a new Document. This Document object can be used to traverse the HTML DOM and get details from it.

Example

Create the following Java program using any editor of your choice in, say, C:\jsoup.

JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTester {
       public static void main(String[] args) {
          String html = "<html><head><title>Sample Title</title></head>"
             + "<body>"
             + "<p>Sample Content</p>"
             + "<div id='sampleDiv'><a href='www.google.com'>Google</a></div>"
             + "</body></html>";
          Document document = Jsoup.parse(html);
          System.out.println(document.title());

          // Print the text of every paragraph.
          Elements paragraphs = document.getElementsByTag("p");
          for (Element paragraph : paragraphs) {
             System.out.println(paragraph.text());
          }

          // Look up a single element by id.
          Element sampleDiv = document.getElementById("sampleDiv");
          System.out.println("Data: " + sampleDiv.text());

          // Print the href and text of every link inside the div.
          Elements links = sampleDiv.getElementsByTag("a");
          for (Element link : links) {
             System.out.println("Href: " + link.attr("href"));
             System.out.println("Text: " + link.text());
          }
       }
    }

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Sample Title
    Sample Content
    Data: Google
    Href: www.google.com
    Text: Google
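The DOM lookups behave differently when nothing matches: getElementById() returns null, while getElementsByTag() returns an empty Elements collection. The sketch below wraps both lookups defensively; the class name SafeLookup and its helper methods are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name, used only for this sketch.
final class SafeLookup {

    // Small sample document with one div and one link.
    static Document sample() {
        return Jsoup.parse(
                "<div id='sampleDiv'><a href='www.google.com'>Google</a></div>");
    }

    // Returns the element's text, or an empty string when the id is absent.
    static String textById(Document document, String id) {
        Element element = document.getElementById(id);
        return element == null ? "" : element.text();
    }

    // Returns how many elements carry the given tag (0 when none match).
    static int countByTag(Document document, String tag) {
        return document.getElementsByTag(tag).size();
    }

    public static void main(String[] args) {
        Document document = sample();
        System.out.println("sampleDiv: " + textById(document, "sampleDiv"));
        System.out.println("missing: [" + textById(document, "missing") + "]");
        System.out.println("tables: " + countByTag(document, "table"));
    }
}
```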

jsoup – Overview

jsoup is a Java-based library for working with HTML content. It provides a convenient API to extract and manipulate data, using the best of DOM, CSS, and jQuery-like methods. It implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.

The jsoup library provides the following functionalities.

Multiple Read Support − It reads and parses HTML from a URL, file, or string.
CSS Selectors − It can find and extract data using DOM traversal or CSS selectors.
DOM Manipulation − It can manipulate HTML elements, attributes, and text.
Prevent XSS attacks − It can clean user-submitted content against a given safelist to prevent XSS attacks.
Tidy − It outputs tidy HTML.
Handles invalid data − jsoup can handle unclosed tags and implicit tags, and can reliably create the document structure.
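The handling of invalid data listed above can be seen with a very small input. The sketch below parses a fragment with two unclosed paragraph tags and no html, head, or body tags; the class name LenientParsing and the sample markup are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical class name, used only for this sketch.
final class LenientParsing {

    // Parses markup with unclosed <p> tags and no surrounding document structure.
    static Document parseBroken() {
        return Jsoup.parse("<p>One<p>Two");
    }

    public static void main(String[] args) {
        Document document = parseBroken();
        // The parser closes the first <p> when the second one starts.
        System.out.println("paragraphs: " + document.select("p").size());
        // The full document structure is created around the fragment.
        System.out.println(document.outerHtml());
    }
}
```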

jsoup – Extract Attributes

The following example shows how to get an attribute of a DOM element after parsing an HTML String into a Document object.

Syntax

    Document document = Jsoup.parse(html);
    Element link = document.select("a").first();
    System.out.println("Href: " + link.attr("href"));

Where

document − Document object that represents the HTML DOM.
Jsoup − main class used to parse the given HTML String.
html − HTML String.
link − Element object that represents the anchor tag.
link.attr() − attr(attribute) method retrieves the value of the given attribute.

Description

An Element object represents a DOM element and provides methods to get the attributes of that element.

Example

Create the following Java program using any editor of your choice in, say, C:\jsoup.

JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupTester {
       public static void main(String[] args) {
          String html = "<html><head><title>Sample Title</title></head>"
             + "<body>"
             + "<p>Sample Content</p>"
             + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
             + "<h3><a>Sample</a></h3>"
             + "</div>"
             + "</body></html>";
          Document document = Jsoup.parse(html);

          // First anchor in document order (the one with an href).
          Element link = document.select("a").first();
          System.out.println("Href: " + link.attr("href"));
       }
    }

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Href: www.google.com
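The attr() method returns the attribute value exactly as written in the markup, and an empty string when the attribute is absent. When a base URI is supplied at parse time, absUrl() can resolve a relative href into an absolute URL. The sketch below shows both; the class name AttributeLookup, the base URI http://example.com/, and the sample link are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name, used only for this sketch.
final class AttributeLookup {

    // Parses a link with a relative href against an assumed base URI.
    static Element firstLink() {
        Document document = Jsoup.parse(
                "<a href='/docs/page.html'>Docs</a>", "http://example.com/");
        return document.select("a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink();
        System.out.println("attr:   " + link.attr("href"));    // value as written
        System.out.println("absUrl: " + link.absUrl("href"));  // resolved against the base URI
        System.out.println("has title: " + link.hasAttr("title"));
        System.out.println("title: [" + link.attr("title") + "]"); // empty when absent
    }
}
```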