jsoup – Using Selector Syntax

The following example shows how to use selector methods after parsing an HTML String into a Document object. jsoup supports selectors similar to CSS selectors.

Syntax

    Document document = Jsoup.parse(html);
    Elements links = document.select("a[href]");
    Element headerDiv = document.select("div.header").first();

Where

document − the Document object that represents the HTML DOM.
Jsoup − main class used to parse the given HTML String.
html − the HTML String.
links − Elements object holding every element matched by the selector "a[href]".
headerDiv − Element object for the first element matched by the selector "div.header".

Description

The document.select(expression) method parses the given CSS selector expression and returns the matching HTML DOM elements.

Example

Create the following Java program using any editor of your choice in, say, C:\> jsoup.

JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTester {
       public static void main(String[] args) {

          String html = "<html><head><title>Sample Title</title></head>"
             + "<body>"
             + "<p>Sample Content</p>"
             + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
             + "<h3><a>Sample</a></h3>"
             + "</div>"
             + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
             + "<img name='yahoo' src='yahoo.jpg' />"
             + "</div>"
             + "</body></html>";
          Document document = Jsoup.parse(html);

          // a elements that have an href attribute
          Elements links = document.select("a[href]");
          for (Element link : links) {
             System.out.println("Href: " + link.attr("href"));
             System.out.println("Text: " + link.text());
          }

          // img elements whose src ends with .png
          Elements pngs = document.select("img[src$=.png]");
          for (Element png : pngs) {
             System.out.println("Name: " + png.attr("name"));
          }

          // first div with class=header
          Element headerDiv = document.select("div.header").first();
          System.out.println("Id: " + headerDiv.id());

          // a elements that are direct children of an h3
          Elements sampleLinks = document.select("h3 > a");
          for (Element link : sampleLinks) {
             System.out.println("Text: " + link.text());
          }
       }
    }

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Href: www.google.com
    Text: Google
    Name: google
    Id: imageDiv
    Text: Sample
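The selector forms used above can be combined with a few other common ones. The following is a minimal sketch, not part of the original program: the class name SelectorSketch, the helper count, and the sample markup are assumptions made for illustration, modelled on the example above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorSketch {
    // Hypothetical markup, modelled on the tutorial example.
    static final String HTML =
          "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
        + "<h3><a>Sample</a></h3></div>"
        + "<div id='imageDiv' class='header'>"
        + "<img name='google' src='google.png' />"
        + "<img name='yahoo' src='yahoo.jpg' /></div>";

    // Parses the sample markup and returns how many elements match the query.
    static int count(String cssQuery) {
        Document document = Jsoup.parse(HTML);
        return document.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Descendant combinator: every a anywhere inside #sampleDiv.
        System.out.println("#sampleDiv a     : " + count("#sampleDiv a"));
        // Child combinator: only a elements directly under #sampleDiv.
        System.out.println("#sampleDiv > a   : " + count("#sampleDiv > a"));
        // Negation: a elements without an href attribute.
        System.out.println("a:not([href])    : " + count("a:not([href])"));
        // Class plus descendant: img elements inside div.header.
        System.out.println("div.header img   : " + count("div.header img"));
        // Exact attribute value match.
        System.out.println("img[name=google] : " + count("img[name=google]"));
    }
}
```

The descendant form matches both links inside sampleDiv, while the child form matches only the link that sits directly under it, which is the same distinction the "h3 > a" selector relies on in the example.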

jsoup – Extract HTML

The following example shows how to get the inner HTML and outer HTML of an element after parsing an HTML String into a Document object.

Syntax

    Document document = Jsoup.parse(html);
    Element link = document.select("a").first();

    System.out.println("Outer HTML: " + link.outerHtml());
    System.out.println("Inner HTML: " + link.html());

Where

document − the Document object that represents the HTML DOM.
Jsoup − main class used to parse the given HTML String.
html − the HTML String.
link − Element object that represents the anchor tag.
link.outerHtml() − the outerHtml() method retrieves the complete HTML of the element, including its own tag.
link.html() − the html() method retrieves the inner HTML of the element.

Description

An Element object represents a DOM element and provides various methods to get the HTML of that element.

Example

Create the following Java program using any editor of your choice in, say, C:\> jsoup.

JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupTester {
       public static void main(String[] args) {

          String html = "<html><head><title>Sample Title</title></head>"
             + "<body>"
             + "<p>Sample Content</p>"
             + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
             + "<h3><a>Sample</a></h3>"
             + "</div>"
             + "</body></html>";
          Document document = Jsoup.parse(html);

          // first a element in the document
          Element link = document.select("a").first();

          System.out.println("Outer HTML: " + link.outerHtml());
          System.out.println("Inner HTML: " + link.html());
       }
    }

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Outer HTML: <a href="www.google.com">Google</a>
    Inner HTML: Google
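For a link that contains only text, html() and text() return the same value, so the difference is easier to see on an element with nested markup. The sketch below is an illustration under stated assumptions: the class name HtmlTextSketch, the helper firstParagraph, and the sample snippet are hypothetical, and pretty-printing is switched off so the serialized HTML carries no added whitespace.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HtmlTextSketch {
    // Parses a small hypothetical snippet and returns its first p element.
    static Element firstParagraph() {
        Document document = Jsoup.parse("<p id='msg'>Hello <b>jsoup</b></p>");
        // Turn off pretty-printing so no whitespace is added on output.
        document.outputSettings().prettyPrint(false);
        return document.select("p").first();
    }

    public static void main(String[] args) {
        Element paragraph = firstParagraph();
        // The element's own tag plus everything inside it.
        System.out.println("Outer HTML: " + paragraph.outerHtml());
        // Only the markup inside the element.
        System.out.println("Inner HTML: " + paragraph.html());
        // Only the text, with all tags stripped.
        System.out.println("Text: " + paragraph.text());
    }
}
```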

jsoup – Loading URL

The following example shows how to fetch an HTML page from the web using a url and then read its data.

Syntax

    String url = "http://www.google.com";
    Document document = Jsoup.connect(url).get();

Where

document − the Document object that represents the HTML DOM.
Jsoup − main class used to connect to the url and get the HTML.
url − url of the HTML page to load.

Description

The connect(url) method creates a connection to the url, and the get() method fetches the requested page and parses it into a Document.

Example

Create the following Java program using any editor of your choice in, say, C:\> jsoup.

JsoupTester.java

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupTester {
       public static void main(String[] args) throws IOException {

          String url = "http://www.google.com";
          Document document = Jsoup.connect(url).get();
          System.out.println(document.title());
       }
    }

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Google
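A live fetch can fail for reasons outside the program, such as an unreachable host or a slow server. The following sketch shows one way to set a user agent and a timeout on the connection and to handle the IOException that get() may throw. The class name FetchSketch, the helper fetchTitle, the user agent string, and the 5000 ms timeout are all assumptions chosen for illustration.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchSketch {
    // Returns the page title, or null if the page could not be fetched.
    static String fetchTitle(String url) {
        try {
            Document document = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (jsoup tutorial sketch)") // assumed value
                    .timeout(5000)                                     // milliseconds, assumed value
                    .get();
            return document.title();
        } catch (IOException e) {
            // Covers connection failures, timeouts, and HTTP error statuses.
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("http://www.google.com"));
    }
}
```

Returning null keeps the sketch short; a real program might prefer to rethrow or return an Optional.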

jsoup – Set Attributes

The following example shows how to set attributes of a DOM element, perform bulk updates, and add or remove classes after parsing an HTML String into a Document object.

Syntax

    Document document = Jsoup.parse(html);
    Element link = document.select("a").first();
    link.attr("href", "www.yahoo.com");
    link.addClass("header");
    link.removeClass("header");

Where

document − the Document object that represents the HTML DOM.
Jsoup − main class used to parse the given HTML String.
html − the HTML String.
link − Element object that represents the anchor tag.
link.attr() − the attr(attribute, value) method sets the element's attribute to the given value.
link.addClass() − the addClass(class) method adds the class to the element's class attribute.
link.removeClass() − the removeClass(class) method removes the class from the element's class attribute.

Description

An Element object represents a DOM element and provides various methods to set the attributes of that element.

Example

Create the following Java program using any editor of your choice in, say, C:\> jsoup.

JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTester {
       public static void main(String[] args) {

          String html = "<html><head><title>Sample Title</title></head>"
             + "<body>"
             + "<p>Sample Content</p>"
             + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"
             + "<div class='comments'><a href='www.sample1.com'>Sample1</a>"
             + "<a href='www.sample2.com'>Sample2</a>"
             + "<a href='www.sample3.com'>Sample3</a>"
             + "</div>"
             + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
             + "<img name='yahoo' src='yahoo.jpg' />"
             + "</div>"
             + "</body></html>";
          Document document = Jsoup.parse(html);

          // Example: set attribute
          Element link = document.getElementById("googleA");
          System.out.println("Outer HTML Before Modification :" + link.outerHtml());
          link.attr("href", "www.yahoo.com");
          System.out.println("Outer HTML After Modification :" + link.outerHtml());
          System.out.println("---");

          // Example: add class (added to the link, printed via its enclosing div)
          Element div = document.getElementById("sampleDiv");
          System.out.println("Outer HTML Before Modification :" + div.outerHtml());
          link.addClass("header");
          System.out.println("Outer HTML After Modification :" + div.outerHtml());
          System.out.println("---");

          // Example: remove class
          Element div1 = document.getElementById("imageDiv");
          System.out.println("Outer HTML Before Modification :" + div1.outerHtml());
          div1.removeClass("header");
          System.out.println("Outer HTML After Modification :" + div1.outerHtml());
          System.out.println("---");

          // Example: bulk update
          Elements links = document.select("div.comments a");
          System.out.println("Outer HTML Before Modification :" + links.outerHtml());
          links.attr("rel", "nofollow");
          System.out.println("Outer HTML After Modification :" + links.outerHtml());
       }
    }

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Outer HTML Before Modification :<a id="googleA" href="www.google.com">Google</a>
    Outer HTML After Modification :<a id="googleA" href="www.yahoo.com">Google</a>
    ---
    Outer HTML Before Modification :<div id="sampleDiv">
     <a id="googleA" href="www.yahoo.com">Google</a>
    </div>
    Outer HTML After Modification :<div id="sampleDiv">
     <a id="googleA" href="www.yahoo.com" class="header">Google</a>
    </div>
    ---
    Outer HTML Before Modification :<div id="imageDiv" class="header">
     <img name="google" src="google.png">
     <img name="yahoo" src="yahoo.jpg">
    </div>
    Outer HTML After Modification :<div id="imageDiv" class="">
     <img name="google" src="google.png">
     <img name="yahoo" src="yahoo.jpg">
    </div>
    ---
    Outer HTML Before Modification :<a href="www.sample1.com">Sample1</a>
    <a href="www.sample2.com">Sample2</a>
    <a href="www.sample3.com">Sample3</a>
    Outer HTML After Modification :<a href="www.sample1.com" rel="nofollow">Sample1</a>
    <a href="www.sample2.com" rel="nofollow">Sample2</a>
    <a href="www.sample3.com" rel="nofollow">Sample3</a>
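The bulk update at the end of the example works because Elements offers the same kind of setters as a single Element and applies them to every match. A minimal sketch of that idea follows; the class name AttributeSketch, the helper tagLinks, the class name "external", and the sample markup are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class AttributeSketch {
    // Hypothetical markup: two links with an href and one named anchor without.
    static final String HTML =
          "<a href='a.html' target='_blank'>A</a>"
        + "<a href='b.html'>B</a>"
        + "<a name='top'>C</a>";

    // Marks every link that has an href and returns the modified document.
    static Document tagLinks(String html) {
        Document document = Jsoup.parse(html);
        Elements links = document.select("a[href]");
        links.attr("rel", "nofollow");   // set an attribute on every match
        links.addClass("external");      // add a class to every match
        links.removeAttr("target");      // remove an attribute from every match
        return document;
    }

    public static void main(String[] args) {
        Document document = tagLinks(HTML);
        System.out.println("Tagged links: "
            + document.select("a.external[rel=nofollow]").size());
        System.out.println("Links with target: "
            + document.select("a[target]").size());
        System.out.println("Named anchor tagged: "
            + document.select("a[name=top]").first().hasClass("external"));
    }
}
```

Only the elements matched by the selector are changed, so the named anchor without an href is left untouched.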

jsoup – Home

jsoup Tutorial

jsoup is a Java-based library for working with HTML content. It provides a very convenient API to extract and manipulate data, using the best of DOM, CSS, and jquery-like methods. It implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do. This reference will take you through the simple and practical methods available in the jsoup library.

Audience

This reference has been prepared for beginners to help them understand the basic functionality available in the jsoup library.

Prerequisites

Before you start practicing with the various examples given in this reference, I am assuming that you are already familiar with basic Java programming.

jsoup – Parsing Body

The following example shows how to parse an HTML fragment String into an Element object as the HTML body.

Syntax

    Document document = Jsoup.parseBodyFragment(html);
    Element body = document.body();

Where

document − the Document object that represents the HTML DOM.
Jsoup − main class used to parse the given HTML String.
html − the HTML fragment String.
body − the document's body element; equivalent to document.getElementsByTag("body").first().

Description

The parseBodyFragment(String html) method parses the input HTML into a new Document. This Document object can be used to traverse and get details of the HTML body fragment.

Example

Create the following Java program using any editor of your choice in, say, C:\> jsoup.

JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTester {
       public static void main(String[] args) {

          String html = "<div><p>Sample Content</p>";
          Document document = Jsoup.parseBodyFragment(html);
          Element body = document.body();
          Elements paragraphs = body.getElementsByTag("p");
          for (Element paragraph : paragraphs) {
             System.out.println(paragraph.text());
          }
       }
    }

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Sample Content
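Because parseBodyFragment() returns a full Document, the fragment ends up inside a generated html, head, and body skeleton. The sketch below illustrates this with a fragment whose last tag is left unclosed; the class name FragmentSketch, the helper parseFragment, and the sample fragment are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FragmentSketch {
    // Hypothetical fragment: two paragraphs, the second one left unclosed.
    static final String FRAGMENT = "<p>One</p><p>Two";

    // Parses the fragment into a new Document with a generated skeleton.
    static Document parseFragment() {
        return Jsoup.parseBodyFragment(FRAGMENT);
    }

    public static void main(String[] args) {
        Document document = parseFragment();
        Element body = document.body();

        // The fragment's top-level elements become children of body.
        System.out.println("Body children: " + body.children().size());
        for (Element child : body.children()) {
            System.out.println(child.tagName() + ": " + child.text());
        }

        // The generated head exists but has no child elements.
        System.out.println("Head is empty: " + document.head().children().isEmpty());
    }
}
```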

jsoup – Parsing String

The following example shows how to parse an HTML String into a Document object.

Syntax

    Document document = Jsoup.parse(html);

Where

document − the Document object that represents the HTML DOM.
Jsoup − main class used to parse the given HTML String.
html − the HTML String.

Description

The parse(String html) method parses the input HTML into a new Document. This Document object can be used to traverse and get details of the HTML DOM.

Example

Create the following Java program using any editor of your choice in, say, C:\> jsoup.

JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTester {
       public static void main(String[] args) {

          String html = "<html><head><title>Sample Title</title></head>"
             + "<body><p>Sample Content</p></body></html>";
          Document document = Jsoup.parse(html);
          System.out.println(document.title());
          Elements paragraphs = document.getElementsByTag("p");
          for (Element paragraph : paragraphs) {
             System.out.println(paragraph.text());
          }
       }
    }

Verify the result

Compile the class using the javac compiler as follows:

    C:\jsoup>javac JsoupTester.java

Now run JsoupTester to see the result.

    C:\jsoup>java JsoupTester

See the result.

    Sample Title
    Sample Content
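A String parsed this way has no location of its own, so relative links in it cannot be resolved. jsoup also provides a two-argument parse(String html, String baseUri) for that case. The following sketch shows how it might be used together with absUrl(); the class name BaseUriSketch, the helper absoluteLinks, the example.com base URI, and the sample markup are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriSketch {
    // Hypothetical markup with one root-relative and one relative link.
    static final String HTML =
          "<a href='/docs/index.html'>Docs</a>"
        + "<a href='page2.html'>Next</a>";

    // Parses html against baseUri and returns every link as an absolute URL.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            // attr("href") would return the raw value; absUrl resolves it.
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : absoluteLinks(HTML, "http://example.com/guide/")) {
            System.out.println(url);
        }
    }
}
```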