3 Examples of Parsing HTML File in Java using Jsoup

HTML is the core of the web; every page you see on the internet is HTML, whether it is dynamically generated by JavaScript, JSP, PHP, ASP, or any other web technology. Your browser actually parses HTML and renders it for you. But what would you do if you needed to parse an HTML document and find some elements, tags, or attributes, or check whether a particular element exists, from a Java program? If you have been programming in Java for some years, I am sure you have done some XML parsing using parsers like DOM and SAX, but there is also a good chance that you have not done any HTML parsing. Ironically, there are few occasions when you need to parse an HTML document from a core Java application, which doesn't include Servlets and other Java web technologies. To make matters worse, the core JDK does not ship a general-purpose HTML parser the way it ships XML parsers; or at least I am not aware of one. That's why, when it comes to parsing an HTML file, many Java programmers have to turn to Google to find out how to get the value of an HTML tag in Java.

When I needed that, I was sure there would be an open source library that would do it for me, but I didn't know it would be as wonderful and feature rich as Jsoup. It not only provides support for reading and parsing HTML documents but also allows you to extract any element from an HTML file, along with its attributes and its CSS class in jQuery style, and also allows you to modify them. You can probably do anything with an HTML document using Jsoup.

In this article, we will parse an HTML file and find out the value of the title and heading tags. We will also see examples of downloading and parsing HTML from a file as well as from any URL on the internet, by parsing Google's home page in Java.



What is JSoup Library

Jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. Jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers like Chrome and Firefox do. Here are some of the useful features of the Jsoup library:
  •     Jsoup can scrape and parse HTML from a URL, file, or string
  •     Jsoup can find and extract data, using DOM traversal or CSS selectors
  •     Jsoup allows you to manipulate the HTML elements, attributes, and text
  •     Jsoup can clean user-submitted content against a safe white-list, to prevent XSS attacks
  •     Jsoup can also output tidy HTML
Jsoup is designed to deal with the different kinds of HTML found in the real world, ranging from proper validated HTML to incomplete, non-validating tag soup. One of the core strengths of Jsoup is that it is very robust.
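To make the white-list feature a little more concrete, here is a minimal sketch of cleaning untrusted markup. It assumes a reasonably recent Jsoup release where the class is called Safelist (older releases named it Whitelist); the class name CleanExample and the sample input are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanExample {

    // Keeps only the simple text-formatting tags allowed by Safelist.basic();
    // script elements and event-handler attributes such as onclick are dropped.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical user-submitted snippet
        String dirty = "<p>Hello <script>alert('xss')</script>"
                + "<b onclick=\"steal()\">world</b></p>";
        System.out.println(sanitize(dirty));
    }
}
```

Safelist.basic() is only one of the predefined lists; which one fits depends on how much markup you want to allow in user content.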


HTML Parsing in Java using JSoup

In this Java HTML parsing tutorial, we will see three different examples of parsing and traversing an HTML document in Java using Jsoup. In the first example, we will parse an HTML String which contains all the tags in the form of a String literal in Java. In the second example, we will download our HTML document from the web, and in the third example, we will load our own sample HTML file, login.html, for parsing. This file is a sample HTML document which contains a title tag and a div in the body that holds an HTML form. It has input tags to capture the username and password, and submit and reset buttons for further action. It's proper HTML which can be validated, i.e. all tags and attributes are properly closed. Here is what our sample HTML file looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
        <title>Login Page</title>
    </head>
    <body>
        <div id="login" class="simple" >
            <form action="login.do">
                Username : <input id="username" type="text" /><br>
                Password : <input id="password" type="password" /><br>
                <input id="submit" type="submit" />
                <input id="reset" type="reset" />
            </form>
        </div>
    </body>
</html>

HTML parsing is very simple with Jsoup; all you need to do is call the static method Jsoup.parse() and pass your HTML String to it. Jsoup provides several overloaded parse() methods to read HTML from a String, a File, a URL, or an InputStream, optionally together with a base URI. You can also specify the character encoding to correctly read HTML files which are not in "UTF-8" format. The parse(String html) method parses the input HTML into a new Document. In Jsoup, Document extends Element, which extends Node; TextNode also extends Node. As long as you pass in a non-null string, you're guaranteed to have a successful, sensible parse, with a Document containing (at least) a head and a body element. Once you have a Document, you can get the data you want by calling the appropriate methods on Document and its parent classes Element and Node.
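As a rough illustration of two of these points, the sketch below parses a tiny fragment to show that a body is always created, and uses the parse(String html, String baseUri) overload so that a relative link can be resolved. The class name ParseOverloads, the helper names, and the example.com URL are placeholders of mine, not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseOverloads {

    // Even an incomplete fragment is parsed into a Document with a body.
    static String bodyText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    // The base URI lets Jsoup turn relative links into absolute ones.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(bodyText("<p>Hello"));
        System.out.println(
                firstAbsoluteLink("<a href='/docs'>Docs</a>", "http://example.com/"));
        // prints http://example.com/docs on the second line
    }
}
```

Because Document extends Element, methods such as select() and text() are available on the document itself as well as on any element you pull out of it.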


Java Program to parse HTML Document

Here is our complete Java program to parse an HTML String, an HTML file downloaded from the internet, and an HTML file from the local file system. To run this program, you can use the Eclipse IDE, any other IDE, or the command prompt; either way, the Jsoup JAR must be on the classpath. In Eclipse, it's very easy: just copy this code, create a new Java project, right-click on the src package, and paste it. Eclipse will take care of creating the proper package and a Java source file with the same name, so there is very little work. If you already have a sample Java project, then it's just one step.

The following Java program shows 3 examples of parsing and traversing an HTML file. In the first example, we directly parse a String with HTML content; in the second example, we parse an HTML file downloaded from a URL; in the third example, we load and parse an HTML document from the local file system. In the first and third examples we use the parse method to get a Document object, which can be queried to extract any tag value or attribute value. In the second example, we use Jsoup.connect(...).get(), which takes care of making the connection to the URL, downloading the HTML, and parsing it. The get() call also returns a Document object, which can be used for further querying and getting the value of any tag or attribute.

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Java Program to parse/read HTML documents from File using Jsoup library.
 * Jsoup is an open source library which allows Java developers to parse HTML
 * files and extract elements, manipulate data, and change style using DOM, CSS
 * and jQuery-like methods.
 *
 * @author Javin Paul
 */
public class HTMLParser {

    public static void main(String args[]) {

        // JSoup Example 1 - Parse HTML String using JSoup library
        String htmlString = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>JSoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1></tr>"
                + "</table>"
                + "</body>"
                + "</html>";

        Document html = Jsoup.parse(htmlString);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();

        System.out.println("Input HTML String to JSoup :" + htmlString);
        System.out.println("After parsing, Title : " + title);
        System.out.println("After parsing, Heading : " + h1);

        // JSoup Example 2 - Reading HTML page from URL
        Document doc;
        try {
            doc = Jsoup.connect("http://google.com/").get();
            title = doc.title();
        } catch (IOException e) {
            e.printStackTrace();
        }

        System.out.println("Jsoup Can read HTML page from URL, title : " + title);

        // JSoup Example 3 - Parsing an HTML file in Java
        // Wrong: parse(String, String) treats "login.html" as the HTML itself
        // and the second argument as a base URI, not as a file name and charset.
        //Document htmlFile = Jsoup.parse("login.html", "ISO-8859-1");
        Document htmlFile = null;
        try {
            // Right: pass a File together with the character encoding
            htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to query if the file could not be read
        }
        title = htmlFile.title();
        Element div = htmlFile.getElementById("login");
        String cssClass = div.className(); // getting class from HTML element

        System.out.println("Jsoup can also parse HTML file directly");
        System.out.println("title : " + title);
        System.out.println("class of div tag : " + cssClass);
    }

}

Output:
Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
After parsing, Title : JSoup Example
After parsing, Heading : HelloWorld
Jsoup Can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple
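The program above reads the title and one CSS class, but the same Document can be queried with CSS selectors too. The following is a small sketch, assuming the login form markup shown earlier is inlined as a String; the class name SelectorExample and its helper methods are my own and not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorExample {

    // Same structure as login.html, inlined so the sketch needs no file.
    static final String LOGIN_HTML = "<html><head><title>Login Page</title></head><body>"
            + "<div id=\"login\" class=\"simple\"><form action=\"login.do\">"
            + "Username : <input id=\"username\" type=\"text\" /><br>"
            + "Password : <input id=\"password\" type=\"password\" /><br>"
            + "<input id=\"submit\" type=\"submit\" />"
            + "<input id=\"reset\" type=\"reset\" />"
            + "</form></div></body></html>";

    // Collects the id of every input nested inside the div with id "login".
    static List<String> inputIds(String html) {
        Document doc = Jsoup.parse(html);
        List<String> ids = new ArrayList<>();
        for (Element input : doc.select("div#login input")) {
            ids.add(input.id());
        }
        return ids;
    }

    // Reads the action attribute of the first form, or "" if there is none.
    static String formAction(String html) {
        Element form = Jsoup.parse(html).select("form").first();
        return form == null ? "" : form.attr("action");
    }

    public static void main(String[] args) {
        System.out.println(inputIds(LOGIN_HTML));
        System.out.println(formAction(LOGIN_HTML));
    }
}
```

Checking for null before calling attr() is worth the extra line, since first() returns null when the selector matches nothing.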

A good thing about Jsoup is that it is very robust. The Jsoup HTML parser will make every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed or not. It can handle the following mistakes:
  •     unclosed tags (e.g. <p>Java <p>Scala becomes <p>Java</p> <p>Scala</p>)
  •     implicit tags (e.g. a naked <td>Java is Great</td> is wrapped into a <table><tr><td>)
  •     reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)
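Two of the points above, unclosed tags and the reliable head/body structure, are easy to check for yourself. This is only a sketch with a deliberately messy input String of my own; the class name RobustParse and its helpers are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RobustParse {

    // Counts the paragraph elements Jsoup ends up with in the body.
    static int paragraphCount(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().select("p").size();
    }

    // Checks whether the title element was placed inside the head.
    static boolean titleInHead(String html) {
        Document doc = Jsoup.parse(html);
        return !doc.head().select("title").isEmpty();
    }

    public static void main(String[] args) {
        // No html, head or body tags, and neither paragraph is closed.
        String messy = "<title>Demo</title><p>Java <p>Scala";
        System.out.println(paragraphCount(messy)); // two separate paragraphs
        System.out.println(titleInHead(messy));
        System.out.println(Jsoup.parse(messy).html()); // normalized markup
    }
}
```

Printing doc.html() is a handy way to see exactly how Jsoup has normalized whatever markup you gave it.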

That's all about how to parse an HTML document in Java. Jsoup is an excellent and robust open source library which makes reading an HTML document, body fragment, or HTML string, and directly parsing HTML content from the web, extremely easy. In this article, we learned how to get the value of a particular HTML tag in Java: in the first example we extracted the title and the value of the H1 tag as text, and in the third example we learned how to get the value of an attribute from an HTML tag by extracting the CSS class. Apart from the powerful jQuery-style html.body().getElementsByTag("h1").text() method, which you can use to extract any HTML tag, it also provides convenience methods like Document.title() and Element.className() to quickly get the title and CSS class. Have fun with Jsoup, and we will see a couple more examples of this API soon.
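The examples in this article only read values, but Jsoup can modify elements as well. Here is a minimal sketch, assuming a div like the one in login.html; the class name ModifyExample, the "highlight" class, and the data-parsed attribute are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ModifyExample {

    // Adds a CSS class and a custom attribute to the div with id "login"
    // and returns the resulting class attribute value.
    static String restyle(String html) {
        Document doc = Jsoup.parse(html);
        Element div = doc.getElementById("login");
        if (div == null) {
            return "";
        }
        div.addClass("highlight");        // appended after the existing class
        div.attr("data-parsed", "true");  // sets or replaces the attribute
        return div.className();
    }

    public static void main(String[] args) {
        String html = "<div id=\"login\" class=\"simple\">Login form goes here</div>";
        System.out.println(restyle(html)); // prints: simple highlight
    }
}
```

The changes live only in the in-memory Document; to keep them you would write doc.html() back out yourself.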

Further Reading
Introduction to Spring MVC 4
RESTFul Services in Java using Jersey
Java Web Fundamentals
