Showdown – Java HTML Parsing Comparison
I had to do some HTML parsing today, but unfortunately most HTML on the web is not well-formed like any markup I’d create. Missing end tags and other broken syntax throw a wrench into the works. Luckily, others have already addressed this issue, many times over in fact, which leaves you wondering which solution to pick.
Once you parse HTML, you can do some cool stuff with it, like transforming it or extracting information, which is why HTML parsing is often used for screen scraping. So, to test the parsing libraries, I decided to do exactly that and see whether I could parse the HTML well enough to extract links from it using an XQuery. The contenders were NekoHTML, HtmlCleaner, TagSoup, and jTidy. I know there are many others I could have chosen as well, but this seemed like a good sampling and there’s only so much time in the day. I also chose 10 URLs to parse. Being a true Clevelander, I picked the sites of a number of local attractions. I’m right near all of the stadiums, so the Quicken Loans Arena website was my first target. I sometimes jokingly refer to my city as the “Mistake on the Lake,” and the pure awfulness of the HTML from my city did not fail me. The ten URLs I chose are:
http://www.theqarena.com
http://cleveland.indians.mlb.com
http://www.clevelandbrowns.com
http://www.cbgarden.org
http://www.clemetzoo.com
http://www.cmnh.org
http://www.clevelandart.org
http://www.mocacleveland.org
http://www.glsc.org
http://www.rockhall.com
I gave each library an InputStream created from a URL (referred to as urlIS in the code samples below) and expected an org.w3c.dom.Node in return once the parse operation completed. I implemented each library in its own class extending an AbstractScraper that implements a Scraper interface I created, a design tip fresh in my mind from reading my all-time favorite technical book, Effective Java by Josh Bloch. The implementation-specific code for each library is below.
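First, a minimal sketch of what the Scraper interface and AbstractScraper described above might look like; the names and signatures here are illustrative reconstructions from the description, not the actual project code:

// Hypothetical reconstruction of the design described above.
import java.io.InputStream;
import org.w3c.dom.Node;

interface Scraper {
    // Parse the HTML read from the stream and return a W3C DOM node.
    Node scrape(InputStream urlIS);
}

abstract class AbstractScraper implements Scraper {
    // Shared plumbing (opening the URL stream, running the XQuery over the
    // resulting Node, counting links, etc.) would live here; each
    // library-specific subclass supplies its own scrape() implementation.
}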
NekoHTML:
// NekoHTML's DOMParser (org.cyberneko.html.parsers.DOMParser) parses straight to a DOM.
final DOMParser parser = new DOMParser();
try {
    parser.parse(new InputSource(urlIS));
    document = parser.getDocument();
} catch (SAXException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
TagSoup:
// TagSoup (org.ccil.cowan.tagsoup.Parser) parses via SAX;
// SAX2DOM collects the SAX events into a W3C DOM.
final Parser parser = new Parser();
SAX2DOM sax2dom = null;
try {
    sax2dom = new SAX2DOM();
    parser.setContentHandler(sax2dom);
    parser.setFeature(Parser.namespacesFeature, false);
    parser.parse(new InputSource(urlIS));
} catch (Exception e) {
    e.printStackTrace();
}
document = sax2dom.getDOM();
jTidy:
// jTidy's Tidy class (org.w3c.tidy.Tidy) can parse directly to a DOM.
final Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
tidy.setForceOutput(true);
document = tidy.parseDOM(urlIS, null);
HtmlCleaner:
// HtmlCleaner (org.htmlcleaner.HtmlCleaner) takes the stream in its constructor.
final HtmlCleaner cleaner = new HtmlCleaner(urlIS);
try {
    cleaner.clean();
    document = cleaner.createDOM();
} catch (Exception e) {
    e.printStackTrace();
}
Finally, to judge each library’s ability to parse the HTML, I ran the XQuery “//a” to grab all the <a> tags from the document. The only one of these parsing libraries I had used before was jTidy, which was able to extract the links from 5 of the 10 documents. The clear winner, however, was HtmlCleaner: it was the only library to successfully clean 10 of the 10 documents. Most of the others could not make it past even the first URL on my list, the Quicken Loans Arena site. HtmlCleaner’s full results:
Found 87 links at http://www.theqarena.com/
Found 156 links at http://cleveland.indians.mlb.com/
Found 96 links at http://www.clevelandbrowns.com/
Found 106 links at http://www.cbgarden.org/
Found 70 links at http://www.clemetzoo.com/
Found 23 links at http://www.cmnh.org/site/
Found 27 links at http://www.clevelandart.org/
Found 51 links at http://www.mocacleveland.org/
Found 27 links at http://www.glsc.org/
Found 90 links at http://www.rockhall.com/
One disclaimer I will make is that I did not go out of my way to improve the performance of any of these libraries. Some of them have additional options that could be set to possibly improve performance, but I did not wade through the documentation to figure out what those options were and simply used the plain vanilla incantations. HtmlCleaner seems to offer everything I need and was quick and easy to implement. One drawback to HtmlCleaner is that it’s not available in a Maven repository, so NekoHTML may sometimes be easier to use for that reason. Note also that at the time of writing, the last released version of jTidy was from 2000; a newer .jar has since been made available that will likely perform better.
hi there,
thanks for the post. I frequently need to get hold of links from documents and download them automatically. I almost always do it manually, especially if I am behind a corporate firewall, but with HtmlCleaner and XQuery I should be able to automate most of it next time.
Which XQuery tool do you use?
BR,
~A
Hi Anjan,
This method will easily beat the manual one. I use Saxon to run the actual XQuery on the DOM once I’ve received it back from the parse operation. Perhaps if there is interest I can write a post on how to use Saxon.
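In the meantime, here is a minimal sketch of the idea using Saxon’s s9api; the exact calls in my project may differ slightly, and exception handling is omitted:

// Run an XQuery over a W3C DOM using Saxon's s9api (SaxonApiException handling omitted).
import javax.xml.transform.dom.DOMSource;
import net.sf.saxon.s9api.*;

Processor processor = new Processor(false); // Saxon-HE, no schema awareness
XQueryCompiler compiler = processor.newXQueryCompiler();
XQueryEvaluator evaluator = compiler.compile("//a").load();

// Wrap the org.w3c.dom.Document returned by the parser as the context item.
XdmNode context = processor.newDocumentBuilder().build(new DOMSource(document));
evaluator.setContextItem(context);

for (XdmItem item : evaluator.evaluate()) {
    System.out.println(item.toString());
}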
-Ben
I’m curious about your TagSoup test, as I’ve had good results with it. One thing I found was that it actually created a namespace for the nodes, so I had to preface my xquery with “h:”. Following is from some code I used to print out the movies in my Blockbuster Queue (of course I wrote helper code – not included):
Dom4jXPath xpath = new Dom4jXPath("//h:div[@class='title']");
xpath.addNamespace("h", "http://www.w3.org/1999/xhtml");
java.util.List divs = xpath.selectNodes(doc);
…
later
…
Dom4jXPath xpath = new Dom4jXPath("h:a");
xpath.addNamespace("h", "http://www.w3.org/1999/xhtml");
xpath.stringValue(element);
I do like how HtmlCleaner just required a constructor taking a URL, rather than all the extra code the others require. I will look into it.
Very nice article Ben. I was looking for such a helpful post.
Hi Lance,
You are correct about the TagSoup namespace issue and I’ve updated the code above to reflect an easier fix.
Since I didn’t post the actual TagSoup results I was going to rerun the test and share the results with you. Unfortunately, I’m afraid I’ve changed the code in some manner since I’ve written this post. I’m not able to get very far since I’m now being presented with an exception: “org.w3c.dom.DOMException: NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist.” I’m not going to spend any time debugging this since HTML Cleaner seems to work well enough for me, but if you’ve run into the same problem and know what I’m doing wrong I’d be happy to take another look.
-Ben
Nice! We rather appreciated the website
[…] of my first posts was a comparison of HTML parsers. Today I found a particularly challenging document to parse. None of the parsers I had compared […]
“Only URL to nice internet shop (for beauties!) shows the difference, 144 links found with HtmlCleaner, and 116 with NekoHTML. After quick copy-paste to Excel and sorting links I found that some links are simply repeated by HtmlCleaner probably due to bug… so that all parsers behave the same, correctly parsing ugliest HTML.”
😉
Bambaria,
I don’t see your results anywhere. Did you try any of the links I published? I found that most of the parsers did not handle these URLs well. The Mozilla Parser is the best I’ve come across for malformed xml.
-Ben
Hi,
thanks for the article. Unfortunately, I am not able to use NekoHTML with Saxon. Saxon always crashes with an error saying the document isn’t valid. Do you think you could post an example of your code?
Thank you
Tom
You didn’t evaluate Cobra:
http://lobobrowser.org/cobra/java-html-parser.jsp
I am attempting to download a page and obtain the images and text in it. What is the best (well-documented, easily usable) crawler/scraper/parser for such a purpose?
I would like to screen scrape an HTML page and then simply view/display its text and images, or get the raw data of each image as a handle (to store in a db or display).
I tried using the Mozilla parser but it seems too complicated to use. I just want the page’s text content and the images. I tried htmlparser but that does not seem to be working properly: http://htmlparser.sourceforge.net/javadoc/org/htmlparser/parserapplications/SiteCapturer.html It does not load images or store them locally. I would just like to reproduce the HTML and get a handle on the images, please.
Hi, I support HtmlCleaner, though its syntax has changed since this post. HtmlCleaner did it for me. Here is what I learned while creating an HTML scraping program that, given an XPath, extracts the innerHTML. I got it working with JavaScript in two days, but the Java side went as follows:
HTML Parser: I just couldn’t get the job done, as the APIs are so different and some useful things are missing. It is not even remotely similar to using a DOM parser. It does have Filters, but they are too difficult to manage, and I needed to try the others quickly.
TagSoup: it did most of the job well, but it required cleaned-up HTML; some other things needed to be cleaned up first.
So I had to put jTidy in front to do the clean-up, but that required saving an intermediate file and parsing it again. Alas, my earlier XPath was no longer valid because the DOM structure of the temporary HTML was different. That doesn’t suit me.
Mozilla parser: I finally tried it, but at my office I don’t have rights to change the PATH variable, so I gave up; discovering other ways to start it would have been too time consuming.
HtmlCleaner: it did the job for me, with no intermediate file saving and no recreating XPaths from intermediate files. It parsed malformed HTML the same way IE or Mozilla do (I know this because my JavaScript works in both Mozilla and IE without any changes).
Great blog, found here all that I was looking for.
Such a nice site. I will visit it more often and read the comments. Thanks a lot.
Nice one, I’ve been looking for this in days already ^_^
Just a question: what if I already have the content that I want to parse? Which of these libraries can parse string content directly rather than from a URL?
I am using this code..
DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = dbfac.newDocumentBuilder();
doc = docBuilder.parse( new java.io.StringBufferInputStream(content) );
but I have no luck whenever the content is malformed. I need to convert this content into a Document or something in the form of nodes.
QUESTION:
————
Which of the tools you mentioned is capable of parsing the content directly and doing some corrections if needed?
It would really be a great help.
Thanks and Godbless ^_^
————
Borgy,
If you need to turn a String into an InputStream you can do the following:
new ByteArrayInputStream(string.getBytes())
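So, roughly, feeding an already-downloaded String to HtmlCleaner could look like this (just a sketch; passing an explicit charset to getBytes() is safer so the bytes match the page’s encoding):

// java.io.ByteArrayInputStream turns the String's bytes into a stream HtmlCleaner can read.
InputStream contentIS = new ByteArrayInputStream(content.getBytes());
HtmlCleaner cleaner = new HtmlCleaner(contentIS);
cleaner.clean();
document = cleaner.createDOM();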
tnx ben ^_^
I have a noob question regarding HTMLCleaner this time.
How can I transform the tags from lower-case into upper-case? I want all the tags to be in upper-case since the XSL file that will consume the output expects all tags in upper-case form… and I don’t have any idea about XSLT ^_^
Thank you and Godbless ^_^
I have tried adding this code, hoping that it will convert <table> into <TABLE>, but I have no luck:
CleanerTransformations transformations = new CleanerTransformations();
TagTransformation tagTransformation = new TagTransformation("table", "TABLE", true);
transformations.addTransformation(tagTransformation);
cleaner.setTransformations(transformations);
This code does not transform “table” into “TABLE”, but if I try to transform “table” into “TABLET” or any value other than “table”, it does transform.
Is there a way I can transform all tags into upper-case, retaining all the tag attributes, in one go? ^_^
Hi Ben and readers,
Self plug: I’ve just released a new open source HTML parser called jsoup @ http://jsoup.org/ . Its goal is to deal with all real-world HTML. The interface is designed a bit differently from the above parsers; it combines the best of DOM, CSS, and jQuery-like accessor methods.
I ran the URLs from the showdown through it:
List<String> urls = Arrays.asList("http://www.theqarena.com", "http://cleveland.indians.mlb.com",
    "http://www.clevelandbrowns.com", "http://www.cbgarden.org", "http://www.clemetzoo.com",
    "http://www.cmnh.org/site/", "http://www.clevelandart.org", "http://www.mocacleveland.org",
    "http://www.glsc.org", "http://www.rockhall.com");
for (String urlString : urls) {
    URL url = new URL(urlString);
    Document doc = Jsoup.parse(url, 3 * 1000);
    Elements links = doc.select("a");
    System.out.println(String.format("Found %d links at %s", links.size(), urlString));
}
The results: (all parse successfully)
Found 71 links at http://www.theqarena.com
Found 330 links at http://cleveland.indians.mlb.com
Found 105 links at http://www.clevelandbrowns.com
Found 135 links at http://www.cbgarden.org
Found 86 links at http://www.clemetzoo.com
Found 320 links at http://www.cmnh.org/site/
Found 103 links at http://www.clevelandart.org
Found 63 links at http://www.mocacleveland.org
Found 34 links at http://www.glsc.org
Found 90 links at http://www.rockhall.com
Probably all of these sites have updated since the original post so it’s not totally comparable, but I think it’s a pretty good showing, and the interface is (imho) a lot easier to use.
Nice article and good coverage. I have found Lobo/Cobra to be somewhat useful.
Sorry, I am taking this off topic.
Jonathon, your implementation looks interesting and useful for our startup, which is scraping stock prices from all over the internet. However, we are planning to use a third-party XQuery engine, and I need an org.w3c.dom.Document. Your implementation has a custom Document. How can I convert your document model to an org.w3c.dom.Document tree? Any pointers much appreciated.
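One route we are looking at (just a sketch, and it assumes a jsoup release that ships the org.jsoup.helper.W3CDom helper, which newer versions do) is to convert the jsoup tree to a standard DOM before handing it to the XQuery engine:

// Parse with jsoup, then convert its document model to an org.w3c.dom.Document.
// Uses org.jsoup.Jsoup and org.jsoup.helper.W3CDom.
org.jsoup.nodes.Document jsoupDoc = Jsoup.connect("http://www.rockhall.com").get();
org.w3c.dom.Document w3cDoc = new W3CDom().fromJsoup(jsoupDoc);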
Hi, very nice information. I will try this approach in a future project and get back to you soon.
Hi,
Any chance you could put up a small working project with all of these? I’m having a lot of trouble getting TagSoup going on Android.
Cheers,
S++
very nice resource for web developers. Thanks dude for valuable information.
Hi,
Can someone please help me with the following problem that I’m facing ?
I am parsing certain HTML pages. As a best-effort approach, I’m using Neko, and if Neko fails, my code switches to JTidy. After parsing, I use XPath to extract some information from the page. The problem is that Neko prefixes "xhtml" to every element in the parsed DOM, so I have to specify this prefix in the XPath as well (e.g., //xhtml:a/@href). Because of this, I’m not able to use a common XPath for extraction regardless of which parser created the DOM (Neko or JTidy). Please help.
Thanks,
Ankur.
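Ankur, one namespace-agnostic workaround (a sketch only, not tested against both parsers) is to match elements by local-name(), so the same expression works whether or not the parser placed them in the XHTML namespace:

// javax.xml.xpath works against any org.w3c.dom.Document, whether produced by Neko or JTidy.
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList hrefs = (NodeList) xpath.evaluate(
        "//*[local-name()='a']/@href", document, XPathConstants.NODESET);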
Thanks to all for your posts and your time!!!
I really appreciate it!
Thanks for the post, great comparison!!!
What is urlIS here? Which class is it an instance of?
Actually, I am trying to parse an HTML page stored on disk.
How do I supply a FileReader object to the HtmlCleaner constructor?
Hi Utkarsh,
urlIS is a java.io.InputStream. One way of getting an InputStream is to call URL.openStream(). If you’re reading from files on disk you’ll probably want to use a java.io.FileInputStream.
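For example, something along these lines (a quick sketch using the same HtmlCleaner constructor as in the post; the file name is just a placeholder):

// Read a saved page from disk instead of from a URL (java.io.FileInputStream).
InputStream urlIS = new FileInputStream("saved-page.html");
HtmlCleaner cleaner = new HtmlCleaner(urlIS);
cleaner.clean();
document = cleaner.createDOM();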
thanks Ben
Hi Ben, in your code for HtmlCleaner you are using
document = cleaner.createDOM();
However, this method is not present in the HtmlCleaner class.
So, please help me.
Thanks
utkarsh, you can use the following:
CleanerProperties props = new CleanerProperties();
// not sure if you will need it, but I needed it
props.setNamespacesAware(false);
HtmlCleaner cleaner = new HtmlCleaner(props);
TagNode clean = cleaner.clean(urlIS); // the newer API's clean(...) returns a TagNode
DomSerializer dom = new DomSerializer(props);
Document doc = dom.createDOM(clean);
Thanks for this excellent article. Three years after the initial comparison, the results are still useful for choosing the appropriate HTML parser for our needs. I will use HtmlCleaner, by the way…
I’m particularly pleased to have found this site. I want to thank you for the time put into writing this superb post. I certainly enjoyed every part of it, and I have bookmarked your site to check out new articles and blog posts.
This is a very useful post! Thank you very much.
I like HtmlCleaner. It saves me a lot of time while parsing html files.