HTML Parsing using the Firefox DLLs
One of my first posts was a comparison of HTML parsers. Today I found a particularly challenging document to parse. None of the parsers I had compared earlier were able to handle the malformed HTML in this table where the td elements were prematurely ended. The behavior of Neko and HtmlCleaner made the most sense (while still failing to clean the document) while the output from TagSoup and jTidy was a bit more strange.
However, I noticed that Firebug parsed the document correctly. So I did a bit of research into how I could use Firefox’s HTML parsing and found a project called Mozilla Parser that had been put together to do just that. Its setup is not quite as nice as the others, but it is well documented. Follow the quick start to begin with. Then, when you get to the portion where you write actual Java code, you may want to follow the example below, as it appears the API has been updated since the documentation was posted.
// Root of the local Mozilla Parser checkout; adjust for your machine.
final String BASE_PATH = "C:\\Documents and Settings\\bjm733\\My Documents\\workspace\\MozillaHtmlParser\\";
try {
    // Native bridge library (MozillaParser.dll on Windows).
    File parserLibraryFile = new File(BASE_PATH + "native" + File.separator + "bin" + File.separator + "MozillaParser" + EnviromentController.getSharedLibraryExtension());
    String parseLibrary = parserLibraryFile.getAbsolutePath();
    // The second argument is the OS-specific mozilla.dist.bin directory.
    MozillaParser.init(parseLibrary, BASE_PATH + "mozilla.dist.bin." + EnviromentController.getOperatingSystemName());
    MozillaParser parser = new MozillaParser();
    // 'document' is assumed to be declared elsewhere.
    document = parser.parse("<html><body>hello world</body></html>");
} catch (Exception e) {
    e.printStackTrace();
}
The most unfortunate thing about this approach is that it is not pure Java, which can be a deal breaker in many situations. Also, it does not appear to be well maintained, and the developers are not very responsive.
I can’t get past the following exception. I have checked and double-checked my path many times. Any ideas?
com.dappit.Dapper.parser.ParserInitializationException
at com.dappit.Dapper.parser.MozillaParser.init(Unknown Source)
at com.fantasytruth.accuracy.ParserTest.testParser(ParserTest.java:21)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at junit.framework.TestCase.runTest(TestCase.java:154)
at junit.framework.TestCase.runBare(TestCase.java:127)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:118)
at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
Caused by: java.lang.UnsatisfiedLinkError: C:\dev-tools\MozillaHtmlParser\native\bin\MozillaParser.dll: Can't find dependent libraries
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(Unknown Source)
at java.lang.ClassLoader.loadLibrary(Unknown Source)
at java.lang.Runtime.load0(Unknown Source)
at java.lang.System.load(Unknown Source)
… 18 more
Thanks for the comparison report. Looks like I am working on something similar a year later. I am thinking of trying Cobra HTML Parser http://lobobrowser.org/cobra.jsp because it is pure Java, is CSS and JavaScript aware, and looks like it is more actively maintained than HTMLParser.
I tried this out and get the same dependency error as the others. I tried putting the following directories in my PATH:
C:\Documents and Settings\Shiraz>path
PATH=C:\Temp\set\MozillaParser-v-0-3-0\MozillaParser-v-0-3-0\dist\windows;C:\
Temp\set\MozillaParser-v-0-3-0\MozillaParser-v-0-3-0\dist\windows\components
I then tried the following
File parserLibraryFile = new File("C:/SET/lib/mparser/MozillaParser-v-0-3-0/dist/windows/MozillaParser"
        + EnviromentController.getSharedLibraryExtension());
String parserLibrary = parserLibraryFile.getAbsolutePath();
System.out.println("Loading Parser Library " + parserLibrary);
// mozilla.dist.bin directory
final File mozillaDistBinDirectory = new File(
        "C:/SET/lib/mparser/MozillaParser-v-0-3-0/dist/"
        + "windows");
String absPath = mozillaDistBinDirectory.getAbsolutePath();
MozillaParser.init(parserLibrary, absPath);
I still get the following error :
Operating system : Windows XP
Loading Parser Library C:\SET\lib\mparser\MozillaParser-v-0-3-0\dist\windows\MozillaParser.dll
com.dappit.Dapper.parser.ParserInitializationException
at com.dappit.Dapper.parser.MozillaParser.init(Unknown Source)
at first.ParserExample.main(ParserExample.java:30)
Caused by: java.lang.UnsatisfiedLinkError: C:\SET\lib\mparser\MozillaParser-v-0-3-0\dist\windows\MozillaParser.dll: Can't find dependent libraries
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(Unknown Source)
at java.lang.ClassLoader.loadLibrary(Unknown Source)
at java.lang.Runtime.load0(Unknown Source)
at java.lang.System.load(Unknown Source)
… 2 more
I’ve had it working with the following setup (Windows only)
Append the following two directories to the PATH variable (properly prefixed, e.g., C:/)
MozillaParser-v-0-3-0\dist\windows\mozilla\components
MozillaParser-v-0-3-0\dist\windows\mozilla
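On Windows, a "Can't find dependent libraries" error generally means that a DLL required by the one being loaded could not be located, and PATH is one of the places Windows searches. As a quick sanity check, a minimal sketch like the following reports which required directories are absent from PATH as the JVM sees it. The class and method names (PathCheck, missingFromPath) and the C:\ prefix on the directories are assumptions for illustration; the code only inspects the PATH string and does not load any library.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PathCheck {
    /** Returns the entries of requiredDirs that do not appear in the PATH-style string. */
    static List<String> missingFromPath(String pathValue, List<String> requiredDirs) {
        List<File> entries = new ArrayList<>();
        if (pathValue != null) {
            // Split on the platform separator (';' on Windows, ':' elsewhere).
            for (String entry : pathValue.split(Pattern.quote(File.pathSeparator))) {
                if (!entry.trim().isEmpty()) {
                    entries.add(new File(entry.trim()).getAbsoluteFile());
                }
            }
        }
        List<String> missing = new ArrayList<>();
        for (String dir : requiredDirs) {
            // File equality follows the platform's path comparison rules.
            if (!entries.contains(new File(dir).getAbsoluteFile())) {
                missing.add(dir);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed install location; replace the C:\ prefix with your own.
        List<String> required = Arrays.asList(
                "C:\\MozillaParser-v-0-3-0\\dist\\windows\\mozilla\\components",
                "C:\\MozillaParser-v-0-3-0\\dist\\windows\\mozilla");
        for (String dir : missingFromPath(System.getenv("PATH"), required)) {
            System.out.println("Not on PATH: " + dir);
        }
    }
}
```

Keep in mind that the JVM reads PATH when it starts, so a change made in a console or in the system settings may only show up after restarting the IDE or shell that launches Java.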
Set the following two variables:
// From archive: http://sourceforge.net/projects/mozillaparser/files/mozillaparser/MozillaParser-v-0-3-0/MozillaParser-v-0-3-0.zip/download
String parserLibrary = "C:\\MozillaParser-v-0-3-0\\dist\\windows\\MozillaParser.dll";
// From archive: http://sourceforge.net/projects/mozillaparser/files/mozillaparser/Mozilla%20Components%20base%20v.0.1/mozilla-dist-bin-windows.zip/download
String mozillaBin = "C:\\bin";
Finally:
MozillaParser.init(parserLibrary, mozillaBin);
As a side note, the author wraps the above strings in File objects and then reads the absolute path back from each File. This is probably a better approach, but it is not necessary to get everything working.
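Going through a File object also makes it easy to fail early with a readable message when a path is wrong, rather than waiting for the native loader to complain. The following is only a sketch of that idea: the class and method names (NativePathCheck, requireExisting) are hypothetical, and the call to MozillaParser.init is left as a comment so the example compiles without the library.

```java
import java.io.File;
import java.io.FileNotFoundException;

class NativePathCheck {
    /**
     * Returns the absolute path of the given file or directory,
     * or throws if it does not exist with the expected type.
     */
    static String requireExisting(String path, boolean directory) throws FileNotFoundException {
        File f = new File(path);
        boolean ok = directory ? f.isDirectory() : f.isFile();
        if (!ok) {
            throw new FileNotFoundException(
                    (directory ? "Directory" : "File") + " not found: " + f.getAbsolutePath());
        }
        return f.getAbsolutePath();
    }

    public static void main(String[] args) {
        try {
            // Paths taken from the setup described above; adjust for your machine.
            String parserLibrary = requireExisting(
                    "C:\\MozillaParser-v-0-3-0\\dist\\windows\\MozillaParser.dll", false);
            String mozillaBin = requireExisting("C:\\bin", true);
            System.out.println("Library: " + parserLibrary);
            System.out.println("Mozilla bin: " + mozillaBin);
            // With both paths confirmed, they could be passed on:
            // MozillaParser.init(parserLibrary, mozillaBin);
        } catch (FileNotFoundException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

This check only confirms that the two paths exist. It cannot detect a missing dependent DLL, which is still resolved by the operating system when the library is loaded.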
For the record, Cobra (http://lobobrowser.org/cobra/java-html-parser.jsp) seems very promising. It offers a very helpful feature for extracting all links from a page: given an HTML page, Cobra automatically downloads includes, stylesheets and external JavaScript, and once parsing is done a simple routine returns all the links that were found. Unfortunately, it had a serious flaw: it could not parse http://www.google.com. Somehow, when parsing the JavaScript, it fell into an infinite loop. This severely reduced the attractiveness of the parser.
Also, I tried parsing your test document with org.htmlparser and it seems to have parsed it okay, even with the weird tags.
yep antony i had a chat with cnt reg the parsing.. now i am able to parse the malformed html tags without any probs.. thanks…