Dependency Management: HtmlUnit
If you are planning on building an API, please, please, think about dependency management. Don’t make me know more about your world view than necessary. Consider what happened to me as I explored HtmlUnit…
I’m using HtmlUnit to parse and interpret HTML web pages. I’ve been very impressed with this library so far. And I appreciate the hard work and dedication of people who give their software away for free. So, although this blog is a complaint, it should not be misconstrued into anything more than constructive criticism. Besides, what I am complaining about here is so universal that it really wouldn’t matter whose software I chose to scrutinize. The HtmlUnit authors just got lucky in this case.
What I want to do with HtmlUnit is quite simple. Given a string containing HTML, I’d like to query that HTML for certain tags and attributes. For example, I’d like to do this:
HtmlPage page = HTMLParser.parse(htmlString);
HtmlElement html = page.getDocumentElement();
HtmlElement listForm = html.getHtmlElementById("list_form");
assertEquals("/Library/books/manage.do", listForm.getAttributeValue("Action"));
Sweet, simple, uncomplicated. Just create the DOM from an HTML String, and then query that DOM.
Unfortunately, HtmlUnit does not appear to be that simple. What you have to do instead looks like this:
StringWebResponse stringWebResponse = new StringWebResponse(htmlString);
WebClient webClient = new WebClient();
webClient.setJavaScriptEnabled(false);
HtmlPage page = HTMLParser.parse(stringWebResponse, new TopLevelWindow("", webClient));
HtmlElement html = page.getDocumentElement();
HtmlElement listForm = html.getHtmlElementById("list_form");
assertEquals("/Library/books/manage.do", listForm.getAttributeValue("Action"));
The extra stuff in here is apparently due to the fact that the authors wanted to be able to simulate browsers, frames, and javascript. I think their goal was laudable. However, I wish they had done this without forcing those frames, browsers, and script engines down my throat.
Given my simple needs, why do I care about WebClient and Window? Why do I have to turn off the javascript engine? It may seem a small thing, but it bothers me nonetheless. It’s the principle of the matter that gets under my skin. The pragmatic programmers called it The Principle of Least Surprise. I call it, simply, dependency management. Don’t make people depend on more than they need.
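To make the principle concrete, here is a minimal sketch of an API shape that keeps the simple case simple. Everything in it is an assumption made for illustration: SimpleHtml, SimpleDom, and BrowserContext are made-up stand-ins, not HtmlUnit classes, and a toy regex scanner takes the place of a real HTML parser. The only point is that the one-argument parse supplies the browser machinery itself, so callers who just want to query a string never see it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in types for illustration only; none of these are HtmlUnit classes.

/** The heavy context that only browser simulation needs. */
final class BrowserContext {
    final boolean javaScriptEnabled;

    BrowserContext(boolean javaScriptEnabled) {
        this.javaScriptEnabled = javaScriptEnabled;
    }

    /** A default context with scripting off, for callers who only query markup. */
    static BrowserContext inert() {
        return new BrowserContext(false);
    }
}

/** A toy DOM: remembers the attributes of every element that has an id. */
final class SimpleDom {
    private final Map<String, Map<String, String>> attributesById = new HashMap<>();

    void register(String id, Map<String, String> attributes) {
        attributesById.put(id, attributes);
    }

    /** Looks up an attribute by element id; attribute names are case-insensitive. */
    String attributeOf(String id, String name) {
        Map<String, String> attributes = attributesById.get(id);
        if (attributes == null) {
            throw new NoSuchElementException("no element with id " + id);
        }
        return attributes.getOrDefault(name.toLowerCase(Locale.ROOT), "");
    }
}

final class SimpleHtml {
    // Toy patterns: opening tags and double-quoted attributes only.
    private static final Pattern TAG = Pattern.compile("<(\\w+)([^>]*)>");
    private static final Pattern ATTR = Pattern.compile("([\\w-]+)\\s*=\\s*\"([^\"]*)\"");

    private SimpleHtml() {}

    /** Simple entry point: the caller depends on nothing but a String. */
    static SimpleDom parse(String html) {
        return parse(html, BrowserContext.inert());
    }

    /** Full entry point for callers who really do want browser simulation. */
    static SimpleDom parse(String html, BrowserContext context) {
        SimpleDom dom = new SimpleDom();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            Map<String, String> attributes = new HashMap<>();
            Matcher attr = ATTR.matcher(tag.group(2));
            while (attr.find()) {
                attributes.put(attr.group(1).toLowerCase(Locale.ROOT), attr.group(2));
            }
            String id = attributes.get("id");
            if (id != null) {
                dom.register(id, attributes);
            }
        }
        // A real implementation would consult context.javaScriptEnabled here;
        // this sketch never runs scripts.
        return dom;
    }
}
```

With that shape, the simple case reads as SimpleHtml.parse(htmlString).attributeOf("list_form", "Action"), while the two-argument overload remains available to anyone who needs frames or scripts.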
The cost, to me, was an hour of rooting around in the documentation, example code, and my own trial-and-error experiments. (The benefit to me was another blog topic ;-) That cost may not seem great; but it must be paid again and again by everyone who wants to use the package in a way that doesn’t quite fit the authors’ world view.
There may, in fact, be a simpler way to do what I want to do with HtmlUnit. If there is, I haven’t been able to find it, and I’d be grateful if anyone out there, including the authors, could guide me in the right direction.

HtmlUnit is streamlined for accessing sites (perhaps the String case is not so well handled). Here is the normal thing you would do – coded in Groovy:
import com.gargoylesoftware.htmlunit.WebClient
def webClient = new WebClient()
def page = webClient.getPage(some_url)
def listForm = page.getFormByName('list_form')
assert '/Library/books/manage.do' == listForm.getAttributeValue("Action")

I can’t thank you enough; you saved me a couple of hours of bumbling around with HtmlUnit. I’ve run into quite an issue involving a JavaScript routine that returns a bit of JSON that I can decode into HTML. I then wanted to take that HTML and create an HtmlPage out of it, which I would then in turn parse.
I think I was on the right path. What I believe I was doing wrong was using my existing WebClient object to create the HtmlPage with a StringWebResponse.
I can’t give enough praise to the HtmlUnit library. It truly is a gem and “just works” in most cases.