Java, JavaFX, Groovy, Grails …
Screen Scraping? Groovy GPath FTW!
Jun 23rd
While I may be a little late to extol the virtues of Groovy, as so many before me have already done with great eloquence, I would like to quickly point out that Groovy GPath rocks!
Take, for instance, a requirement to screen scrape HTML pages. A couple of ways to approach this task are to use:
- Regular Expressions parsing (hopefully you’re a regex ninja)
- XPath (W3C recommendation)
and my favorite
- GPath (a Groovier XPath)
GPath is "a path expression language integrated into Groovy which allows parts of nested structured data to be identified."
This applies to nested POJOs as well as XML and, to boot, TagSoup'd HTML, as demonstrated a few years ago here.
So, as a quick example of how easy Groovy makes scraping, let's scrape this site for the text of the title element using a traditional Java way (keep in mind there are numerous ways to do XPath in pure Java) and then the Groovier way.
A Traditional Java XPath Approach:
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.XPathContext;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class Main {
    public static void main(String[] args) {
        try {
            // TagSoup turns real-world (possibly malformed) HTML into well-formed SAX events
            XMLReader tagsoup = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
            Builder builder = new Builder(tagsoup);
            Document doc = builder.build(new URL("http://www.ericonjava.com").openStream());
            // TagSoup puts elements in the XHTML namespace, so the XPath needs a prefix
            XPathContext context = new XPathContext("h", "http://www.w3.org/1999/xhtml");
            Nodes titleNodes = doc.query("/h:html/h:body/h:div/h:div/h:div/h:div/h:h1/h:a", context);
            System.out.println("TITLE = " + titleNodes.get(0).getValue());
        } catch (Exception ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
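As mentioned, there are numerous ways to do XPath in pure Java. For comparison, here is a minimal sketch that uses only the JDK's built-in javax.xml.xpath API instead of XOM. It assumes the input is already well-formed XML with no namespace, so TagSoup is left out and no prefix is needed. The class name, the evaluate helper and the inline sample page are all made up for illustration; they are not part of the example above.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class JdkXPathSketch {

    // Hypothetical helper: evaluates an XPath expression against an XML string
    // and returns the string value of the first matching node.
    static String evaluate(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a tiny, well-formed stand-in for a real page
        String page = "<html><body><div><h1><a href=\"/\">Sample Title</a></h1></div></body></html>";
        System.out.println("TITLE = " + evaluate(page, "/html/body/div/h1/a"));
    }
}
```

Fewer dependencies, but real-world HTML is rarely well-formed, so something like TagSoup would still be needed in front of it.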
The Groovy GPath Approach:
// XmlSlurper backed by TagSoup, so messy HTML still parses
def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
def seedURL = new URL("http://www.ericonjava.com")
seedURL.withReader { seedReader ->
    def seedHTML = slurper.parse(seedReader)
    // GPath: walk the element tree with plain property access
    def title = seedHTML.body.div.div.div.div.h1.a
    println "Title = ${title}"
}
Clearly… Groovy GPath FTW.