While I may be a little late to extol the virtues of Groovy, as so many before me have already done with great eloquence, I would like to quickly point out that Groovy GPath rocks!

Take, for instance, a requirement to screen-scrape HTML pages. A couple of ways to approach this task are to use:

and my favorite

GPath is

a path expression language integrated into Groovy which allows parts of nested structured data to be identified

This applies to nested POJOs as well as XML and, to boot, TagSoup'd HTML, as demonstrated a few years ago here.

So, as a quick example of how easy Groovy makes scraping, let’s scrape this site for the text of the title element using a traditional Java way (keep in mind there are numerous ways to do XPath in pure Java) and then the Groovier way.

A Traditional Java XPath Approach:


import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.XPathContext;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class Main {
    public static void main(String[] args) {
        try {

            // TagSoup is a SAX parser that cleans up malformed HTML into well-formed XML
            XMLReader tagsoup = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");

            Builder builder = new Builder(tagsoup);
            Document doc = builder.build(new URL("http://www.ericonjava.com").openStream());

            // TagSoup places elements in the XHTML namespace, so the XPath needs a bound prefix
            XPathContext context = new XPathContext("h", "http://www.w3.org/1999/xhtml");
            Nodes titleNodes = doc.query("/h:html/h:body/h:div/h:div/h:div/h:div/h:h1/h:a", context);

            System.out.println("TITLE = " + titleNodes.get(0).getValue());

        } catch (Exception ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
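
Since there are numerous ways to do XPath in pure Java, here is a rough sketch of one more, using only the JDK's built-in javax.xml.xpath instead of XOM. It is not a drop-in replacement for the code above: it assumes the input is already well-formed XHTML (there is no TagSoup cleanup step), and the class name JdkXPathTitle, the helper titleOf and the tiny inline page are made up for illustration. What it does show is that the namespace prefix dance is needed here as well.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class JdkXPathTitle {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical helper: evaluates an XPath expression against a
    // well-formed XHTML string and returns the string value of the first match.
    static String titleOf(String xhtml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for namespace-qualified XPath steps
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the prefix "h" to the XHTML namespace, like XPathContext does in XOM
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            @Override
            public String getPrefix(String uri) {
                return null; // not needed for evaluating expressions
            }
            @Override
            public Iterator<String> getPrefixes(String uri) {
                return null; // not needed for evaluating expressions
            }
        });
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page standing in for the real site
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><div>"
                + "<h1><a href=\"/\">Sample Title</a></h1></div></body></html>";
        System.out.println("TITLE = " + titleOf(page, "/h:html/h:body/h:div/h:h1/h:a"));
    }
}
```

Even with nothing but the JDK, that is a fair amount of ceremony just to bind one prefix and pull out one string.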

The Groovy GPath Approach:

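// TagSoup cleans up malformed HTML so XmlSlurper can parse it
// (on Groovy 4 and later, add: import groovy.xml.XmlSlurper)
def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())

def seedURL = new URL("http://www.ericonjava.com")
seedURL.withReader { seedReader ->
    def seedHTML = slurper.parse(seedReader)

    // GPath: walk down the element tree with plain property access
    def title = seedHTML.body.div.div.div.div.h1.a
    println "Title = ${title}"
}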

Clearly… Groovy GPath FTW.