Web Scraping - pford68/groovy-examples GitHub Wiki

Groovy's features make screen scraping easy. URL fetching in Groovy relies on Java classes such as java.net.URL, with additional Groovy methods such as withReader layered on top to make them easier to use.

URL Fetching

Example: Reading content from a web page

// Contents of http://www.mrhaki.com/url.html:
// Simple test document
// for testing URL extensions
// in Groovy.
 
// Convert the URL string to a java.net.URL. No request is made here;
// the fetch happens when the content is read (url.text, eachLine, withReader).
def url = "http://www.mrhaki.com/url.html".toURL()
 
assert '''\
Simple test document
for testing URL extensions
in Groovy.
''' == url.text
 
def result = []
// Looping through each line of the web page.
url.eachLine {
    if (it =~ /Groovy/) {
        result << it
    }
}
assert ['in Groovy.'] == result

// Reading the first line of the web page through a Reader.
// withReader closes the reader when the closure returns.
url.withReader { reader ->
    assert 'Simple test document' == reader.readLine()
}
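
The same URL methods work on any java.net.URL, not only on http addresses. The following is a minimal sketch that can be run without network access; the temporary file and its contents are assumptions made for illustration, chosen to mirror the test document above.

```groovy
// Write a small temporary file so the example does not depend on a remote server.
def file = File.createTempFile('url-demo', '.txt')
file.deleteOnExit()
file.text = 'Simple test document\nfor testing URL extensions\nin Groovy.\n'

// Building the URL object does not read anything yet.
URL url = file.toURI().toURL()

// url.text opens the stream and reads the whole resource in one go.
def wholeText = url.text

// withReader passes a Reader to the closure and closes it afterwards.
def firstLine = url.withReader('UTF-8') { reader -> reader.readLine() }

// eachLine calls the closure once per line.
def matches = []
url.eachLine { line ->
    if (line =~ /Groovy/) {
        matches << line
    }
}

println firstLine
println matches
```

Passing an explicit encoding to withReader avoids depending on the platform default charset.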

Another example

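@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2')
import org.ccil.cowan.tagsoup.Parser

String ENCODING = "UTF-8"

// TagSoup's Parser is a SAX parser that tolerates malformed HTML.
// On Groovy 3 and later, XmlSlurper lives in groovy.xml (import groovy.xml.XmlSlurper).
def PARSER = new XmlSlurper(new Parser())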
 
def url = "http://www.bing.com/search?q=web+scraping"
 
new URL(url).withReader (ENCODING) { reader -> 
 
    def document = PARSER.parse(reader) 
    // Extracting information
}

HTML Parsing

HTML parsing can be done with any of the many Java HTML-parsing libraries, such as TagSoup or CyberNeko. This example uses TagSoup, and thanks to Grape the dependency on the library is declared with a single @Grab annotation.

On top of that, Groovy's XmlSlurper and GPath give convenient access to specific parts of the parsed HTML. For the search example above, a single line of code is enough to extract the titles of the search results.

Below are two different ways to achieve that goal. Both first use GPath's ‘**’ (depth-first search) to walk all of the document's descendants and find the element whose id is results.

First method

// jQuery selector: $('#results h3 a')
document.'**'.find{ it['@id'] == 'results'}.ul.li.div.div.h3.a.each { println it.text() }

The first example spells out the full element path from the results element to the links that hold the titles. This is less handy than simply saying “I want all h3 descendants”, the way a jQuery selector does.

Second method

// jQuery selector: $('#results h3 a')
document.'**'.find{ it['@id'] == 'results'}.'**'.findAll{ it.name() == 'h3'}.a.each { println it.text() }

The second example uses ‘**’ a second time to ask for all h3 elements beneath the results element. Compared with the jQuery selector, though, the solution is still fairly complex.
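
Both methods can be tried offline on a small document. The sketch below assumes Groovy 3 or later (for the groovy.xml.XmlSlurper import) and uses a hypothetical, well-formed snippet in place of a real results page, so TagSoup is not needed here.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x drop this import

// Hypothetical stand-in for a search results page (well-formed on purpose).
def html = '''\
<html><body>
  <div id="results">
    <ul>
      <li><div><div><h3><a href="/one">First result</a></h3></div></div></li>
      <li><div><div><h3><a href="/two">Second result</a></h3></div></div></li>
    </ul>
  </div>
  <h3><a href="/other">Outside results</a></h3>
</body></html>
'''

def document = new XmlSlurper().parseText(html)

// Depth-first search for the element whose id attribute is "results".
def resultsNode = document.'**'.find { it.@id == 'results' }

// First method: spell out the full path from the results element to the links.
def byPath = resultsNode.ul.li.div.div.h3.a.collect { it.text() }

// Second method: search the descendants of the results element for h3 elements.
def byDepthFirst = resultsNode.'**'.findAll { it.name() == 'h3' }.collect { it.a.text() }

println byPath
println byDepthFirst
```

Both lists contain only the two titles inside the results element; the h3 outside it is skipped because the search starts from resultsNode.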

Jsoup

With Jsoup it is really easy to fetch and parse a URL.

  • We just declare our dependency on the Jsoup library (thanks to Grape).
  • Then we call the connect method of the Jsoup class. This creates a Connection object, which can be configured before the request is sent, for example to set cookies on it.
  • After creating the Connection object, we call its get method, which retrieves the web page, parses it into a DOM, and returns a Document object.

@Grab('org.jsoup:jsoup:1.10.2')
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document

Document doc = Jsoup.connect("http://www.bing.com/search?q=web+scraping").get();
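
As a rough sketch of the configuration step described above, the Connection can be adjusted before the request is sent. The user agent string, cookie name and value, and timeout below are made-up values for illustration only.

```groovy
@Grab('org.jsoup:jsoup:1.10.2')
import org.jsoup.Connection
import org.jsoup.Jsoup

// Building and configuring the Connection does not contact the server.
Connection connection = Jsoup.connect('http://www.bing.com/search?q=web+scraping')
        .userAgent('Mozilla/5.0 (example)')   // hypothetical user agent
        .cookie('session', 'abc123')          // hypothetical cookie
        .timeout(10000)                       // timeout in milliseconds

// The request is only sent when get() is called:
// def doc = connection.get()

println connection.request().cookies()
```

Each configuration method returns the Connection itself, which is what allows the calls to be chained.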

CSS selectors

Jsoup’s most important feature is its support for CSS selectors, a way of identifying parts of a web page that should be familiar to any jQuery or CSS user.

With the Document object we got before, the full code for filtering the links of interest for our example would be:

def results = doc.select("#results h3 a")
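
To see what the selector returns, the sketch below parses an inline HTML string with Jsoup.parse instead of fetching a live page. The markup is a hypothetical stand-in for a results page, not the actual structure served by the search engine.

```groovy
@Grab('org.jsoup:jsoup:1.10.2')
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical stand-in for a fetched results page.
String html = '''\
<html><body>
  <div id="results">
    <ul>
      <li><h3><a href="/one">First result</a></h3></li>
      <li><h3><a href="/two">Second result</a></h3></li>
    </ul>
  </div>
  <h3><a href="/other">Outside results</a></h3>
</body></html>
'''

Document doc = Jsoup.parse(html)

// select returns the matching elements in document order.
def results = doc.select('#results h3 a')

// Each match is an Element: text() gives the link title, attr() an attribute value.
def titles = results.collect { it.text() }
def links = results.collect { it.attr('href') }

titles.each { println it }
```

The link outside the results element is not matched, because the selector only considers descendants of the element with id results.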
