Downloading and Parsing Google Search Results

Google blocks accesses to its search results from unsupported user agents as a deterrent to keep people from stealing its intellectual property. This is rarely a concern when accessing Google from a web browser, but it becomes a bit of a problem when attempting to download from Google using the JRE’s URL class. For example, Google returns HTTP response code 403 (Forbidden) when executing the following code from a command-line tool.

URL url = new URL("http://www.google.com/search?q=algorithms");
url.openStream(); // throws java.io.IOException

Additionally, Google’s results are somewhat difficult to parse, perhaps as another deterrent to hackers. Although we don’t have to download and parse Google search results for Implementation Assignment #2, it would be beneficial to understand these techniques.

This tutorial will also provide a brief overview of Google’s PageRank calculation, which is a graph algorithm similar to the one required by Implementation Assignment #2.

Google’s PageRank Calculation

Academic literature counts citations, or backlinks, to a given paper. This gives some approximation of a paper’s importance or quality.

Google’s PageRank extends this idea to the web. Google bases the importance of a page A on the number of links from other pages T_i that point to page A. Furthermore, its importance is normalized by the total number of links on each page T_i.

Specifically, PageRank PR is defined as follows:

PR(A) = (1-d) + d (PR(T₁)/C(T₁) + ... + PR(T_n)/C(T_n))

We assume page A has pages T₁...T_n that point to it (i.e., citations). The parameter d is a damping factor that can be set between 0 and 1. Google usually sets d to 0.85. C(A) is defined as the number of outgoing links from page A.

Note that the PageRanks form a probability distribution over web pages, so the sum of the PageRanks for all web pages is one.

Downloading and Parsing

The following command-line program encodes command-line arguments as a Google search query, downloads the results, and displays the top ten corresponding links as standard output.

For example, providing the arguments “jade” and “cheng”, the following results were obtained.

$ bash-3.00$ java GoogleSearch jade cheng
Downloading [http://www.google.com/search?q=jade+cheng]...

  http://www2.hawaii.edu/~yucheng/
  http://www2.hawaii.edu/~yucheng/ta/ics-211-spring-2010/
  http://www.linkedin.com/pub/dir/Jade/Cheng
  http://www.jade-cheng.com/home/jade-cheng.pdf
  http://www.i-m-u.net/
  http://www.articlesbase.com/authors/jade-cheng/32958
  http://twitter.com/abcdefghijade
  http://www.facebook.com/people/Precy-Jade-Cheng/1493988487
  http://en-gb.facebook.com/people/Jade-Cheng/736990743
  http://investing.businessweek.com/research/stocks/people/person.asp?personId=9165498&ticker=BRCNF:US

The program first constructs a URL for a Google query using the command-line arguments. For example, the following URL searches for “jade” and “cheng”.

http://www.google.com/search?q=jade+cheng

Then the program explicitly sets the user agent used to download content from Google. This is accomplished using the URLConnection class. The program then downloads the HTML file for the constructed URL. Finally, the program parses the HTML, extracts the links for the top ten results, and prints those links to standard output.

The complete source code is provided below.

/*
 * Copyright (c) 2011, Jade Cheng
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *     * Redistributions of source code must retain the above copyright
 *       notice, this list of conditions and the following disclaimer.
 *     * Redistributions in binary form must reproduce the above copyright
 *       notice, this list of conditions and the following disclaimer in the
 *       documentation and/or other materials provided with the distribution.
 *     * Neither the name of the University of Hawaii nor the
 *       names of its contributors may be used to endorse or promote products
 *       derived from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY Jade Cheng ''AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED. IN NO EVENT SHALL Jade Cheng BE LIABLE FOR ANY
 * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

/**
 * This program encodes command-line arguments as a Google search query,
 * downloads the results, and displays the corresponding links as output.
 */
public final class GoogleSearch {

    /**
     * The main entry point of the program.
     * 
     * @param args
     *            The command-line arguments. These arguments are encoded as a
     *            Google search query.
     */
    public static void main(final String[] args) {
        // Check for usage errors.
        if (args.length == 0) {
            System.out.println("usage: GoogleSearch query ...");
            return;
        }

        // Catch IO errors that may occur while encoding the query, downloading
        // the results, or parsing the downloaded content.
        try {
            // Encode the command-line arguments as a Google search query.
            final URL url = encodeGoogleQuery(args);

            // Download the content from Google.
            System.out.println("Downloading [" + url + "]...\n");
            final String html = downloadString(url);

            // Parse and display the links.
            final List<URL> links = parseGoogleLinks(html);
            for (final URL link : links)
                System.out.println("  " + link);

        } catch (final IOException e) {
            // Display an error if anything fails.
            System.err.println(e.getMessage());
            System.exit(1);
        }
    }

    /**
     * Reads all contents from an input stream and returns a string from the
     * data.
     * 
     * @param stream
     *            The input stream to read.
     * 
     * @return A string built from the contents of the input stream.
     * 
     * @throws IOException
     *             Thrown if there is an error reading the stream.
     */
    private static String downloadString(final InputStream stream)
            throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        int ch;
        while (-1 != (ch = stream.read()))
            out.write(ch);
        return out.toString();
    }

    /**
     * Downloads the contents of a URL as a String. This method alters the
     * User-Agent of the HTTP request header so that Google does not return
     * Error 403 Forbidden.
     * 
     * @param url
     *            The URL to download.
     * 
     * @return The content downloaded from the URL as a string.
     * 
     * @throws IOException
     *             Thrown if there is an error downloading the content.
     */
    private static String downloadString(final URL url) throws IOException {
        final String agent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US)";
        final URLConnection connection = url.openConnection();
        connection.setRequestProperty("User-Agent", agent);
        final InputStream stream = connection.getInputStream();
        return downloadString(stream);
    }

    /**
     * Encodes a string of arguments as a URL for a Google search query.
     * 
     * @param args
     *            The array of arguments to pass to Google's search engine.
     * 
     * @return A URL for a Google search query based on the arguments.
     */
    private static URL encodeGoogleQuery(final String[] args) {
        try {
            final StringBuilder localAddress = new StringBuilder();
            localAddress.append("/search?q=");

            for (int i = 0; i < args.length; i++) {
                final String encoding = URLEncoder.encode(args[i], "UTF-8");
                localAddress.append(encoding);
                if (i + 1 < args.length)
                    localAddress.append("+");
            }

            return new URL("http", "www.google.com", localAddress.toString());

        } catch (final IOException e) {
            // Errors should not occur under normal circumstances.
            throw new RuntimeException(
                    "An error occurred while encoding the query arguments.");
        }
    }

    /**
     * Parses HTML output from a Google search and returns a list of
     * corresponding links for the query. The parsing algorithm is crude and may
     * not work if Google changes the output of their results. This method works
     * adequately as of February 28, 2011.
     * 
     * @param html
     *            The HTML output from Google search results.
     * 
     * @return A list of links for the query.
     * 
     * @throws IOException
     *             Thrown if there is an error parsing the results from Google
     *             or if one of the links returned by Google is not a valid URL.
     */
    private static List<URL> parseGoogleLinks(final String html)
            throws IOException {

        // These tokens are adequate for parsing the HTML from Google. First,
        // find a heading-3 element with an "r" class. Then find the next anchor
        // with the desired link. The last token indicates the end of the URL
        // for the link.
        final String token1 = "<h3 class=\"r\">";
        final String token2 = "<a href=\"";
        final String token3 = "\"";

        final List<URL> links = new ArrayList<URL>();

        try {
            // Loop until all links are found and parsed. Find each link by
            // finding the beginning and ending index of the tokens defined
            // above.
            int index = 0;
            while (-1 != (index = html.indexOf(token1, index))) {
                final int result = html.indexOf(token2, index);
                final int urlStart = result + token2.length();
                final int urlEnd = html.indexOf(token3, urlStart);
                final String urlText = html.substring(urlStart, urlEnd);
                final URL url = new URL(urlText);
                links.add(url);

                index = urlEnd + token3.length();
            }

            return links;

        } catch (final MalformedURLException e) {
            throw new IOException("Failed to parse Google links.");
        } catch (final IndexOutOfBoundsException e) {
            throw new IOException("Failed to parse Google links.");
        }
    }
}

References

Sergey Brin, Lawrence Page The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, Vol. 30 (1998), pp. 107-117.