Friday, September 7, 2012

Playing with Google Search Results

You will need Visual Studio and the HtmlAgilityPack library.

Create a Visual Studio project, for example a C# Windows Forms application. Drop a TextBox, a Button and a ListView on the form, and create a class for the helper methods, say Helper.cs. First, I'm using System.Net.WebClient to call Google and fetch a page of search results.

public static WebClient webClient = new WebClient();

public static string GetSearchResultHtml(string keywords)
{
    StringBuilder sb = new StringBuilder("http://www.google.com/search?q=");
    // Escape the keywords so spaces and special characters survive the query string.
    sb.Append(Uri.EscapeDataString(keywords));
    return webClient.DownloadString(sb.ToString());
}
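
One caveat: Google may reject requests that don't carry a browser-like User-Agent header, and WebClient doesn't send one by default. Depending on your environment, setting one may help; the header value below is just an example, not a requirement:

```csharp
// Optional: pretend to be a regular browser.
// The exact User-Agent string here is only an example.
webClient.Headers[HttpRequestHeader.UserAgent] =
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko)";
```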

The string that is returned is the HTML of the first page of Google search results for the string passed to the method. Opened in a web browser, it looks something like this:

Google search result page

What I want to extract is the actual links, marked in red on the screenshot above. Here I'm going to use HtmlAgilityPack to load the string into an HtmlDocument object. Once the string is loaded, a simple LINQ query extracts the nodes that match two conditions: they are HTML links (a elements with an href attribute), and the URL of the link contains either "/url?" or "?url=". At this point, I get a rather unreadable list of values.

Raw URLs

To bring it into readable form, I'll match each URL against a regular expression, decode it, and then load the results into the ListView. Here is the code:

// Captures the value of a "q" or "url" query-string parameter.
public static Regex extractUrl = new Regex(@"[&?](?:q|url)=([^&]+)", RegexOptions.Compiled);

public static List<String> ParseSearchResultHtml(string html)
{
    List<String> searchResults = new List<string>();

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null when nothing matches, so guard against that.
    var anchors = doc.DocumentNode.SelectNodes("//a");
    if (anchors == null)
        return searchResults;

    var nodes = (from node in anchors
                 let href = node.Attributes["href"]
                 where null != href
                 where href.Value.Contains("/url?") || href.Value.Contains("?url=")
                 select href.Value).ToList();

    foreach (var node in nodes)
    {
        var match = extractUrl.Match(node);
        // HttpUtility lives in System.Web; add a reference to that assembly.
        string url = HttpUtility.UrlDecode(match.Groups[1].Value);
        searchResults.Add(url);
    }

    return searchResults;
}
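
The post doesn't show the form-side glue, so here is a minimal sketch of a Button click handler wiring the two helpers together. The control names textBox1, listView1 and button1 are the Visual Studio defaults and are my assumption:

```csharp
// Hypothetical click handler: run the search and show the decoded URLs.
private void button1_Click(object sender, EventArgs e)
{
    string html = Helper.GetSearchResultHtml(textBox1.Text);
    listView1.Items.Clear();
    foreach (string url in Helper.ParseSearchResultHtml(html))
    {
        listView1.Items.Add(url);
    }
}
```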

Here is the result:

Final Results

I'm not sure how useful this is in practice, but as an exercise you could add an option to parse a certain number of result pages rather than just the first one. Be warned, though: if you run these queries in automated mode, Google will soon start serving you 503 errors.
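
Paging could be done with Google's start query parameter, which offsets the results by 10 per page. A sketch of a hypothetical helper that builds the per-page URL, assuming the 2012-era URL format still applies:

```csharp
// Hypothetical helper: builds the search URL for a given zero-based page.
// Google's "start" parameter offsets results by 10 per page.
public static string BuildSearchUrl(string keywords, int page)
{
    return "http://www.google.com/search?q=" + Uri.EscapeDataString(keywords)
         + "&start=" + (page * 10);
}
```

Calling this in a loop and feeding each page through ParseSearchResultHtml would collect results across pages.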
