You will need:
- HtmlAgilityPack HTML Parser
- Development environment
- Internet connection
Create a Visual Studio project, for example a C# Windows Forms application. Drop a TextBox, a Button and a ListView on the form. Create a class for the methods to be used, let's say Helper.cs. First, I'm using System.Net.WebClient to call Google and get a page of search results.
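using System;
using System.Net;
using System.Text;

// The members below go inside the Helper class (Helper.cs).
public static WebClient webClient = new WebClient();

// Downloads the HTML of the first Google results page for the given keywords.
public static string GetSearchResultHtlm(string keywords)
{
    StringBuilder sb = new StringBuilder("http://www.google.com/search?q=");
    // Escape the keywords so spaces and characters such as '&' or '#' do not break the query string.
    sb.Append(Uri.EscapeDataString(keywords));
    return webClient.DownloadString(sb.ToString());
}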
The returned string is the HTML of the first page of Google results for the keywords passed to the method. Opened in a web browser, it looks something like this:
Google search result page
What I want to extract are the actual links, which are marked in red on the screenshot above. Here I'm going to use HtmlAgilityPack to load the string into an HtmlDocument object. After the string is loaded, I will use a simple LINQ query to extract the nodes that match certain conditions: they are HTML links (a elements with an href attribute), and the URL of the link contains either "/url?" or "?url=". At this point, I get quite an unreadable list of values.
Raw URLs
To bring it into readable form, I'll match each value against a regular expression, URL-decode the captured part, and then load the results into the ListView. Here is the code:
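using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Web; // HttpUtility; requires a reference to System.Web

// The members below go inside the Helper class (Helper.cs).
// Captures the value of the "q" or "url" query parameter.
public static Regex extractUrl = new Regex(@"[&?](?:q|url)=([^&]+)", RegexOptions.Compiled);

public static List<String> ParseSearchResultHtml(string html)
{
    List<String> searchResults = new List<string>();
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null when nothing matches, so guard before querying.
    var anchors = doc.DocumentNode.SelectNodes("//a");
    if (anchors == null)
        return searchResults;

    // Keep only the href values that look like Google redirect links.
    var nodes = (from node in anchors
                 let href = node.Attributes["href"]
                 where null != href
                 where href.Value.Contains("/url?") || href.Value.Contains("?url=")
                 select href.Value).ToList();

    foreach (var node in nodes)
    {
        var match = extractUrl.Match(node);
        if (!match.Success)
            continue; // no q/url parameter in this link
        string test = HttpUtility.UrlDecode(match.Groups[1].Value);
        searchResults.Add(test);
    }
    return searchResults;
}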
Here is the result:
Final Results
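To see what the regular expression does on its own, without calling Google at all, here is a minimal sketch. The class name UrlExtractDemo, the method ExtractTarget and the sample href are made up for illustration. It also swaps HttpUtility.UrlDecode for Uri.UnescapeDataString to avoid the System.Web reference, which means a '+' is not turned into a space as HttpUtility would do.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original Helper.cs.
public static class UrlExtractDemo
{
    // Same pattern as above: capture the value of a "q" or "url" query parameter.
    private static readonly Regex ExtractUrl =
        new Regex(@"[&?](?:q|url)=([^&]+)", RegexOptions.Compiled);

    // Returns the decoded target URL, or null when the href has no q/url parameter.
    public static string ExtractTarget(string href)
    {
        Match match = ExtractUrl.Match(href);
        if (!match.Success)
            return null;

        // Assumption: percent-decoding is enough here; '+' is left untouched.
        return Uri.UnescapeDataString(match.Groups[1].Value);
    }
}
```

For a made-up href such as "/url?q=http://example.com/page%3Fid%3D1&sa=U", ExtractTarget captures everything between "q=" and the next "&" and decodes it to http://example.com/page?id=1. A link without a q or url parameter, such as "/search?hl=en", gives null.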
I'm not quite sure how useful this is in practice, but as an exercise you could add an option to parse a given number of result pages rather than just the first one. Be aware that if you run such queries in an automated fashion, Google will soon start responding with 503 errors.
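A rough sketch of the multi-page exercise could start by building one URL per results page. This is only an illustration under assumptions: the class PagedSearch and the method BuildPageUrls are hypothetical names, and the sketch assumes Google selects a page through a "start" query parameter with 10 results per page, which is not something the code above relies on.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original Helper.cs.
public static class PagedSearch
{
    // Assumption: 10 results per page, selected with the "start" query parameter.
    private const int ResultsPerPage = 10;

    // Builds the search URLs for the first pageCount result pages.
    public static List<string> BuildPageUrls(string keywords, int pageCount)
    {
        var urls = new List<string>();
        string query = Uri.EscapeDataString(keywords);

        for (int page = 0; page < pageCount; page++)
        {
            int start = page * ResultsPerPage;
            urls.Add("http://www.google.com/search?q=" + query + "&start=" + start);
        }
        return urls;
    }
}
```

Each URL could then be downloaded with webClient.DownloadString and passed to ParseSearchResultHtml. Pausing between requests, for example with Thread.Sleep, may delay the 503 responses, though that is not guaranteed.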
by Evgeny. Also posted on my website