Sunday, July 6, 2008

Using the .NET WebClient to Scrape Web Pages

.NET comes with a nifty little class called System.Net.WebClient that lets you easily interact with a web page.

To play with it, I decided to scrape this page that generates Shakespearean insults and grab just the insult from the output. That gives me easy command-line access to random Shakespearean insults (something I often find myself in need of, to be sure).

# Retrieves a random Shakespearean insult from the Internet.
#
# Author: Tojo2000 <tojo2000@tojo2000.com>
# (c)2008 All Rights Reserved
#
# Usage: get-insult.ps1

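# Capture a run of text that sits between two newlines and contains no
# angle brackets, i.e. text that is not part of an HTML tag.
$regex = New-Object System.Text.RegularExpressions.Regex('\n([^<>]+)\n');

$web_client = New-Object System.Net.WebClient;
$web_client.Headers.Add("user-agent", "PowerWeb");

$data = $web_client.DownloadString("http://www.pangloss.com/seidel/Shaker/index.html");

# Regex.Match() always returns a Match object, even when nothing matched,
# so test its Success property rather than the object itself.
$match = $regex.Match($data);
if ($match.Success) {
  echo $match.Groups[1].Value;
}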
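
If you want to experiment with the regex without hitting the site on every run, here is a minimal sketch that applies the same pattern to a canned string. The markup in $sample is made up for illustration and is only my guess at the general shape of the page, so the real HTML may well differ. It checks the Success property because Regex.Match() hands back a Match object whether or not anything matched.

```powershell
# Hypothetical sample markup; the real page's HTML may look different.
$sample = @(
  '<html><body>',
  '<font size="+2">',
  'Thou artless base-court apple-john!',
  '</font>',
  '</body></html>'
) -join "`n"

# Same pattern as the script: text between two newlines with no angle brackets.
$regex = New-Object System.Text.RegularExpressions.Regex('\n([^<>]+)\n')

$insult = $null
$match = $regex.Match($sample)
if ($match.Success) {
  # Trim() drops any stray whitespace around the captured text.
  $insult = $match.Groups[1].Value.Trim()
  Write-Output $insult
} else {
  Write-Warning "No insult found in the sample."
}
```

Once the pattern behaves the way you expect on the sample, point it back at the downloaded page.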

Note: I'm not affiliated with this website, so obviously don't abuse it. It's just an example.
