Friday, October 30, 2009

Working with CSV Files

It's been a while since I had an update to the blog, and I've been waiting for something to jump out at me. Today I read on the news that the White House has released some visitor logs to the public, and they are making them available for download as a CSV file, and it piqued my interest, so I downloaded the file.

PowerShell v1 and v2 both have a module called Import-CSV that makes it ridiculously easy to work with CSV files. They will slurp up the file and output an array of objects, one per row, with properties that correspond to the columns in the file.

Let's import the file and look at a record:

PS> $visitors = Import-Csv .\10-30-waves-posting.csv
PS> $visitors[-1]


NAMELAST : ZIMPHER
NAMEFIRST : NANCY
NAMEMID : L
UIN : U81194
BDGNBR : 72400
ACCESS_TYPE : VA
TOA : 4/1/20091:45:42PM
POA : D0102
TOD : 4/1/20095:50:21PM
POD : D1
APPT_MADE_DATE : 3/25/20095:07:33PM
APPT_START_DATE : 4/1/20092:00:00PM
APPT_END_DATE : 4/1/200911:59:00PM
APPT_CANCEL_DATE :
Total_People : 3
LAST_UPDATEDBY : J7
POST : WIN
LastEntryDate : 3/25/20095:08:57PM
TERMINAL_SUFFIX : J7
visitee_namelast : DOUGLAS
visitee_namefirst : DEREK
MEETING_LOC : OEOB
MEETING_ROOM : 459
CALLER_NAME_LAST : WILKINS
CALLER_NAME_FIRST : ELIZABETH
CALLER_ROOM :
Description :

Interesting. I wonder, who had the most visitors (I snipped out the middle of the results)?

PS> $visitors |
Group-Object visitee_namelast |
Sort-Object Count |
Select-Object Count, Name

Count Name
----- ----
1 ANDREW
1 LONG
1 BINNIX
1 BREWER
...
7 'SUMMERS'
7 JARRETT
8 'POTUS/FLOTUS'
8 'EMANUEL'
9 POTUS/FLOTUS
9 'DOEBLER'
10 DOEBLER
10 'TCHEN'
12 TCHEN
42 'POTUS'
56 POTUS
69 OFFICE
211

Uh oh. There's something funky about this file. There are empty lines, and it's inconsistent about when it puts quotes around text fields. I'm pretty sure we don't have any fields with commas in them, so let's try stripping out any lines that only have whitespace and/or commas in them and then strip out single quotes from the rest of the lines:

PS> Get-Content .\10-30-waves-posting.csv |
?{$_ -notmatch '^(\s|,)*$'} |
%{$_.replace("'", '')} > wh.csv

** Note: that regex after -notmatch above can be read as:
^ start at the beginning of the string
(\s|,)* zero or more whitespace or comma characters
$ the end of the string

There. Now we can try that again, using our new file (snipped for length again).

PS> $visitors = Import-CSV .\wh.csv
PS> $visitors |
Group-Object visitee_namelast |
Sort-Object Count |
Select-Object Count, Name
Count Name
----- ----
1 SHAH
1 LONG
1 PETER
1 ANDREW
...
19 DOEBLER
22 TCHEN
69 OFFICE
98 POTUS

So now we know that the person with the largest number of visitors was the President of the United States of America. Hardly surprising, but you can easily explore this data with PowerShell. Here are a few more queries you might try:

Look at the records of the visits to the President

PS> $visitors |
?{$_.visitee_namelast -eq 'POTUS'}

How many visits occurred in July?

PS> $visitors |
?{$_.APPT_START_DATE -match '^7/'} |
sort APPT_START_DATE |
select NAMELAST, NAMEFIRST, APPT_START_DATE

What was the description given of the appointments for meeting the President?

PS> $visitors |
?{$_.visitee_namelast -eq 'POTUS'} |
select APPT_START_DATE, NAMEFIRST, NAMELAST, Description |
sort APPT_START_DATE


Hopefully this gives you a good idea of how you can manipulate CSV data using PowerShell using Import-CSV and a little filtering.