Thursday, September 25, 2008

Pipes, Loops, and Exploding Memory Usage

Background

Recently I wrote a script to automate some reports that are a huge pain to produce by hand. I was pretty pleased with myself when I finished, but when I ran it, it kept going...and going... A long run time on its own might not have been strange, because there was a lot of data, but when I popped up Task Manager I noticed that powershell.exe was using 1GB of RAM and climbing. Clearly I had a problem with the design of the script. What shocked me was that I was able to fix it by replacing a foreach loop with a pipe to ForEach-Object, and the end result is that my powershell.exe process never uses more than 55MB of RAM.


Passing Objects Down the Pipe

One of the cool things about pipes is that as data is generated by a cmdlet or function, it is passed down the pipe to the next one without having to wait for all of the data to finish being generated.

Consider the following:

C:\PowerShell> dir c:\ -include *.log -recurse | % {$_.FullName}

As each file is found that matches the pattern, it will be returned.  Now let's try it with a foreach loop:

foreach ($file in (dir c:\ -include *.log -recurse)) {
  $file.FullName
}

This time we have to wait for the entire hard drive to be scanned before any output comes out, and we'll use a lot more memory. Why? Because when you use parentheses, the expression between them is evaluated BEFORE the loop is processed, so every file object has to be collected and held in memory before the first iteration runs. This is essentially the same as the following:

$files = dir c:\ -include *.log -recurse

foreach ($file in $files) {
  $file.FullName
}
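
To make the difference in execution order visible without scanning a whole drive, here is a minimal sketch. Get-Number is a made-up helper for illustration, not a built-in cmdlet; it records a log entry each time it produces a value so we can see when the consumer gets to run.

```powershell
# Hypothetical helper (not a built-in): emits three numbers and
# records a log entry just before each one is written to the pipeline.
function Get-Number {
    foreach ($n in 1..3) {
        $script:log += "produce $n"
        $n
    }
}

# Pipe: each number is handed to ForEach-Object as soon as it is produced.
$script:log = @()
Get-Number | ForEach-Object { $script:log += "consume $_" }
$pipelineLog = $script:log

# foreach with parentheses: Get-Number runs to completion first,
# and only then does the loop body see the first value.
$script:log = @()
foreach ($n in (Get-Number)) { $script:log += "consume $n" }
$loopLog = $script:log

$pipelineLog -join ', '
$loopLog -join ', '
```

In the piped version the produce and consume entries should alternate, while in the foreach version all three produce entries should come before the first consume entry. With three integers that doesn't matter, but with tens of thousands of large objects it is the difference between holding one object at a time and holding all of them.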

Most things you use PowerShell for probably won't be so large that this becomes a huge issue. In my case I was querying Systems Management Server for inventory information on tens of thousands of computers, so the memory usage really started to impact the other applications I was running.


Planning Ahead

As you're creating your scripts, try to be conscious of where you're using piped commands vs. loops, and consider how it would change your script if you refactored the code to do it the other way. I tend to use loops more when I'm writing scripts because they are generally more readable and easier to update for the next poor sap who has to edit my code. Whichever way you choose to get the job done, though, it's important to understand the flow of execution of your script.

Some questions I try to ask myself when I think I'm done with my scripts:
  • Where am I causing my script to stop and collect all of the pipeline input before continuing?  (Sorting in the middle of the pipeline is the classic example of this.)  Does it matter?
  • What variables am I declaring at the top level that can be moved so that they are deleted automatically when they leave scope?
  • What is the impact on readability?
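
As a rough illustration of the first two questions, the sketch below shows Sort-Object holding up a pipeline, and a variable that lives only inside a function. The names $events, Get-Total and $numbers are made up for this example.

```powershell
# 1. Sort-Object in the middle of a pipeline: it cannot emit anything until
#    it has seen every input object, so everything upstream is collected first.
$script:events = @()
1..3 | ForEach-Object { $script:events += "emit $_"; $_ } |
    Sort-Object -Descending |
    ForEach-Object { $script:events += "receive $_" }
$eventOrder = $script:events -join ', '

# 2. Scope: $numbers is created inside the function, so it goes away when the
#    function returns instead of lingering at the top level of the script.
function Get-Total {
    $numbers = 1..1000
    ($numbers | Measure-Object -Sum).Sum
}
$total = Get-Total
$numbersStillDefined = Test-Path variable:numbers

$eventOrder
$total
$numbersStillDefined
```

All three "emit" entries should appear before any "receive" entry, which is the stop-and-collect behavior the first question is asking about. For the second part, keep in mind that leaving scope only makes the data eligible for cleanup; exactly when the memory is reclaimed is up to the .NET garbage collector.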
