Wednesday, January 25, 2012

Regular Expressions: Using Lookaheads to Group and Grab Exactly the Text You Want

My previous post used zero-width lookahead and lookbehind assertions to grab some text from a gnarly-looking string, so I thought I'd follow up with a quick post on how that works.  It's not as complicated as the name sounds.

I had this string, from which I wanted to extract the domain and username:

\\SERVER\root\cimv2:Win32_Group.Domain="MYDOMAIN",Name="adminuser"

I know that I want the text between the double-quotes immediately following the words "Domain" and "Name".  I decided on this approach:

$string -match '(?<=Domain\=")(?<domain>[^"]+).*(?<=Name\=")(?<name>[^"]+)'

The (?<domain>...) and (?<name>...) parts, shown in red, are named groups, as described in the previous post. Whatever they capture is stored in the automatic variable $matches under those names (e.g. $matches.domain).  The (?<=...) parts, shown in blue, are the zero-width lookbehind assertions.
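To see it run end to end, here is a minimal sketch using the sample string from above. The variable names $pattern, $domain and $name are just my choices for illustration.

```powershell
# The sample string from above, in single quotes so nothing gets expanded.
$string  = '\\SERVER\root\cimv2:Win32_Group.Domain="MYDOMAIN",Name="adminuser"'
$pattern = '(?<=Domain\=")(?<domain>[^"]+).*(?<=Name\=")(?<name>[^"]+)'

if ($string -match $pattern) {
    # -match fills the automatic variable $Matches; named groups become keys.
    $domain = $Matches.domain   # MYDOMAIN
    $name   = $Matches.name     # adminuser
    "Domain: $domain, Name: $name"
}
```

Note that $Matches is only refreshed when -match succeeds, which is why the lookups sit inside the if block.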

So what are they good for?  You can use lookaheads and lookbehinds if you want to make sure that a specific pattern comes before or after the pattern you want to capture, but don't actually want that pattern to be captured.  They look like groups, but will not be added to $matches.
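A quick way to convince yourself is to compare an ordinary literal prefix with a lookbehind. This is only a sketch; the shortened $text string and the variable names are made up for the comparison.

```powershell
$text = 'Domain="MYDOMAIN"'

# Ordinary prefix: 'Domain="' is consumed, so it is part of the overall match.
$null = $text -match 'Domain="(?<domain>[^"]+)'
$withPrefix = $Matches[0]      # Domain="MYDOMAIN

# Lookbehind: the prefix is only checked, never consumed.
$null = $text -match '(?<=Domain=")(?<domain>[^"]+)'
$withoutPrefix = $Matches[0]   # MYDOMAIN

# $Matches holds just the overall match (key 0) and the named group.
$entryCount = $Matches.Count   # 2
```

In both cases $Matches.domain is the same; the difference is in what the overall match, $Matches[0], contains.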

A lookbehind assertion looks like this:

(?<=YOUR_PATTERN_HERE)

A lookahead assertion looks like this:

(?=YOUR_PATTERN_HERE)
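The two can be combined. As a small sketch (the shortened $text string is an assumption for the example), this grabs the value after Name=" and also insists that a closing double-quote follows it, without including either quote in the match:

```powershell
$text = 'Name="adminuser"'

# (?<=Name=")  the text just before the match must be: Name="
# [^"]+        the part we actually want
# (?=")        the very next character must be a double-quote
if ($text -match '(?<=Name=")[^"]+(?=")') {
    $value = $Matches[0]   # adminuser
}
```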

Ah, but what if I want to make sure that a certain pattern does not precede or follow my group?  Just replace the equals sign with an exclamation point, like so:

(?<!YOUR_PATTERN_HERE)
(?!YOUR_PATTERN_HERE)
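Here is a rough sketch of the negative versions in action. The test strings are invented for the example; each -match simply returns $true or $false.

```powershell
# Negative lookbehind: match 'adminuser' only when it is NOT preceded by Domain="
$afterName   = 'Name="adminuser"'   -match '(?<!Domain=")adminuser'   # True
$afterDomain = 'Domain="adminuser"' -match '(?<!Domain=")adminuser'   # False

# Negative lookahead: match 'admin' only when it is NOT followed by 'user'
$beforeOther = 'administrator' -match 'admin(?!user)'   # True
$beforeUser  = 'adminuser'     -match 'admin(?!user)'   # False
```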

So let's break down what my regex does:

# Assert that the text immediately before the current position is 'Domain="'.
# This is zero-width: nothing is consumed or captured.
(?<=Domain\=") 

# Immediately following it, capture one or more characters that are not the 
# double-quote character and name this group "domain"
(?<domain>[^"]+)

# Match zero or more of any character.
.*

# Assert that the text immediately before the current position is 'Name="'.
# Again zero-width: nothing is consumed or captured.
(?<=Name\=")

# Immediately following it, capture one or more characters that are not the 
# double-quote character and name this group "name"
(?<name>[^"]+)



3 comments:

Anonymous said...

"The characters in red are the zero-width lookbehind assertions"

You mean the characters in blue..

Tim Johnson said...

D'oh! Yes, of course you're right. I'll fix it.

Anonymous said...

Relating to the color mix-up, it still persists: blue are the look-behinds, red are the named groups.

Nice article, though.