StackOverflow2013

<p>Whether in Perl or not, sometimes the problem with a regular expression is its greediness. Let's say I want to capture someone's first name, and the string looks like this:</p>
<pre><code>Bob Baker
</code></pre>
<p>I could use this regular expression:</p>
<pre><code>sed 's/^\(.*\) .*$/\1/'
</code></pre>
<p>That would work with <em>Bob Baker</em>, but not with <em>Bob Barry Baker</em>. The problem is that my regular expression is greedy and will select all of the characters up to the <em>last</em> space, so I would end up not with <code>Bob</code> but with <code>Bob Barry</code>. A common way to solve this is to match every character <em>except</em> the one you don't want:</p>
<pre><code>sed 's/^\([^ ]*\) .*$/\1/'
</code></pre>
<p>In this case, I am specifying any run of characters <em>not</em> including a space. This will change both <code>Bob Baker</code> and <code>Bob Rudolph Baker</code> to just <code>Bob</code>.</p>
<p>Perl has another way of specifying a non-greedy regular expression. In Perl, you add a <code>?</code> after the quantifier you want to be non-greedy. Both of these will change a string containing <code>Bob Barry Baker</code> to just <code>Bob</code>:</p>
<pre><code>$string =~ s/^([^ ]+) .*$/$1/;
$string =~ s/^(.+?) .*$/$1/;
</code></pre>
<p>By the way, <strong>these are <em>not</em> equivalent</strong>!</p>
<p>With the <em>everything but a space</em> regex, I could do this:</p>
<pre><code>$string =~ /^([^ ]+)( )(\[\d{4}\])( )(\(\d+p\))(\.)([^.]+)/
</code></pre>
<p>With the non-greedy quantifier:</p>
<pre><code>$string =~ /^(.+?)( )(\[\d{4}\])( )(\(\d+p\))(\.)(.*)/
</code></pre>
<p>And here it is with the <code>x</code> modifier, which lets you spread the same regular expression over multiple lines. That is nice because you can add comments to explain what you're doing:</p>
<pre><code>$string =~ /
    ^(.+?)          #Any set of characters (non-greedy)
    ([ ])           #Space
    (\[\d{4}\])     #[1959]
    ([ ])           #Space
    (\([0-9]+p\))   #(430p)
    [.]             #Period
    ([^\.]+)        #File Suffix (no period)
/x
</code></pre>
<p>And, at this point, you might as well follow Damian Conway's <em>Best Practice</em> recommendations on Perl regular expressions.</p>
<pre><code>$string =~ /
    \A                  #Start of string anchor
    ( .+? )             #Any set of characters (non-greedy)
    ( [ ] )             #Space
    ( \[ \d{4} \] )     #[1959]
    ( [ ] )             #Space
    ( \( [0-9] +p \) )  #(430p)
    ( [.] )             #Period
    ( [^\.]+ )          #File Suffix (no period)
    \Z                  #End of string anchor
/xm;
</code></pre>
<p>Since <code>x</code> ignores white space in the pattern (except inside a bracketed character class, which is why the space is written as <code>[ ]</code>), I can even add spaces inside subgroups on the same line. In this case, <code>( .+? )</code> is just a bit cleaner than <code>(.+?)</code>. Whether <code>( \( [0-9] +p \) )</code> or <code>( \( [0-9]+p \) )</code> or even <code>( \([0-9]+p\) )</code> is easier to understand is up to you.</p>
<p>And, yes, the answer looks very much like <a href="https://stackoverflow.com/a/10397739/368630">Sinan's</a> answer.</p>
<p>By the way, as Sinan showed, the non-greedy quantifier is able to parse <code>a b c d e [1234] (1080p).mov</code>, while the <em>everything that doesn't include a space</em> sub-expression can't, because the first field itself contains spaces. That's why I said they're not the same.</p>
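<p>To make the difference concrete, here is a minimal Perl sketch that runs the greedy, negated-class, and non-greedy patterns from this answer against a few sample strings. The helper <code>first_capture</code>, the hash <code>%name_regex</code>, and the sample file name <code>Title [1959] (430p).avi</code> are made up for illustration and are not part of the original answer. It assumes Perl 5.10 or later for the <code>//</code> operator.</p>

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): return what the
# first capture group matched, or undef when the pattern does not match.
sub first_capture {
    my ($regex, $string) = @_;
    return $string =~ $regex ? $1 : undef;
}

# The three first-name patterns discussed in the answer.
my %name_regex = (
    greedy     => qr/^(.*) .*$/,     # .* runs up to the LAST space
    negated    => qr/^([^ ]+) .*$/,  # [^ ]+ can never cross a space
    non_greedy => qr/^(.+?) .*$/,    # .+? takes as little as it can
);

for my $name ('Bob Baker', 'Bob Barry Baker') {
    for my $kind (qw(greedy negated non_greedy)) {
        my $got = first_capture($name_regex{$kind}, $name) // '(no match)';
        print "$name | $kind | $got\n";
    }
}

# The two file-name patterns, copied from the answer.
my $negated_title    = qr/^([^ ]+)( )(\[\d{4}\])( )(\(\d+p\))(\.)([^.]+)/;
my $non_greedy_title = qr/^(.+?)( )(\[\d{4}\])( )(\(\d+p\))(\.)(.*)/;

# The first file name is an invented example; the second is the one from the answer.
for my $file ('Title [1959] (430p).avi', 'a b c d e [1234] (1080p).mov') {
    my $negated    = first_capture($negated_title,    $file) // '(no match)';
    my $non_greedy = first_capture($non_greedy_title, $file) // '(no match)';
    print "$file | negated: $negated | non-greedy: $non_greedy\n";
}
```

<p>For the one-word title both file-name patterns capture the same first field. For the multi-word title only the non-greedy pattern matches, which is the non-equivalence pointed out above.</p>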
