Note that there are some explanatory texts on larger screens.

plurals
  1. POMatching a Repeating Sub Series using a Regular Expression with PowerShell
    text
    copied!<p>I have a text file that lists the names of a large number of Excel spreadsheets, and the names of the files that are linked to from the spreadsheets.</p> <p>In simplified form it looks like this:</p> <pre><code>"Parent File1.xls" Link: ChildFileA.xls Link: ChildFileB.xls "ParentFile2.xls" "ParentFile3.xls" Blah Link: ChildFileC.xls Link: ChildFileD.xls More Junk Link: ChildFileE.xls "Parent File4.xls" Link: ChildFileF.xls </code></pre> <p>In this example, ParentFile1.xls has embedded links to ChildFileA.xls and ChildFileB.xls, ParentFile2.xls has no embedded links, and ParentFile3.xls has 3 embedded links.</p> <p>I am trying to write a regular expression in PowerShell that will parse the text file producing output in the following form:</p> <pre><code>ParentFile1.xls:ChildFileA.xls,ChildFileB.xls ParentFile3.xls:ChildFileC.xls,ChildFileD.xls,ChildFileE.xls etc </code></pre> <p>The task is complicated by the fact that the text file contains a lot of junk between each of the lines, and a parent may not always have a child. Furthermore, a single file name may pass over multiple lines. However, it's not as bad as it sounds, as the parent and child file names are always clearly demarcated (the parent with quotes and the child with a prefix of Link: ).</p> <p>The PowerShell code I've been using is as follows:</p> <pre><code>$content = [string]::Join([environment]::NewLine, (Get-Content C:\Temp\text.txt)) $regex = [regex]'(?im)\s*\"(.*)\r?\n?\s*(.*)\"[\s\S]*?Link: (.*)\r?\n?' $regex.Matches($content) | %{$_.Groups[1].Value + $_.Groups[2].Value + ":" + $_.Groups[3].Value} </code></pre> <p>Using the example above, it outputs:</p> <pre><code>ParentFile1.xls:ChildFileA.xls ParentFile2.xls""ParentFile3.xls:ChildFileC.xls ParentFile4.xls:ChildFileF.xls </code></pre> <p>There are two issues. Firstly, the inclusion of the "" instead of a newline whenever a Parent without a Child is processed. And the second issue, which is the most important, is that only a single child is ever shown for each parent. I'm guessing I need to somehow recursively capture and display the multiple child links that exist for each parent, but I'm totally stumped as to how to do this with a regular expression.</p> <p>Amy help would be greatly appreciated. The file contains 100's of thousands of lines, and manual processing is not an option :)</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload