awk - how to extract a pattern
<p>I am asking for instructions on using awk to extract text blocks with specific rows from a file.</p>
<p>The file has the following structure:</p>
<pre><code>&lt;Information&gt;
&lt;CID&gt;_whole_number_A_&lt;/CID&gt;
&lt;string&gt;_text_that_is_not_useful_&lt;/string&gt;
&lt;string&gt;_text_that_is_not_useful_&lt;/string&gt;
&lt;string&gt;_PATTERN_A_&lt;/string&gt;
&lt;string&gt;_text_that_is_not_useful_&lt;/string&gt;
&lt;/Information&gt;
&lt;Information&gt;
&lt;CID&gt;_whole_number_B_&lt;/CID&gt;
&lt;string&gt;_PATTERN_B_&lt;/string&gt;
&lt;string&gt;_text_that_is_not_useful_&lt;/string&gt;
&lt;string&gt;_text_that_is_not_useful_&lt;/string&gt;
&lt;string&gt;_text_that_is_not_useful_&lt;/string&gt;
&lt;string&gt;_text_that_is_not_useful_&lt;/string&gt;
&lt;string&gt;_text_that_is_not_useful_&lt;/string&gt;
&lt;string&gt;_text_that_is_not_useful_&lt;/string&gt;
&lt;/Information&gt;
</code></pre>
<p>I would like to use awk to send the following pattern to a new file.</p>
<pre><code>&lt;Information&gt;
&lt;CID&gt;_whole_number_A_&lt;/CID&gt;
&lt;string&gt;_PATTERN_A_&lt;/string&gt;
&lt;/Information&gt;
&lt;Information&gt;
&lt;CID&gt;_whole_number_B_&lt;/CID&gt;
&lt;string&gt;_PATTERN_B_&lt;/string&gt;
&lt;/Information&gt;
</code></pre>
<p>Notes about the data:</p>
<ul>
<li>The file has 300,000+ CID items, each identified by a unique whole number.</li>
<li>The PATTERNs (_PATTERN_A_, _PATTERN_B_, etc.) have the format UNII-&lt;10 characters&gt;. For example: UNII-4J4Z8788N8 or UNII-12L95QD6KV.</li>
<li>Not every CID has a UNII.</li>
</ul>
<p>Notes about my environment:</p>
<ul>
<li>I am working under Windows 7 and using the GnuWin32 utilities.</li>
</ul>
<p>So, rephrasing in English:</p>
<blockquote>
<p>in FILE_1</p>
<p>find every CID that has a UNII</p>
<p>send the filtered results to FILE_2</p>
</blockquote>
<p>Thanks in advance for instructions.</p>
<p>========================================================================</p>
<p>OK, I'm doing something wrong.</p>
<p>In my first implementation, the program only returns the "record starts" and "closing tag" lines, i.e.:</p>
<pre><code>&lt;Information&gt;
&lt;/Information&gt;
</code></pre>
<p>Here is how I applied your instructions.</p>
<p>First, I'm running Windows, so I changed FS to "\r\n".</p>
<p>The first regular expression is UNII, so I changed it to /UNII/.</p>
<p>The second regular expression is CID, which you used in your instructions. I made no change there.</p>
<p>For the second instance of PATTERN, I changed it to /UNII/.</p>
<p>Here is how my substitutions look:</p>
<pre><code>BEGIN {
    RS="&lt;Information&gt;"
    FS="\r\n"
}
/UNII/ {
    print RS
    for (i=1;i&lt;NF;i++) {
        if ($i ~ /CID/ || $i ~ /UNII/) {
            print $i
        }
    }
    print "&lt;/Information&gt;"
}
</code></pre>
<p>Because I am using Windows, I use a full path to execute the GnuWin32 utilities and to read and write data. So my .bat file looks like this:</p>
<pre><code>C:\bin\awk -f C:\bin\script.awk &lt; C:\Users\Owner\data\input_file.txt &gt; C:\Users\Owner\data\output_file.txt
</code></pre>
<p>What am I doing wrong?</p>
<p>=================================================================================</p>
<p>Here is sample data:</p>
<pre><code>&lt;Information&gt;
&lt;CID&gt;1&lt;/CID&gt;
&lt;Synonym&gt;Acetyl carnitine&lt;/Synonym&gt;
&lt;Synonym&gt;O-Acetyl-L-carnitine&lt;/Synonym&gt;
&lt;Synonym&gt;Ammonium, (3-carboxy-2-hydroxypropyl)trimethyl-, hydroxide, inner salt, acetate, DL-&lt;/Synonym&gt;
&lt;Synonym&gt;UNII-07OP6H4V4A&lt;/Synonym&gt;
&lt;Synonym&gt;_20+_more_&lt;/Synonym&gt;
&lt;/Information&gt;
&lt;Information&gt;
&lt;CID&gt;10006&lt;/CID&gt;
&lt;Synonym&gt;HYDANTOIN&lt;/Synonym&gt;
&lt;Synonym&gt;UNII-I6208298TA&lt;/Synonym&gt;
&lt;Synonym&gt;53760_FLUKA&lt;/Synonym&gt;
&lt;Synonym&gt;NSC9226&lt;/Synonym&gt;
&lt;Synonym&gt;_20+_more_&lt;/Synonym&gt;
&lt;/Information&gt;
&lt;Information&gt;
&lt;CID&gt;10007&lt;/CID&gt;
&lt;Synonym&gt;Lucofen SA&lt;/Synonym&gt;
&lt;Synonym&gt;461-78-9&lt;/Synonym&gt;
&lt;Synonym&gt;EINECS 207-314-9&lt;/Synonym&gt;
&lt;Synonym&gt;STK664067&lt;/Synonym&gt;
&lt;Synonym&gt;DEA No. 1645&lt;/Synonym&gt;
&lt;Synonym&gt;UNII-NHW07912O7&lt;/Synonym&gt;
&lt;Synonym&gt;CHEMBL1201269&lt;/Synonym&gt;
&lt;Synonym&gt;HMS1376E21&lt;/Synonym&gt;
&lt;Synonym&gt;_20+_more_&lt;/Synonym&gt;
&lt;/Information&gt;
</code></pre>
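<p>One plausible reason the attempt above prints only the wrapper tags: if awk never sees a literal "\r\n" in the input (for example because the file uses plain "\n" line endings, or the carriage returns are stripped on read), FS never matches, NF is 1, and the loop body never runs. The loop condition i&lt;NF would also skip the last field. A line-oriented sketch that avoids RS and FS altogether might look like the following. It assumes each element sits on its own line, as in the sample data; the function name filter_unii is hypothetical and exists only for illustration.</p>

```shell
# filter_unii: hypothetical helper that keeps only records containing a UNII.
# Assumption: every tag (<Information>, <CID>, <Synonym>...) is on its own line.
filter_unii() {
  awk '
    { sub(/\r$/, "") }                            # drop a trailing CR so CRLF and LF files both work
    /<Information>/ { cid = ""; unii = ""; next } # record starts: reset what we have collected
    /<CID>/         { cid = $0; next }            # remember the CID line of this record
    /UNII-/         { if (unii == "") unii = $0   # remember the first UNII line of this record
                      next }
    /<\/Information>/ {                           # record ends: print it only if a UNII was seen
      if (unii != "") {
        print "<Information>"
        print cid
        print unii
        print "</Information>"
      }
    }
  ' "$@"
}
```

<p>In a POSIX shell this could be called as filter_unii input_file.txt &gt; output_file.txt. Under cmd.exe single quotes do not quote, so the program text between the quotes would more likely be saved as script.awk and run with the same -f form already used in the .bat file.</p>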