Awk script to loop and perform mathematical operations

I have a bash-and-awk script that I use to extract data from a text file. However, it is too slow with large datasets and doesn't work perfectly. I believe all of my bash loops can be replaced by a single awk command, and I'm asking for help with this.

```
cat dummy_list
AAA
AAAA
AAAAA

cat dummy_table
13 19 AAA 69 96 "ID-999" 34
23 42 AAA 12 19 "ID-999" 64
53 79 AAA 43 58 "ID-482" 36
13 43 AAA 12 15 "ID-492" 75
23 90 AAA 45 87 "ID-492" 34
12 41 AAAA 76 79 "ID-923" 23
19 58 AAAA 15 87 "ID-923" 75
10 40 AAAA 18 82 "ID-482" 23
11 18 AAAA 18 82 "ID-482" 52
15 19 AAAA 18 82 "ID-482" 62
59 69 AAAA 10 18 "ID-482" 83
78 89 AAAA 32 41 "ID-983" 24
23 53 AAAAA 78 99 "ID-916" 82
```

What I want from this table:

1. For every dummy_list item (`AAA`, `AAAA`, or `AAAAA`), count how many different times each ID range is mentioned (by this I mean unique combinations of columns 4+5+6, like `69 96 "ID-999"`). There are duplicate ranges (like `18 82 "ID-482"`) and I have to discard them.

   My script looks like this:

   ```
   while read a; do
       awk -v VAR="$a" '($3==VAR) {print $4"\t"$5"\t"$6}' dummy_table |
           sort -u | cut -f 3 | sort | uniq -c | awk '{print $1}' |
           tr '\n' ' ' | awk -v VAR="$a" '{print VAR"\t"$0}'
   done < dummy_list

   AAA    1 2 2
   AAAA   2 2 1
   AAAAA  1
   ```

   It's the same as saying `AAA: "ID-482" mentioned once; "ID-492" mentioned twice; "ID-999" mentioned twice`.

   This is the output I want.

2. For every dummy_list item, get the average number of times it is mentioned with the same ID. For example, `AAA` occurs twice with `"ID-999"`, once with `"ID-482"`, and twice with `"ID-492"`, so it's (2+1+2)/3 = 1.66.

   My script looks like this:

   ```
   while read a; do
       ID_TIMES=$(awk -v VAR="$a" '($3==VAR) {print $6}' dummy_table | sort -u | wc -l) &&
       awk -v VAR="$a" '($3==VAR) {print $6}' dummy_table |
           sort | uniq -c | awk -v VAR="$ID_TIMES" '{sum+=$1} END {print sum/VAR}'
   done < dummy_list

   AAA    1.666
   AAAA   2.333
   AAAAA  1
   ```

3. For every dummy_list item, extract the ID ranges and calculate the proportion between columns. For example, for `AAA`'s `"ID-999"`:

   ```
   RANGE1 = sum of $5-$4: (96-69) + (19-12)
   RANGE2 = sum of $7:    34 + 64
   then RANGE2*100/RANGE1 = 288
   ```

   For output like this:

   ```
   AAA    288 240 242
   ....
   AAAAA  390
   ```

   I wasn't able to write such a script myself, as I got stuck with the two variables $RANGE1 and $RANGE2. If possible, it would be great to discard duplicate ranges like `18 82 "ID-482"` in this step as well.

I believe that all three of these operations can be calculated with a single `awk` command, and I feel desperate about my scripts. I really hope that someone can help me with this.
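
For reference, here is a minimal sketch of step 1 as a single pass over both files. It assumes GNU awk (it relies on arrays of arrays and `PROCINFO["sorted_in"]`, which plain POSIX awk does not have), and it prints the IDs for each item in sorted order, which may differ from the column order in the example above.

```
# Step 1: per item, count the unique (col4, col5, col6) ranges for each ID
gawk '
NR == FNR { items[$1]; next }                 # first file: dummy_list
($3 in items) && !dup[$3, $4, $5, $6]++ {     # skip duplicate ranges
    count[$3][$6]++                           # unique ranges per item and ID
}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"    # deterministic output order
    for (item in count) {
        line = item
        for (id in count[item]) line = line " " count[item][id]
        print line
    }
}' dummy_list dummy_table
```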
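
Under the same GNU awk assumption, step 2 (total mentions divided by the number of distinct IDs) could look like the sketch below; note that `printf "%.3f"` rounds, so it prints 1.667 rather than the truncated 1.666 shown in the example.

```
# Step 2: per item, raw mention count divided by the number of distinct IDs
gawk '
NR == FNR { items[$1]; next }
$3 in items {
    mentions[$3]++                     # every matching row counts here
    if (!seen[$3, $6]++) nids[$3]++    # distinct IDs per item
}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (item in mentions) printf "%s\t%.3f\n", item, mentions[item] / nids[item]
}' dummy_list dummy_table
```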
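
And a sketch of step 3, again assuming GNU awk, with duplicate ranges discarded as requested. The per-ID ratios come out in sorted ID order, so the columns may be arranged differently than in the example output.

```
# Step 3: per item and ID, 100 * sum($7) / sum($5 - $4), duplicates dropped
gawk '
NR == FNR { items[$1]; next }
($3 in items) && !dup[$3, $4, $5, $6]++ {
    range[$3][$6] += $5 - $4           # RANGE1: width of the ID range
    hits[$3][$6]  += $7                # RANGE2: sum of column 7
}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (item in range) {
        line = item
        for (id in range[item])
            line = line " " int(hits[item][id] * 100 / range[item][id])
        print line
    }
}' dummy_list dummy_table
```

Since all three sketches share the same duplicate-detection rule, they could in principle be merged into one awk program that accumulates all the arrays in a single pass and prints the three reports from one END block.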