
grep large files with multiple patterns and count occurrences
    <p>I have a large number of files, each with millions of lines, and a two-column file called contaminant_list.txt. For every file I need the number of occurrences of each pattern, with the name of the pattern on the left and its count on the right.</p>
    <p>This command works fine:</p>
<pre><code>while read line; do
  name=$(echo $line | cut -f1 -d' ')
  seq=$(echo $line | cut -f2 -d' ')
  echo -n $(date) $name "looking for $seq" &gt;&gt;adapt_contamination_log
  egrep $seq $name_of_file | wc -l &gt;&gt;adapt_contamination_log
done &lt;contaminant_list.txt
</code></pre>
    <p>and yields:</p>
<pre><code>Thu Sep 19 23:04:38 EDT 2013 &gt;PrefixAdapter4/1 looking for GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG 0
Thu Sep 19 23:05:55 EDT 2013 &gt;PrefixAdapter4/2 looking for CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC 0
Thu Sep 19 23:07:09 EDT 2013 &gt;PrefixAdapter16/1 looking for GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGTCCCGATCTCGTATGCCGTCTTCTGCTTG 2611
</code></pre>
    <p>...and so on (the number of patterns I am matching is quite large). The important point is that the pattern GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGTCCCGATCTCGTATGCCGTCTTCTGCTTG is in my large file 2611 times.</p>
    <p>However, it is very slow, because the large file is read once per pattern. Is there any way to match all patterns at the same time so that the file is read just once?</p>
    <p>Here is what contaminant_list.txt looks like:</p>
<pre><code>TruSeqAdapter,Index12 GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG
IlluminaRNARTPrimer GCCTTGGCACCCGAGAATTCCA
IlluminaRNAPCRPrimer AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGA
RNAPCRPrimer,Index1 CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
RNAPCRPrimer,Index2 CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
RNAPCRPrimer,Index3 CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
</code></pre>
    <p>I was thinking about writing a Perl script with a hash; however, in my experience built-in bash solutions always work better. Do you have any ideas how to do this, please?</p>
    <p>Thanks.</p>
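One possible single-pass approach is sketched below. It is not taken from the post: the function name count_contaminants is hypothetical, and the sketch assumes that contaminant_list.txt is whitespace-separated (name, then sequence) and that every sequence can be treated as a fixed string rather than a regular expression. It uses awk to load all patterns first and then read the data file once, counting the lines that contain each pattern, which mirrors what `egrep ... | wc -l` counts in the loop above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): count, in a single pass over
# the data file, how many lines contain each pattern from a two-column list.
# Assumption: column 1 is the pattern name, column 2 is a fixed-string sequence.
# Usage: count_contaminants contaminant_list.txt "$name_of_file" >> adapt_contamination_log
count_contaminants() {
    local list=$1 data=$2
    awk '
        # First file (the pattern list): store names and sequences in input order.
        FILENAME == ARGV[1] {
            if (NF >= 2) { n++; name[n] = $1; seq[n] = $2; hits[n] = 0 }
            next
        }
        # Second file (the data): test every stored pattern against the current line.
        {
            for (i = 1; i <= n; i++)
                if (index($0, seq[i]) > 0) hits[i]++
        }
        # Print "name count" for every pattern, including those with zero hits.
        END { for (i = 1; i <= n; i++) print name[i], hits[i] }
    ' "$list" "$data"
}
```

Each data line is still compared against every pattern, so the work per line grows with the size of the list; the saving is that the large file is read only once instead of once per pattern.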