Cut specific columns from several files and reshape using unix tools
<p>I have several hundred files in a folder. Each file is a tab-delimited text file containing more than a million rows and 27 columns. From each file, I want to extract only specific columns (say columns 1, 2, 11, 12 and 13). Columns 3:10 &amp; 14:27 can be ignored. I want to do this for all files in the folder (say 2300 files). The columns in each of the 2300 files look like this:</p>
<pre><code>Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567890_A rs758676 - - 1 T T - ....col27
1234567890_A rs3916934 - - 1 T T - ....col27
1234567890_A rs2711935 - - 1 T C - ....col27
1234567890_A rs17126880 - - 1 - - - ....col27
1234567890_A rs12831433 - - 1 T T - ....col27
1234567890_A rs12797197 - - 1 T C - ....col27</code></pre>
<p>The cut columns from the 2nd file may look like this:</p>
<pre><code>Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567899_C rs758676 - - 100 T A - ....col27
1234567899_C rs3916934 - - 100 T T - ....col27
1234567899_C rs2711935 - - 100 T C - ....col27
1234567899_C rs17126880 - - 100 C G - ....col27
1234567899_C rs12831433 - - 100 T T - ....col27
1234567899_C rs12797197 - - 100 T C - ....col27</code></pre>
<p>The cut columns from the 3rd file may look like this:</p>
<pre><code>Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567999_F rs758676 - - 256 A A - ....col27
1234567999_F rs3916934 - - 256 T T - ....col27
1234567999_F rs2711935 - - 256 T C - ....col27
1234567999_F rs17126880 - - 256 C G - ....col27
1234567999_F rs12831433 - - 256 T T - ....col27
1234567999_F rs12797197 - - 256 C C - ....col27</code></pre>
<p>The widths of the <code>Sample.ID</code> and <code>Sample.Index</code> values are the same within each file but can differ between files. The value of <code>Sample.ID</code> is the same within each file but different between files.
Every cut file has the same values in the <code>SNP.Name</code> column. The <code>Sample.Index</code> value may occasionally be the same in different files. The values of the other two columns <code>(Allele1...Forward &amp; Allele2...Forward)</code> may change, and should be pasted together, separated by " ", under each <code>SNP.Name</code> for each <code>Sample.ID</code>.</p>
<p>Finally, I want to merge (tab-delimited) all the cut columns from the 2300 files into this format:</p>
<pre><code>Sample.Index Sample.ID rs758676 rs3916934 rs2711935 rs17126880 rs12831433 rs12797197
1 1234567890_A T T T T T C 0 0 T T T C
200 1234567899_C T A T T T C C G T T T C
256 1234567999_F A A T T T C C G T T C C</code></pre>
<p>In simple terms, I want to convert a long format into a wide format keyed on the <code>Sample.ID</code> column. This is similar to the <code>reshape</code> function in R. I tried this in R, but it is really slow and runs out of memory. Can anyone help with unix tools?</p>
<p>When reshape.sh was applied to 20 files, it produced a spurious "Sample" line in the output. The first 4 fields are shown here.</p>
<pre><code>Sample.Index Sample.ID rs476542 rs7073746
1234567891_A 11 C C A G
1234567892_A 191 T C A G
1234567893_A 204 T C G G
1234567894_A 15 T C A G
1234567895_A 158 T T A A
1234567896_A 208 T C A A
1234567897_A 111 T T G G
1234567898_A 137 T C G G
1234567899_A 216 T C A G
1234567900_A 113 T C G G
1234567901_A 152 T C A G
1234567902_A 178 C C A A
1234567903_A 135 C C A A
1234567904_A 125 T C A A
1234567905_A 194 C C A A
1234567906_A 110 C C G G
1234567907_A 126 C C A A
Sample -
1234567908_A 169 C C G G
1234567909_A 173 C C G G
1234567910_A 168 T C A A</code></pre>
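<p>For illustration, the long-to-wide conversion described above could be sketched as follows. This is a minimal Python sketch rather than a pure unix-tools pipeline, and it rests on several assumptions that are not stated in the question: every file has exactly one header line, the wanted fields sit at columns 1, 2, 11, 12 and 13, the SNP order of the first file defines the output columns, and a <code>-</code> allele is written as <code>0</code> (as the desired output suggests). The names <code>reshape_file</code> and <code>reshape</code> are hypothetical. Only one file is held in memory at a time, and lines with fewer than 13 fields are skipped.</p>

```python
import sys

# 0-based positions of columns 1, 2, 11, 12 and 13 (assumed from the question)
ID_COL, SNP_COL, INDEX_COL, A1_COL, A2_COL = 0, 1, 10, 11, 12


def reshape_file(path, missing="0"):
    """Read one long-format file; return (sample_index, sample_id, {snp: "a1 a2"})."""
    sample_id = sample_index = None
    genotypes = {}
    with open(path) as handle:
        next(handle, None)  # skip the single header line (assumption)
        for line in handle:
            row = line.rstrip("\n").split("\t")
            if len(row) <= A2_COL:
                continue  # skip short or malformed lines
            if sample_id is None:
                # Sample.ID and Sample.Index are constant within a file
                sample_id, sample_index = row[ID_COL], row[INDEX_COL]
            a1, a2 = (missing if a == "-" else a
                      for a in (row[A1_COL], row[A2_COL]))
            genotypes[row[SNP_COL]] = f"{a1} {a2}"
    return sample_index, sample_id, genotypes


def reshape(paths, out=sys.stdout, missing="0"):
    """Write one tab-delimited wide row per input file."""
    snp_order = None
    for path in paths:
        sample_index, sample_id, genotypes = reshape_file(path, missing)
        if sample_id is None:
            continue  # file had no usable data rows
        if snp_order is None:
            # SNP order is taken from the first file (assumption)
            snp_order = list(genotypes)
            header = ["Sample.Index", "Sample.ID"] + snp_order
            out.write("\t".join(header) + "\n")
        # a SNP absent from this file is written as a missing genotype
        pair = f"{missing} {missing}"
        cells = [genotypes.get(snp, pair) for snp in snp_order]
        out.write("\t".join([sample_index, sample_id] + cells) + "\n")
```

<p>A call such as <code>reshape(sorted(glob.glob("*.txt")))</code> (after <code>import glob</code>, and with a file pattern that is only a guess) would print the merged table to standard output, one row per input file.</p>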