Need help in improving the speed of my code for duplicate columns removal in Python
<p>I have written a script that takes a text file as input and prints only the variants which occur more than once. By variants I mean chr positions in the text file.</p>

<p>The input file looks like this:</p>

<blockquote>
<p>chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406<br>
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407<br>
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407<br>
chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254</p>
</blockquote>

<p>As you can see, rows 2 and 3 repeat. I am only taking the first three columns and checking whether they are the same. Here, chr1 1049083 1049083 appears in both row 2 and row 3, so I print that there is one duplicate and its position.</p>

<p>I have written the code below. Although it does what I want, it is quite slow: it takes about 5 minutes to run on a file which has 700,000 rows. I wanted to know if there is a way to speed things up.</p>

<p>Thanks!</p>

<pre><code>#!/usr/bin/env python
"""
Takes an input file and prints only the variants that occur more than once.
"""

import shlex
import collections

rows = open('variants.txt', 'r').read().split("\n")

# removing the header (the first row) and storing it in a new variable
header = rows.pop(0)

indices = []
for row in rows:
    var = shlex.split(row)
    indices.append("_".join(var[0:3]))

dup_list = []
ind_tuple = collections.Counter(indices).items()

for x, y in ind_tuple:
    if y &gt; 1:
        dup_list.append(x)

print dup_list
print len(dup_list)
</code></pre>

<p>Note: in this case the entire row 2 is a duplicate of row 3, but that is not necessarily always the case. Duplicates of chr positions (the first three columns) are what I am looking for.</p>

<p>EDIT: I edited the code as per damienfrancois's suggestion. Below is my new code:</p>

<pre><code>import shlex

f = open('variants.txt', 'r')

indices = {}
for line in f:
    row = line.rstrip()
    var = shlex.split(row)
    index = "_".join(var[0:3])
    if indices.has_key(index):
        indices[index] = indices[index] + 1
    else:
        indices[index] = 1

dup_pos = 0
for key, value in indices.items():
    if value &gt; 1:
        dup_pos = dup_pos + 1

print dup_pos
</code></pre>

<p>I used <code>time</code> to see how long both versions take.</p>

<p>My original code:</p>

<pre><code>time run remove_dup.py
14428
CPU times: user 181.75 s, sys: 2.46 s, total: 184.20 s
Wall time: 209.31 s
</code></pre>

<p>Code after modification:</p>

<pre><code>time run remove_dup2.py
14428
CPU times: user 177.99 s, sys: 2.17 s, total: 180.16 s
Wall time: 222.76 s
</code></pre>

<p>I don't see any significant improvement in the time.</p>
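Most of the runtime in both versions is likely spent in `shlex.split`, which does full shell-style tokenizing on every line; for plain whitespace-separated columns, `str.split` does the same job far faster. A minimal sketch of the same counting logic using `str.split` and `collections.Counter` (the function name `count_duplicate_positions` is mine, not from the question):

```python
import collections

def count_duplicate_positions(path):
    """Return the chr-position keys (first three columns) seen more than once."""
    counts = collections.Counter()
    with open(path) as f:
        for line in f:
            fields = line.split()  # plain whitespace split, much faster than shlex.split
            if len(fields) >= 3:   # skip blank/short lines instead of crashing
                counts["_".join(fields[:3])] += 1
    return [key for key, n in counts.items() if n > 1]
```

On the four sample rows above this returns `['chr1_1049083_1049083']`, matching the one duplicate position in the example.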