POA search Python script faster than the C equivalent

For my research I have to deal with huge text files (about 10 GB) of biological sequences in FASTA format, and more precisely I have to extract the sequences that carry specific IDs into separate files. A FASTA record looks something like this:

```
>id|id_number (e.g. 102574)|stuff
ATGCGAT.... ATGTC.. (multiple lines)
```

So I wrote a script that searches chunks of these big files, in order to parallelize the search over my 8 CPUs with Python's `multiprocessing` library.

The function that I hand to my multiprocessing class is the following:

```python
idlist=inP[0]   # list of good ids
filpath=inP[1]  # chunk of the big file
idproc=inP[2]   # id of the process
#######################
fil=filpath.split('\n')
del filpath
f=open('seqwithid{0}'.format(idproc),'w')

def lineiter():
    for line in fil:
        yield line

it=lineiter()
line=it.next()
while 1:
    try:
        # take the field between the first two '|' and trim at 'locus'/'ref'
        ids=line.split('|')[1].split('locus')[0].partition('ref')[0]
        #print ids
        while ids[0].isalpha():
            ids=ids[1:]
    except Exception:
        pass
    else:
        if ids in idlist:
            f.write(line+'\n')
            # copy the sequence lines until the next '>' header
            while 1:
                try:
                    line=it.next()
                except Exception:
                    break
                if line and line[0]!='>':
                    f.write(line+'\n')
                else:
                    break
    try:
        line=it.next()
    except Exception:
        break
    # advance to the next '>' header
    while not line or line[0]!='>':
        try:
            line=it.next()
        except Exception:
            break
f.close()
```

To improve the speed I rewrote this piece of code in C with four functions.

I cut the file into chunks:

```c
f1=fopen(adr, "r");
if (f1==0){printf("wrong sequences file: %s\n",adr);exit(1);}
fstream = (char *) malloc((end-begin)*sizeof(char) );
fseek(f1,begin,SEEK_CUR);
fread(fstream,sizeof(char)*(end-begin-1),1,f1);
/* keep reading past `end` up to the next record marker `ter` */
adrtampon=fgetc(f1);
while (!(feof(f1)) && adrtampon!=ter)
{
    sprintf(fstream,"%s%c",fstream,adrtampon);
    adrtampon=fgetc(f1);
}
fclose(f1);
```

Then I run through the chunk with a main function until I find a '>' character:

```c
adrtampon=fstream[0];
i=0;
while(adrtampon!='\0' )
{
    adrtampon=fstream[i];
    if (adrtampon==ter)                       /* ter marks the start of a header ('>') */
    {
        sprintf(id,"%s",seekid((fstream+i))); /* extract the id from the header (not shown) */
        if (checkidlist(id,tab,size)==0)      /* 0 means the id is in the wanted list (not shown) */
        {
            i++;
            fputc('>',f2);
            adrtampon=fstream[i];
            /* copy the record up to the next header or the end of the chunk */
            while (adrtampon!='\0' && adrtampon!=ter)
            {
                fputc(adrtampon,f2);
                i++;
                adrtampon=fstream[i];
            }
            i--;
        }
    }
    i++;
}
```

When I find a '>' I first extract the id of the sequence between the two '|', and then I loop over my library of interesting ids with another simple function (similar to the `if ids in idlist` test). These C functions are then called from a Python function, still using the multiprocessing class.

And finally... I get worse performance with the C code than with the Python code, even with a single process. (When I work directly on the file rather than on chunks, I get better performance with C, but only with one process, because of the concurrent access to the file by the multiple processes, I think.)

Any suggestions to improve my C code, or to explain why it is slower than the Python equivalent? Thanks a lot! (Especially if you have read this far!)

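One detail worth flagging in the chunk-reading snippet: `sprintf(fstream,"%s%c",fstream,adrtampon)` re-scans the entire buffer for every character read past `end` (and passing the destination buffer as its own `%s` source is undefined behaviour). Below is a minimal sketch of the same step done with an explicit write index instead. The meanings of `adr`, `begin`, `end` and `ter` are assumed to match the snippet above, and `EXTRA` (the head-room for the record that straddles `end`) is an arbitrary assumption, not something from the original code.

```c
/* Sketch: read the byte range [begin, end) plus the overshoot up to the next
   record marker, appending through a write index rather than re-formatting
   the whole buffer for every character. */
#include <stdio.h>
#include <stdlib.h>

#define EXTRA (1 << 20)   /* assumed upper bound on how far past `end` a record can run */

char *read_chunk(const char *adr, long begin, long end, int ter)
{
    FILE *f1 = fopen(adr, "r");
    if (f1 == NULL) { printf("wrong sequences file: %s\n", adr); exit(1); }

    char *fstream = malloc((size_t)(end - begin) + EXTRA + 1);
    if (fstream == NULL) { fclose(f1); exit(1); }

    fseek(f1, begin, SEEK_SET);
    size_t n = fread(fstream, 1, (size_t)(end - begin), f1);

    /* keep reading past `end` until the next record marker */
    int c = fgetc(f1);
    while (c != EOF && c != ter && n < (size_t)(end - begin) + EXTRA)
    {
        fstream[n++] = (char)c;
        c = fgetc(f1);
    }

    fstream[n] = '\0';
    fclose(f1);
    return fstream;
}
```
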
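The scanning loop also calls a helper `seekid()` that is not shown in the question. For illustration only, here is a hypothetical sketch of what an id-extraction helper of that shape might look like; the name `extract_id`, the static buffer, and the "text between the first two `|`" behaviour are assumptions based on the description, not the author's actual function.

```c
/* Hypothetical sketch in the spirit of seekid(): copy the text between the
   first two '|' after the '>' marker into a static buffer. */
#include <string.h>

static char idbuf[256];

const char *extract_id(const char *header)       /* header points at the '>' line */
{
    const char *first = strchr(header, '|');     /* first '|'  */
    if (first == NULL)
        return "";
    const char *second = strchr(first + 1, '|'); /* second '|' */
    if (second == NULL)
        return "";
    size_t len = (size_t)(second - first - 1);
    if (len >= sizeof idbuf)
        len = sizeof idbuf - 1;
    memcpy(idbuf, first + 1, len);               /* copy the id between the bars */
    idbuf[len] = '\0';
    return idbuf;
}
```
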
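Likewise, `checkidlist(id,tab,size)` is only described as a simple loop over the list of interesting ids, "similar to the `if ids in idlist` test". A hypothetical sketch under that assumption (signature assumed; it returns 0 on a match, to agree with the `== 0` test in the scanning loop above):

```c
/* Hypothetical sketch in the spirit of checkidlist(): linear scan over the
   `size` wanted ids stored in `tab`. */
#include <string.h>

int check_id_list(const char *id, char **tab, int size)
{
    for (int i = 0; i < size; i++)   /* linear scan, like `ids in idlist` on a Python list */
        if (strcmp(id, tab[i]) == 0)
            return 0;                /* found: 0 means "keep this record" */
    return 1;                        /* not found */
}
```

Note that both this and the Python `if ids in idlist` test are linear scans performed once per record.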