How to extract the first hit elements from an XML NCBI BLAST file?
<p>I'm trying to extract only the first hit from an NCBI BLAST XML file. Next, I would like to get only the first HSP. Eventually, I would like to select these based on best score. To make things clear, here is a sample of the XML file:</p>
<pre><code>&lt;?xml version="1.0"?&gt;
&lt;!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd"&gt;
&lt;BlastOutput&gt;
  &lt;BlastOutput_program&gt;blastx&lt;/BlastOutput_program&gt;
  &lt;BlastOutput_version&gt;blastx 2.2.22 [Sep-27-2009]&lt;/BlastOutput_version&gt;
  &lt;BlastOutput_reference&gt;~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~&amp;quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs&amp;quot;, Nucleic Acids Res. 25:3389-3402.&lt;/BlastOutput_reference&gt;
  &lt;BlastOutput_db&gt;/Applications/blast/db/viral1.protein.faa&lt;/BlastOutput_db&gt;
  &lt;BlastOutput_query-ID&gt;lcl|1_0&lt;/BlastOutput_query-ID&gt;
  &lt;BlastOutput_query-def&gt;DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev&lt;/BlastOutput_query-def&gt;
  &lt;BlastOutput_query-len&gt;1024&lt;/BlastOutput_query-len&gt;
  &lt;BlastOutput_param&gt;
    &lt;Parameters&gt;
      &lt;Parameters_matrix&gt;BLOSUM62&lt;/Parameters_matrix&gt;
      &lt;Parameters_expect&gt;1e-05&lt;/Parameters_expect&gt;
      &lt;Parameters_gap-open&gt;11&lt;/Parameters_gap-open&gt;
      &lt;Parameters_gap-extend&gt;1&lt;/Parameters_gap-extend&gt;
      &lt;Parameters_filter&gt;F&lt;/Parameters_filter&gt;
    &lt;/Parameters&gt;
  &lt;/BlastOutput_param&gt;
  &lt;BlastOutput_iterations&gt;
    &lt;Iteration&gt;
      &lt;Iteration_iter-num&gt;1&lt;/Iteration_iter-num&gt;
      &lt;Iteration_query-ID&gt;lcl|1_0&lt;/Iteration_query-ID&gt;
      &lt;Iteration_query-def&gt;DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev&lt;/Iteration_query-def&gt;
      &lt;Iteration_query-len&gt;1024&lt;/Iteration_query-len&gt;
      &lt;Iteration_stat&gt;
        &lt;Statistics&gt;
          &lt;Statistics_db-num&gt;68007&lt;/Statistics_db-num&gt;
          &lt;Statistics_db-len&gt;19518578&lt;/Statistics_db-len&gt;
          &lt;Statistics_hsp-len&gt;0&lt;/Statistics_hsp-len&gt;
          &lt;Statistics_eff-space&gt;0&lt;/Statistics_eff-space&gt;
          &lt;Statistics_kappa&gt;0.041&lt;/Statistics_kappa&gt;
          &lt;Statistics_lambda&gt;0.267&lt;/Statistics_lambda&gt;
          &lt;Statistics_entropy&gt;0.14&lt;/Statistics_entropy&gt;
        &lt;/Statistics&gt;
      &lt;/Iteration_stat&gt;
      &lt;Iteration_message&gt;No hits found&lt;/Iteration_message&gt;
    &lt;/Iteration&gt;
    &lt;Iteration&gt;
    &lt;Iteration&gt;
      &lt;Iteration_iter-num&gt;6&lt;/Iteration_iter-num&gt;
      &lt;Iteration_query-ID&gt;lcl|6_0&lt;/Iteration_query-ID&gt;
      &lt;Iteration_query-def&gt;DSAD-090629_plate11A05a.g1 CHROMAT_FILE: DSAD-090629_plate11A05a.g1 PHD_FILE: DSAD-090629_plate11A05a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A05a DIRECTION: rev&lt;/Iteration_query-def&gt;
      &lt;Iteration_query-len&gt;1068&lt;/Iteration_query-len&gt;
      &lt;Iteration_hits&gt;
        &lt;Hit&gt;
          &lt;Hit_num&gt;1&lt;/Hit_num&gt;
          &lt;Hit_id&gt;gnl|BL_ORD_ID|23609&lt;/Hit_id&gt;
          &lt;Hit_def&gt;gi|38707884|ref|NP_945016.1| Putative ribose-phosphate pyrophosphokinase [Enterobacteria phage Felix 01]&lt;/Hit_def&gt;
          &lt;Hit_accession&gt;23609&lt;/Hit_accession&gt;
          &lt;Hit_len&gt;293&lt;/Hit_len&gt;
          &lt;Hit_hsps&gt;
            &lt;Hsp&gt;
              &lt;Hsp_num&gt;1&lt;/Hsp_num&gt;
              &lt;Hsp_bit-score&gt;49.2914&lt;/Hsp_bit-score&gt;
              &lt;Hsp_score&gt;116&lt;/Hsp_score&gt;
              &lt;Hsp_evalue&gt;5.15408e-06&lt;/Hsp_evalue&gt;
              &lt;Hsp_query-from&gt;580&lt;/Hsp_query-from&gt;
              &lt;Hsp_query-to&gt;792&lt;/Hsp_query-to&gt;
              &lt;Hsp_hit-from&gt;202&lt;/Hsp_hit-from&gt;
              &lt;Hsp_hit-to&gt;273&lt;/Hsp_hit-to&gt;
              &lt;Hsp_query-frame&gt;-1&lt;/Hsp_query-frame&gt;
              &lt;Hsp_identity&gt;26&lt;/Hsp_identity&gt;
              &lt;Hsp_positive&gt;45&lt;/Hsp_positive&gt;
              &lt;Hsp_gaps&gt;2&lt;/Hsp_gaps&gt;
              &lt;Hsp_align-len&gt;73&lt;/Hsp_align-len&gt;
              &lt;Hsp_qseq&gt;MHIIGDVE--GRTCILVDDMVDTAGTLCHAAKALKERGAAKVYAYCTHPVLSGRAIENIENSVLDELVVTNTI&lt;/Hsp_qseq&gt;
              &lt;Hsp_hseq&gt;MRILDDVDLTDKTVMILDDICDGGRTFVEAAKHLREAGAKRVELYVTHGIFS-KDVENLLDNGIDHIYTTNSL&lt;/Hsp_hseq&gt;
              &lt;Hsp_midline&gt;M I+ DV+ +T +++DD+ D T AAK L+E GA +V Y TH + S + +EN+ ++ +D + TN++&lt;/Hsp_midline&gt;
            &lt;/Hsp&gt;
          &lt;/Hit_hsps&gt;
        &lt;/Hit&gt;
        &lt;Hit&gt;
          &lt;Hit_num&gt;2&lt;/Hit_num&gt;
          &lt;Hit_id&gt;gnl|BL_ORD_ID|2466&lt;/Hit_id&gt;
          &lt;Hit_def&gt;gi|51557505|ref|YP_068339.1| large tegument protein [Suid herpesvirus 1]&lt;/Hit_def&gt;
          &lt;Hit_accession&gt;2466&lt;/Hit_accession&gt;
          &lt;Hit_len&gt;3084&lt;/Hit_len&gt;
          &lt;Hit_hsps&gt;
            &lt;Hsp&gt;
              &lt;Hsp_num&gt;1&lt;/Hsp_num&gt;
              &lt;Hsp_bit-score&gt;48.9062&lt;/Hsp_bit-score&gt;
              &lt;Hsp_score&gt;115&lt;/Hsp_score&gt;
              &lt;Hsp_evalue&gt;6.70494e-06&lt;/Hsp_evalue&gt;
              &lt;Hsp_query-from&gt;369&lt;/Hsp_query-from&gt;
              &lt;Hsp_query-to&gt;875&lt;/Hsp_query-to&gt;
              &lt;Hsp_hit-from&gt;2312&lt;/Hsp_hit-from&gt;
              &lt;Hsp_hit-to&gt;2465&lt;/Hsp_hit-to&gt;
              &lt;Hsp_query-frame&gt;-2&lt;/Hsp_query-frame&gt;
              &lt;Hsp_identity&gt;52&lt;/Hsp_identity&gt;
              &lt;Hsp_positive&gt;70&lt;/Hsp_positive&gt;
              &lt;Hsp_gaps&gt;4&lt;/Hsp_gaps&gt;
              &lt;Hsp_align-len&gt;173&lt;/Hsp_align-len&gt;
              &lt;Hsp_qseq&gt;APESQEPGASTWRSSTSVVKKGQPSQK*CTSSVTSKAVPASWSTTWSTLPAPCATPPKR*KSAAPPRSTPTAPTRCCPAAPSRTSRIPSWTSWWSPTPSRCPLRRSPARVFASSTSPR-SSPKRSAASATKNRSAP---CSAKRNWPDHTAPPRAGLFALPPEAGRKPQGGLV&lt;/Hsp_qseq&gt;
              &lt;Hsp_hseq&gt;APPAQKPPAQPATAAATTAPKATPQTQPPTRAQTQTAPPPPSAAT-----AAAQVPPQ------PPSSQPAAKPRGAPPAPPAPP--PPSAQTTLPRPAAPPAPPPPS---AQTTLPRPAPPPPSAPAATPTPPAPGPAPSAKKSDGDRIVEPKAG---APPDVRDAKFGGKV&lt;/Hsp_hseq&gt;
              &lt;Hsp_midline&gt;AP +Q+P A ++ + K P + T + T A P + T A PP+ PP S P A R P AP P P P+ P P+ A +T PR + P SA +AT AP SAK++ D P+AG PP+ GG V&lt;/Hsp_midline&gt;
            &lt;/Hsp&gt;
          &lt;/Hit_hsps&gt;
        &lt;/Hit&gt;
      &lt;/Iteration_hits&gt;
      &lt;Iteration_stat&gt;
        &lt;Statistics&gt;
          &lt;Statistics_db-num&gt;68007&lt;/Statistics_db-num&gt;
          &lt;Statistics_db-len&gt;19518578&lt;/Statistics_db-len&gt;
          &lt;Statistics_hsp-len&gt;0&lt;/Statistics_hsp-len&gt;
          &lt;Statistics_eff-space&gt;0&lt;/Statistics_eff-space&gt;
          &lt;Statistics_kappa&gt;0.041&lt;/Statistics_kappa&gt;
          &lt;Statistics_lambda&gt;0.267&lt;/Statistics_lambda&gt;
          &lt;Statistics_entropy&gt;0.14&lt;/Statistics_entropy&gt;
        &lt;/Statistics&gt;
      &lt;/Iteration_stat&gt;
    &lt;/Iteration&gt;
</code></pre>
<p>Basically, each query search creates an Iteration element. Each iteration can have multiple hits, which in turn can have multiple HSPs. I would like to get only the first hit and its first HSP from each iteration. If BLAST found no hits, I would like to ignore the iteration. I worked up this simple code:</p>
<pre><code>#!/usr/bin/env python
from elementtree.ElementTree import parse
from elementtree import ElementTree as ET

file = open("/Applications/blast/blanes_viral_nr_results.xml", "r")
save_file = open("/Applications/blast/Blast_parse_ET.txt", 'w')

tree = parse(file)
elem = tree.getroot()
print elem
Per_ID = ()
save_file.write('&gt;%s\t%s\t%s\t%s\t%s\t%s\t\n\n\n\n' % ("It_Num\t", "It_ID\t", "Hit_Def\t", "Num\t", "ID\t", "ACC\t"))

iterations = tree.findall('BlastOutput_iterations/Iteration')

for iteration in iterations:
    for hit in iteration.findall('Iteration_hits/Hit'):
        It_Num = iteration.findtext('Iteration_iter-num')
        It_ID = iteration.findtext('Iteration_query-def')
        Hit_Def = hit.findtext('Hit_def')
        Num = hit.findtext('Hit_num')
        ID = hit.findtext('Hit_id')
        DEF = hit.findtext('Hit_def')
        ACC = hit.findtext('Hit_accession')
        save_file.write('&gt;%s\t%s\t%s\t%s\t%s\t%s\t' % (It_Num, It_ID[12:26], Hit_Def[1:10], Num, ID, ACC,))
        for hsp in hit.findall('Hit_hsps'):
            HSPN = hsp.findtext('Hsp/Hsp_num')
            identities = hsp.findtext('Hsp/Hsp_identity')
            #print 'id: ', identities.rjust(4),
            length = hsp.findtext('Hsp/Hsp_align-len')
            #print 'len:', length.rjust(4),
            Per_ID = int(identities) * 100.0 / int(length)
            #print hsp.findtext('Hsp/Hsp_qseq')[:50]
            #print hsp.findtext('Hsp/Hsp_midline')[:50]
            #print hsp.findtext('Hsp/Hsp_hseq')[:50]
            save_file.write('%s\t%s\t%s\t%s\n' % ('***', '%', HSPN, Per_ID))
        save_file.write('\n\n')
</code></pre>
<p>Any help would be greatly appreciated!</p>
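<p>For illustration, here is a minimal sketch of the selection described above: keep one hit and one HSP per iteration and skip iterations without hits. It assumes Python 3 and the standard-library <code>xml.etree.ElementTree</code> module rather than the <code>elementtree</code> package used in the question. The names <code>first_hit_records</code> and <code>hsp_bit_score</code>, the <code>best</code> flag, and the dictionary keys are hypothetical, and "best score" is assumed to mean the highest <code>Hsp_bit-score</code>.</p>

```python
import xml.etree.ElementTree as ET


def hsp_bit_score(hsp):
    """Return the bit score of one Hsp element as a float."""
    return float(hsp.findtext('Hsp_bit-score'))


def first_hit_records(source, best=False):
    """Yield one dict per Iteration that has at least one Hit.

    source: file name or binary file object holding BLAST XML output.
    best:   False picks the first Hit and its first Hsp (document order);
            True picks the Hsp with the highest bit score in the iteration
            (an assumed reading of "best score").
    """
    tree = ET.parse(source)
    for iteration in tree.findall('BlastOutput_iterations/Iteration'):
        hits = iteration.findall('Iteration_hits/Hit')
        if not hits:
            continue  # e.g. "No hits found": ignore this iteration
        if best:
            # Pair every Hsp with its parent Hit, then keep the top scorer.
            pairs = [(hit, hsp) for hit in hits
                     for hsp in hit.findall('Hit_hsps/Hsp')]
            if not pairs:
                continue
            hit, hsp = max(pairs, key=lambda pair: hsp_bit_score(pair[1]))
        else:
            hit = hits[0]
            hsp = hit.find('Hit_hsps/Hsp')  # first Hsp, or None
            if hsp is None:
                continue
        identities = int(hsp.findtext('Hsp_identity'))
        length = int(hsp.findtext('Hsp_align-len'))
        yield {
            'iter_num': iteration.findtext('Iteration_iter-num'),
            'query_def': iteration.findtext('Iteration_query-def'),
            'hit_num': hit.findtext('Hit_num'),
            'hit_id': hit.findtext('Hit_id'),
            'hit_def': hit.findtext('Hit_def'),
            'accession': hit.findtext('Hit_accession'),
            'hsp_num': hsp.findtext('Hsp_num'),
            'bit_score': hsp_bit_score(hsp),
            'percent_identity': identities * 100.0 / length,
        }
```

<p>Each yielded record could then be written out as one tab-separated line, in the same way the original script writes to <code>save_file</code>. The input must be well-formed XML, so a truncated excerpt like the sample above would need its closing tags restored before parsing.</p>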