Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>As noted in my comment to the question, this seems like a very odd file format, thus my original confusion.</p> <p>Note: if this is one of the standard biology file formats, you would probably be better off using <a href="http://www.biopython.org/DIST/docs/tutorial/Tutorial.html" rel="nofollow">Biopython</a> or something similar to parse it.</p> <p>If you want to do your own parsing, regular expressions still seem like the wrong way to go - difficult to read and unnecessary with a simple space/tab separated file. I'm assuming you have parsed your chromosome fasta file and have the sequences of all the chromosomes as a <code>chrom_sequences</code> name:seq dictionary, and also that you have a <code>reverse_complement</code> function (both of those things are easily implemented by hand but probably better done with biopython). </p> <pre><code>fields = line.split(' ') # or '\t' instead of ' ' if the file is tab-separated gene_ID, chr, strand = fields[:2] start, end = [int(x) for x in fields[3:5]) this_chromosome_seq = chrom_sequences[chr] # if strand is +, just take the sequence based on the start-end position if strand == '+': # be careful about off-by-one errors here - some formats give you a 1-based position, # other formats make it 0-based, and they can also be end-inclusive or end-exclusive gene_sequence = this_chromosome_seq[start:end] # if your coordinates really are given as antisense strand coordinates when strand is -, # you just need to subtract them from the chromosome length to get sense-strand coordinates, # (switching start and end so they're still in smaller-to-larger order), # and then reverse-complement the resulting sequence. if strand == '-': chrom_length = len(this_chromosome_seq) # again, be careful about off-by-one errors here! new_start,new_end = chrom_length-end, chrom_length-start gene_sequence = reverse_complement(this_chromosome_seq[new_start:new_end]) </code></pre> <p><strong>Original answer, didn't actually do what was asked:</strong></p> <p>If you just want to get the start/end positions, do something like this:</p> <pre><code>fields = line.split(' ') # or '\t' instead of ' ' if the file is tab-separated gene_ID, chr, strand = fields[:2] start = int(fields[3]) end = int(fields[4]) if strand == '-': start,end = end,start </code></pre> <p>Then you'll have to parse your fasta file to actually get the sequence for these start-end coordinates, and reverse-complement it if <code>strand=='-'</code>. Again, I think Biopython can do most of that for you.</p> <p>Another note - <a href="http://www.biostars.org" rel="nofollow">Biostar</a> is a good StackExchange-type site specifically for bioinformatics, you may get better answers there.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload