  1. Peek into stream of Popen pipeline in Python
<p><strong>Background:</strong><br> Python 2.6.6 on Linux. First part of a DNA sequence analysis pipeline.<br> I want to read a possibly gzipped file from a mounted remote storage (LAN). If it is gzipped, I want to gunzip it to a stream (i.e. using <code>gunzip FILENAME -c</code>). If the first character of the stream (file) is "@", I want to route the entire stream into a filtering program that takes input on standard input; otherwise, just pipe it directly to a file on local disk. I'd like to minimize the number of file reads/seeks from remote storage (just a single pass through the file shouldn't be impossible?).</p>

<p>Contents of an example input file, first four lines corresponding to one record in FASTQ format: </p>

<pre><code>@I328_1_FC30MD2AAXX:8:1:1719:1113/1
GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG
+I328_1_FC30MD2AAXX:8:1:1719:1113/1
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhahhhhhhfShhhYhhQhh]hhhhffhU\UhYWc
</code></pre>

<p>Files that should not be piped into the filtering program contain records that look like this (first two lines corresponding to one record in FASTA format): </p>

<pre><code>&gt;I328_1_FC30MD2AAXX:8:1:1719:1113/1
GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG
</code></pre>

<p>Some made-up semi-pseudocode to visualize what I want to do (I know this isn't possible the way I've written it). I hope it makes some sense:</p>

<pre><code>if gzipped:
    gunzip = Popen(["gunzip", "-c", "remotestorage/file.gz"], stdout=PIPE)
    if gunzip.stdout.peek(1) == "@": # This isn't possible
        fastq = True
    else:
        fastq = False
if fastq:
    filter = Popen(["filter", "localstorage/outputfile.fastq"], stdin=gunzip.stdout).communicate()
else:
    # Send the gunzipped stream to another file
</code></pre>

<p>Disregard the fact that the code won't run as written here and that it has no error handling etc.; all that is already in my other code. I just want help with peeking into the stream or finding a way around that. It would be great if you could <code>gunzip.stdout.peek(1)</code>, but I realize that's not possible. </p>

<p><strong>What I've tried so far:</strong><br> I figured subprocess.Popen might help me achieve this, and I've tried a lot of different ideas, among others trying to use some kind of io.BufferedRandom() object to write the stream to, but I can't figure out how that would work. I know streams are non-seekable, but maybe a workaround might be to read the first character of the gunzip stream and then create a new stream where you first input a "@" or ">" depending on file contents and then stuff the rest of the gunzip.stdout stream into the new stream. This new stream would then be fed into filter's Popen stdin. </p>

<p>Note that the file sizes might be several times larger than available memory. I do not want to perform more than one single read of the source file from remote storage, and no unnecessary file access. </p>

<p>Any ideas are welcome! Please ask me questions so I can clarify if I didn't make it clear enough.</p>
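To make the workaround described above concrete, here is a minimal sketch of the "read one character, then re-emit it in front of the rest" idea. It is an illustration under assumptions, not a tested solution for the setup in the question: it is written for Python 3 (the question uses Python 2.6.6), the function name `route_by_first_byte` and its parameters are hypothetical, and in-memory `io.BytesIO` objects stand in for the gunzip pipe and the two destinations.

```python
import io
import shutil


def route_by_first_byte(src, fastq_sink, other_sink, chunk_size=64 * 1024):
    """Read one byte from src, choose a sink, then stream everything to it.

    Hypothetical helper: src, fastq_sink and other_sink are binary
    file-like objects. Returns True if the first byte is "@".
    """
    first = src.read(1)  # consumes one byte; a pipe cannot seek back
    is_fastq = first == b"@"
    sink = fastq_sink if is_fastq else other_sink
    sink.write(first)  # re-emit the consumed byte ahead of the rest
    # Copy the remainder in fixed-size chunks so memory use stays bounded
    # and the source is read in a single pass.
    shutil.copyfileobj(src, sink, chunk_size)
    return is_fastq


if __name__ == "__main__":
    # In-memory stand-ins for the gunzip stream and the two destinations.
    record = b"@I328_1\nGTTATT\n+I328_1\nhhhhhh\n"
    fastq_out, fasta_out = io.BytesIO(), io.BytesIO()
    print(route_by_first_byte(io.BytesIO(record), fastq_out, fasta_out))
```

In the pipeline from the question, `src` would presumably be `gunzip.stdout`, `fastq_sink` the `stdin` of a filter process started with `stdin=PIPE`, and `other_sink` a local file opened in binary mode; the filter's `stdin` would need to be closed after the copy so the filter sees end of input.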