<p>I think nested iteratees are the correct approach, but this case has some unique problems which make it slightly different from most common examples.</p>
<p><strong>Chunks and groups</strong></p>
<p>The first problem is to get the data source right. Basically the logical divisions you've described would give you a stream equivalent to <code>[[ByteString]]</code>. If you create an enumerator to produce this directly, each element within the stream would be a full group of chunks, which presumably you wish to avoid (for memory reasons). You could flatten everything into a single <code>[ByteString]</code>, but then you'd need to re-introduce boundaries, which would be pretty wasteful since the db is doing it for you.</p>
<p>Ignoring the stream of groups for now, it appears that you need to divide the data into chunks yourself. I would model this as:</p>
<pre><code>enumGroup :: Enumerator ByteString IO a
enumGroup = enumFromCallback cb ()
 where
  cb () = do
    -- 'data' is a reserved word in Haskell, so the payload is bound to 'dat'
    (code, dat) <- getResultData
    case code of
        OPERATION_SUCCEEDED -> return $ Right ((True, ()), dat)
        NO_MORE_DATA        -> return $ Right ((False, ()), dat)
        GET_DATA_FAILED     -> return $ Left MyException
</code></pre>
<p>Since chunks are of a fixed size, you can easily chunk this with <code>Data.Iteratee.group</code>.</p>
<pre><code>enumGroupChunked :: Iteratee [ByteString] IO a -> IO (Iteratee ByteString IO a)
enumGroupChunked = enumGroup . joinI . group groupSize
</code></pre>
<p>Compare the type of this to <code>Enumerator</code>:</p>
<pre><code>type Enumerator s m a = Iteratee s m a -> m (Iteratee s m a)
</code></pre>
<p>So <code>enumGroupChunked</code> is basically a fancy enumerator which changes the stream type. This means that it takes a <code>[ByteString]</code> iteratee consumer, and returns an iteratee which consumes plain bytestrings. 
Often the return type of an enumerator doesn't matter; it's simply an iteratee which you evaluate with <code>run</code> (or <code>tryRun</code>) to get at the output, so you could do the same here:</p>
<pre><code>evalGroupChunked :: Iteratee [ByteString] IO a -> IO a
evalGroupChunked i = enumGroupChunked i >>= run
</code></pre>
<p>If you have more complicated processing to do on each group, the easiest place to do so would be in the <code>enumGroupChunked</code> function.</p>
<p><strong>Stream of groups</strong></p>
<p>Now this is out of the way, what to do about the stream of groups? The answer depends on how you want to consume them. If you want to essentially treat each group in the stream independently, I would do something similar to this:</p>
<pre><code>foldStream :: Iteratee [ByteString] IO a -> (b -> a -> b) -> b -> IO b
foldStream iter f acc0 = do
  val <- evalGroupChunked iter
  res <- getNextItem
  case res of
        OPERATION_SUCCEEDED -> foldStream iter f $! f acc0 val
        NO_MORE_DATA        -> return $ f acc0 val
        GET_DATA_FAILED     -> error "had a problem"
</code></pre>
<p>However, let's say you want to do some sort of stream processing of the entire dataset, not just individual groups. That is, you have a</p>
<pre><code>bigProc :: Iteratee [ByteString] IO a
</code></pre>
<p>that you want to run over the entire dataset. This is where the return iteratee of an enumerator is useful. Some earlier code will be slightly different now:</p>
<pre><code>enumGroupChunked' :: Iteratee [ByteString] IO a
  -> IO (Iteratee ByteString IO (Iteratee [ByteString] IO a))
enumGroupChunked' = enumGroup . group groupSize

-- the result lives in IO, since every branch runs an IO action
procStream :: Iteratee [ByteString] IO a -> IO a
procStream iter = do
  i' <- enumGroupChunked' iter >>= run
  res <- getNextItem
  case res of
        OPERATION_SUCCEEDED -> procStream i'
        NO_MORE_DATA        -> run i'
        GET_DATA_FAILED     -> error "had a problem"
</code></pre>
<p>This usage of nested iteratees (i.e. 
<code>Iteratee s1 m (Iteratee s2 m a)</code>) is slightly uncommon, but it's particularly helpful when you want to sequentially process data from multiple Enumerators. The key is to recognize that <code>run</code>ing the outer iteratee will give you an iteratee which is ready to receive more data. It's a model that works well in this case, because you can enumerate each group independently but process them as a single stream.</p>
<p>One caution: the inner iteratee will be in whatever state it was left in. Suppose that the last chunk of a group may be smaller than a full chunk, e.g.</p>
<pre><code>   Group A               Group B               Group C
   1024, 1024, 512       1024, 1024, 1024      1024, 1024, 1024
</code></pre>
<p>What will happen in this case is that, because <code>group</code> is combining data into chunks of size 1024, it will combine the last chunk of Group A with the first 512 bytes of Group B. This isn't a problem with the <code>foldStream</code> example because that code terminates the inner iteratee (with <code>joinI</code>). That means the groups are truly independent, so you have to treat them as such. If you want to combine the groups as in <code>procStream</code>, you have to think of the entire stream. 
If that applies to you, then you'll need to use something more sophisticated than just <code>group</code>.</p>
<p><strong>Data.Iteratee vs Data.Enumerator</strong></p>
<p>Without getting into a debate about the merits of either package, not to mention <a href="http://hackage.haskell.org/package/iterIO" rel="nofollow">IterIO</a> (I'm admittedly biased), I would like to point out what I consider the most significant difference between the two: the abstraction of the stream.</p>
<p>In Data.Iteratee, a consumer <code>Iteratee ByteString m a</code> operates on a notional ByteString of some length, with access to a single chunk of <code>ByteString</code> at one time.</p>
<p>In Data.Enumerator, a consumer <code>Iteratee ByteString m a</code> operates on a notional <code>[ByteString]</code>, with access to one or more elements (bytestrings) at one time.</p>
<p>This means that most Data.Iteratee operations are element-focused, that is, with an <code>Iteratee ByteString</code> they'll operate on a single <code>Word8</code>, whereas Data.Enumerator operations are chunk-focused, operating on a <code>ByteString</code>.</p>
<p>You can think of <code>Data.Iteratee.Iteratee [s] m a</code> === <code>Data.Enumerator.Iteratee s m a</code>.</p>
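<p>As a footnote to the earlier caution about group boundaries, here is a minimal sketch that models the effect with plain lists rather than iteratees. It uses nothing from the iteratee package: <code>chunksOf</code>, <code>perGroup</code> and <code>wholeStream</code> are hypothetical helper names introduced only for illustration, and ordinary lists stand in for bytestrings. <code>perGroup</code> re-chunks each group on its own, roughly what you get when the inner iteratee is terminated per group as in <code>foldStream</code>; <code>wholeStream</code> re-chunks the concatenation, roughly what you get when state is carried across groups as in <code>procStream</code>.</p>
<pre><code>-- Plain-list model of the boundary caution; these helpers are
-- illustrative assumptions, not part of Data.Iteratee.

-- Split a list into pieces of n elements; the last piece may be shorter.
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | n < 1     = error "chunksOf: size must be positive"
  | null xs   = []
  | otherwise = let (h, t) = splitAt n xs in h : chunksOf n t

-- Chunk every group separately, so no chunk crosses a group boundary.
perGroup :: Int -> [[a]] -> [[[a]]]
perGroup n = map (chunksOf n)

-- Chunk the concatenation of all groups, so a short tail of one group
-- is topped up with elements from the start of the next group.
wholeStream :: Int -> [[a]] -> [[a]]
wholeStream n = chunksOf n . concat
</code></pre>
<p>With two groups of 2560 and 3072 elements and a chunk size of 1024, <code>perGroup</code> gives chunk lengths <code>[[1024,1024,512],[1024,1024,1024]]</code>, whereas <code>wholeStream</code> gives <code>[1024,1024,1024,1024,1024,512]</code>, and its third chunk mixes the last 512 elements of the first group with the first 512 of the second.</p>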
