After running your program I get somewhat weird results:

```
./wikiparse +RTS -s -A5m -H5m | tail
./wikiparse +RTS -s -A5m -H5m
3,604,204,828,592 bytes allocated in the heap
   70,746,561,168 bytes copied during GC
       39,505,112 bytes maximum residency (37822 sample(s))
        2,564,716 bytes maximum slop
               83 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 620343 collections, 0 parallel, 15.84s, 368.69s elapsed
  Generation 1:  37822 collections, 0 parallel,  1.08s,  33.08s elapsed

  INIT  time    0.00s  (   0.00s elapsed)
  MUT   time  243.85s  (4003.81s elapsed)
  GC    time   16.92s  ( 401.77s elapsed)
  EXIT  time    0.00s  (   0.00s elapsed)
  Total time  260.77s  (4405.58s elapsed)

  %GC time       6.5%  (9.1% elapsed)

  Alloc rate    14,780,341,336 bytes per MUT second

  Productivity  93.5% of total user, 5.5% of total elapsed
```

The total time is more than OK, I think: 260 s is way faster than the 30 minutes for Python. I have no idea, though, why the elapsed time is so big here: 260 s of user time against 4405 s elapsed means the process spent most of its run waiting rather than computing, and I really don't think that reading a 6 GB file should take more than an hour to complete.

I'm running your program again to check whether the results are consistent.

If the result of those 4′20″ of CPU time is right, then I believe something is wrong with the machine... or there is some other strange effect here.

The code was compiled with GHC 7.0.2.

---

Edit: I tried various versions of the program above. The most important optimizations seem to be the `{-# INLINE #-}` pragma and specialization of functions. Some of the functions have pretty generic types, which is known to be bad for performance. On the other hand, I believe inlining should be enough to trigger the specialization, so you should experiment further with this.

I didn't see any significant difference across the versions of GHC I tried (6.12 .. HEAD).

The Haskell bindings to bzlib seem to have optimal performance. The following program, which is a near-complete reimplementation of the standard `bzcat` program, is as fast as or even faster than the original:

```haskell
module Main where

import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip
import System.Environment (getArgs)

readCompressed fileName = fmap BZip.decompress (BSL.readFile fileName)

main :: IO ()
main = do
    files <- getArgs
    mapM_ (\f -> readCompressed f >>= BSL.putStr) files
```

On my machine it takes ~1100 s to decompress the test file to `/dev/null`.

The fastest version I was able to get is based on a SAX-style parser. I'm not sure, though, whether its output matches that of the original: on small inputs the result is the same, and so is the performance. On the original file the SAX version is somewhat faster and completes in ~2400 s. You can find it below.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import System.Exit
import Data.Maybe
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip
import System.IO

import Text.XML.Expat.SAX as SAX

type ByteStringL = BSL.ByteString
type Token       = ByteString
type TokenParser = [SAXEvent Token Token] -> [[Token]]

testFile = "/tmp/enwiki-latest-pages-articles.xml.bz2"

readCompressed :: FilePath -> IO ByteStringL
readCompressed fileName = fmap BZip.decompress (BSL.readFile fileName)

{-# INLINE pageStart #-}
pageStart :: TokenParser
pageStart ((StartElement "page" _):xs) = titleStart xs
pageStart (_:xs)                       = pageStart xs
pageStart []                           = []

{-# INLINE titleStart #-}
titleStart :: TokenParser
titleStart ((StartElement "title" _):xs) = finish "title" revisionStart xs
titleStart ((EndElement "page"):xs)      = pageStart xs
titleStart (_:xs)                        = titleStart xs
titleStart []                            = error "could not find <title>"

{-# INLINE revisionStart #-}
revisionStart :: TokenParser
revisionStart ((StartElement "revision" _):xs) = textStart xs
revisionStart ((EndElement "page"):xs)         = pageStart xs
revisionStart (_:xs)                           = revisionStart xs
revisionStart []                               = error "could not find <revision>"

{-# INLINE textStart #-}
textStart :: TokenParser
textStart ((StartElement "text" _):xs) = textNode [] xs
textStart ((EndElement "page"):xs)     = pageStart xs
textStart (_:xs)                       = textStart xs
textStart []                           = error "could not find <text>"

{-# INLINE textNode #-}
textNode :: [Token] -> TokenParser
textNode acc ((CharacterData txt):xs) = textNode (txt:acc) xs
textNode acc xs                       = reverse acc : textEnd xs

{-# INLINE textEnd #-}
textEnd {- , revisionEnd, pageEnd -} :: TokenParser
textEnd = finish "text" . finish "revision" . finish "page" $ pageStart
--revisionEnd = finish "revision" pageEnd
--pageEnd = finish "page" pageStart

{-# INLINE finish #-}
finish :: Token -> TokenParser -> TokenParser
finish tag cont ((EndElement el):xs) | el == tag = cont xs
finish tag cont (_:xs) = finish tag cont xs
finish tag _ []        = error (show (tag, "finish []" :: String))

main :: IO ()
main = do
    rawContent <- readCompressed testFile
    let parsed = pageStart (SAX.parse defaultParseOptions rawContent)
    mapM_ (mapM_ BS.putStr) ({- take 5000 -} parsed) -- remove the comment to finish early
    putStrLn "Complete!"
```

Generally I'm suspicious that the Python and Scala versions are finishing early. I couldn't verify that claim, though, without their source code.

To sum up: inlining and specialization should give a reasonable, roughly two-fold increase in performance.
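The specialization discussed above can also be requested explicitly instead of relying on inlining. A minimal sketch of the idea (the `sumTo` function is a made-up example, not taken from the program above): a function with a generic `Num a` type receives its dictionary at runtime, and a `{-# SPECIALIZE #-}` pragma asks GHC to compile a monomorphic copy for a concrete type.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- sumTo is generic: compiled polymorphically, every (+) goes through
-- the Num dictionary passed in at runtime.
sumTo :: Num a => a -> [a] -> a
sumTo !acc []     = acc
sumTo !acc (x:xs) = sumTo (acc + x) xs
{-# INLINE sumTo #-}
-- Ask for a dedicated monomorphic copy at Int, where GHC can use
-- plain machine addition with no dictionary indirection:
{-# SPECIALIZE sumTo :: Int -> [Int] -> Int #-}

main :: IO ()
main = print (sumTo 0 ([1 .. 1000] :: [Int]))
```

Whether the pragma helps depends on how the function is used; `-ddump-simpl` shows whether a specialized version was actually generated and called.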
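The user-versus-elapsed gap in the RTS statistics can also be checked from inside a program by comparing CPU time against wall-clock time around a piece of work; if the wall time is much larger, the process is blocked (e.g. on disk reads) rather than computing. A sketch using the `time` boot package (the `timed` helper is hypothetical, not part of the program above):

```haskell
module Main where

import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.CPUTime (getCPUTime)

-- Run an action and return its result with CPU and wall-clock seconds.
-- wall >> cpu means the action spent most of its time waiting.
timed :: IO a -> IO (a, Double, Double)
timed act = do
    w0 <- getCurrentTime
    c0 <- getCPUTime
    r  <- act
    c1 <- getCPUTime
    w1 <- getCurrentTime
    let cpu  = fromIntegral (c1 - c0) / 1e12   -- getCPUTime is in picoseconds
        wall = realToFrac (diffUTCTime w1 w0)
    return (r, cpu, wall)

main :: IO ()
main = do
    (s, cpu, wall) <- timed (return $! sum [1 .. 100000 :: Int])
    print s
    putStrLn ("cpu: " ++ show cpu ++ " s, wall: " ++ show wall ++ " s")
```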
 
