
Optimising a Haskell XML parser
<p>I'm experimenting with Haskell at the moment and am very much enjoying the experience, but I'm evaluating it for a real project with some fairly stringent performance requirements. The first pass of my task is to process a complete (no-history) dump of Wikipedia (bzipped), totalling about 6GB compressed. In Python a script to do a full extract of each raw page (about 10 million in total) takes about 30 minutes on my box (and for reference a Scala implementation using the pull parser takes about 40 minutes). I've been attempting to replicate this performance using Haskell and GHC, and have been struggling to match it.</p> <p>I've been using Codec.Compression.BZip for decompression and hexpat for parsing. I'm using lazy bytestrings as the input to hexpat and strict bytestrings for the element text type. To extract the text for each page I'm building up a DList of pointers to text elements and then iterating over this to dump it out to stdout. The code I've just described has already been through a number of profiling/refactor iterations (I quickly moved from strings to bytestrings, then from string concatenation to lists of pointers to text, then to DLists of pointers to text). I think I've got about two orders of magnitude speedup from the original code, but it still takes over an hour and a half to parse (although it has a lovely small memory footprint). So I'm looking for a bit of inspiration from the community to get me the extra mile. The code is below (and I've broken it up into a number of subfunctions in order to get more detail from the profiler). Please excuse my Haskell - I've only been coding for a couple of days (having spent a week with Real World Haskell).
And thanks in advance!</p> <pre><code>import System.Exit
import Data.Maybe
import Data.List
import Data.DList (DList)
import qualified Data.DList as DList
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy as LazyByteString
import qualified Codec.Compression.BZip as BZip
import Text.XML.Expat.Proc
import Text.XML.Expat.Tree
import Text.XML.Expat.Format

testFile = "../data/enwiki-latest-pages-articles.xml.bz2"

validPage pageData = case pageData of
    (Just _, Just _) -&gt; True
    (_, _)           -&gt; False

scanChildren :: [UNode ByteString] -&gt; DList ByteString
scanChildren c = case c of
    h:t -&gt; DList.append (getContent h) (scanChildren t)
    []  -&gt; DList.fromList []

getContent :: UNode ByteString -&gt; DList ByteString
getContent treeElement = case treeElement of
    (Element name attributes children) -&gt; scanChildren children
    (Text text)                        -&gt; DList.fromList [text]

rawData t = ((getContent.fromJust.fst) t, (getContent.fromJust.snd) t)

extractText page = do
    revision &lt;- findChild (BS.pack "revision") page
    text &lt;- findChild (BS.pack "text") revision
    return text

pageDetails tree =
    let pageNodes = filterChildren relevantChildren tree in
    let getPageData page = (findChild (BS.pack "title") page, extractText page) in
    map rawData $ filter validPage $ map getPageData pageNodes
    where
        relevantChildren node = case node of
            (Element name attributes children) -&gt; name == (BS.pack "page")
            (Text _)                           -&gt; False

outputPages pagesText = do
    let flattenedPages = map DList.toList pagesText
    mapM_ (mapM_ BS.putStr) flattenedPages

readCompressed fileName = fmap BZip.decompress (LazyByteString.readFile fileName)

parseXml byteStream = parse defaultParseOptions byteStream :: (UNode ByteString, Maybe XMLParseError)

main = do
    rawContent &lt;- readCompressed testFile
    let (tree, mErr) = parseXml rawContent
    let pages = pageDetails tree
    let pagesText = map snd pages
    outputPages pagesText
    putStrLn "Complete!"
    exitWith ExitSuccess
</code></pre>
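For what it's worth, the DList accumulation pattern described above can be isolated in a tiny standalone sketch. The `Node` type and `demo` tree here are made up for illustration (standing in for hexpat's `UNode`); the only dependency beyond base is the dlist package the question already uses:

```haskell
import Data.DList (DList)
import qualified Data.DList as DList

-- Toy tree type standing in for hexpat's UNode (assumption: the real
-- tree would come from Text.XML.Expat.Tree).
data Node = Elem [Node] | Txt String

-- Collect every text fragment left-to-right. DList.append is O(1),
-- which avoids the quadratic blow-up of repeated (++) on plain lists.
collect :: Node -> DList String
collect (Txt s)   = DList.singleton s
collect (Elem cs) = foldr (DList.append . collect) DList.empty cs

-- A small hypothetical document for demonstration.
demo :: Node
demo = Elem [Txt "Hello, ", Elem [Txt "DList"], Txt "!"]

main :: IO ()
main = putStrLn (concat (DList.toList (collect demo)))
```

The single `DList.toList` at the end materialises the whole result in one pass, which is the same shape as `outputPages` in the full program.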
 
