OutOfMemoryError when parsing XML in Clojure with data.zip
<p>I want to use Clojure to extract the titles from a Wiktionary XML dump.</p>
<p>I used <code>head -n10000 &gt; out-10000.xml</code> to create smaller versions of the original monster file. Then I trimmed with a text editor to make it valid XML. I renamed the files according to the number of lines inside (<code>wc -l</code>):</p>
<pre><code>(def data-9764 "data/wiktionary-en-9764.xml") ; 354K
(def data-99224 "data/wiktionary-en-99224.xml") ; 4.1M
(def data-995066 "data/wiktionary-en-995066.xml") ; 34M
(def data-7999931 "data/wiktionary-en-7999931.xml") ; 222M
</code></pre>
<p>Here is the overview of the XML structure:</p>
<pre><code>&lt;mediawiki&gt;
  &lt;page&gt;
    &lt;title&gt;dictionary&lt;/title&gt;
    &lt;revision&gt;
      &lt;id&gt;20100608&lt;/id&gt;
      &lt;parentid&gt;20056528&lt;/parentid&gt;
      &lt;timestamp&gt;2013-04-06T01:14:29Z&lt;/timestamp&gt;
      &lt;text xml:space="preserve"&gt; ... &lt;/text&gt;
    &lt;/revision&gt;
  &lt;/page&gt;
&lt;/mediawiki&gt;
</code></pre>
<p>Here is what I've tried, based on <a href="https://stackoverflow.com/a/9595315/109618">this answer to 'Clojure XML Parsing'</a>:</p>
<pre><code>(ns example.core
  (:use [clojure.data.zip.xml :only (attr text xml-&gt;)])
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]))

(defn titles
  "Extract titles from +filename+"
  [filename]
  (let [xml (xml/parse filename)
        zipped (zip/xml-zip xml)]
    (xml-&gt; zipped :page :title text)))

(count (titles data-9764))
; 38

(count (titles data-99224))
; 779

(count (titles data-995066))
; 5172

(count (titles data-7999931))
; OutOfMemoryError Java heap space java.util.Arrays.copyOfRange (Arrays.java:3209)
</code></pre>
<p>Am I doing something wrong in my code? Or is this perhaps a bug or limitation in the libraries I'm using? Based on REPL experimentation, it seems like the code I'm using is lazy. 
Underneath, Clojure uses a SAX XML parser, so that alone should not be the problem.</p> <p>See also:</p> <ul> <li><a href="https://stackoverflow.com/questions/11213083/does-clojure-xml-parse-return-a-lazy-sequence">Does clojure-xml/parse return a lazy sequence?</a></li> <li><a href="https://stackoverflow.com/questions/9939844/huge-xml-in-clojure">Huge XML in Clojure</a></li> </ul> <p><strong>Update 2013-04-30:</strong></p> <p>I'd like to share some discussion from the clojure IRC channel. I've pasted an edited version below. (I removed the user names, but if you want credit, just let me know; I'll edit and give you a link.)</p> <blockquote> <p>The entire tag is read into memory at once in <code>xml/parse</code>, long before you even call count. And <code>clojure.xml</code> uses the ~lazy SAX parser to produce an eager concrete collection. Processing XML lazily requires a lot more work than you think - and it would be work <em>you</em> do, not some magic <code>clojure.xml</code> could do for you. Feel free to disprove by calling <code>(count (xml/parse data-whatever))</code>.</p> </blockquote> <p>To summarize, even before using <code>zip/xml-zip</code>, this <code>xml/parse</code> causes an <code>OutOfMemoryError</code> with a large enough file:</p> <pre><code>(count (xml/parse filename)) </code></pre> <p>At present, I am exploring other XML processing options. At the top of my list is <a href="https://github.com/clojure/data.xml" rel="nofollow noreferrer">clojure.data.xml</a> as mentioned at <a href="https://stackoverflow.com/a/9946054/109618">https://stackoverflow.com/a/9946054/109618</a>.</p>
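<p>One option that avoids building the whole tree is to pull events straight from the JDK's StAX reader via Java interop. The following is only a minimal sketch and is not taken from the discussion above: the namespace <code>example.stream</code> and the function <code>reduce-titles</code> are hypothetical names, and it assumes that every <code>&lt;title&gt;</code> element of interest sits inside a <code>&lt;page&gt;</code>, as in the structure overview.</p>

```clojure
(ns example.stream
  (:require [clojure.java.io :as io])
  (:import (java.io InputStream)
           (javax.xml.stream XMLInputFactory XMLStreamConstants XMLStreamReader)))

(defn reduce-titles
  "Hypothetical helper: folds `f` over the text of every <title> element in
  the XML file at `filename`, starting from `init`. Only the current parser
  event is held in memory; no element tree is built."
  [f init filename]
  (with-open [^InputStream in (io/input-stream filename)]
    (let [factory (XMLInputFactory/newInstance)
          ^XMLStreamReader reader (.createXMLStreamReader factory in)]
      (loop [acc init]
        (if (.hasNext reader)
          (if (and (= (.next reader) XMLStreamConstants/START_ELEMENT)
                   ;; Assumption: matching on the local name alone is enough,
                   ;; i.e. no unrelated <title> elements occur in the dump.
                   (= "title" (.getLocalName reader)))
            ;; getElementText reads up to the matching end tag.
            (recur (f acc (.getElementText reader)))
            (recur acc))
          acc)))))
```

<p>Under those assumptions, <code>(reduce-titles (fn [n _] (inc n)) 0 data-7999931)</code> would count the titles, and <code>(reduce-titles conj [] data-9764)</code> would collect them into a vector. Memory use should then depend on what <code>f</code> accumulates rather than on the size of the file.</p>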
 
