<p>Just to be clear, <code>(:content (data.xml/parse rdr :coalescing false))</code> IS lazy. Check its class or pull the first item (it will return instantly) if you're not convinced.</p>
<p>That said, there are a couple of things to watch out for when processing large sequences: holding onto the head, and unrealized/nested laziness. I think your code suffers from the latter.</p>
<p>Here's what I recommend:</p>
<p>1) Add <code>(dorun)</code> to the end of the <code>-&gt;&gt;</code> chain of calls. This will force the sequence to be fully realized without holding onto the head.</p>
<p>2) Change <code>for</code> in <code>process-page</code> to <code>doseq</code>. You're spitting to a file, which is a side effect, and you don't want to do that lazily here.</p>
<p>As Arthur recommends, you may want to open an output file once and keep writing to it, rather than opening &amp; writing (spit) for every Wikipedia entry.</p>
<p><strong>UPDATE</strong>:</p>
<p>Here's a rewrite which attempts to separate concerns more clearly:</p>
<pre><code>;; Assumes clojure.java.io is aliased as io and clojure.data.xml as data.xml;
;; article-title, revision-user and revision-timestamp are assumed to be
;; defined as in the question.

(defn filter-tag [tag xml]
  (filter #(= tag (:tag %)) xml))

;; lazy
(defn revision-seq [xml]
  (for [page (filter-tag :page (:content xml))
        :let [title (article-title page)]
        revision (filter-tag :revision (:content page))
        :let [user (revision-user revision)
              time (revision-timestamp revision)]]
    [time user title]))

;; eager
(defn transform [in out]
  (with-open [r (io/input-stream in)
              w (io/writer out)]
    ;; bind *out* to the writer w (not the file name) so println writes to the file
    (binding [*out* w]
      (let [xml (data.xml/parse r :coalescing false)]
        (doseq [[time user title] (revision-seq xml)]
          ;; println already appends a newline
          (println (str "\"" time "\";\"" user "\";\"" title "\"")))))))

(transform "dump.xml" "data.csv")
</code></pre>
<p>I don't see anything here that would cause excessive memory use.</p>
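<p>To illustrate recommendations 1 and 2 above, here is a minimal sketch of how side effects inside <code>for</code> stay pending until something consumes the sequence, while <code>dorun</code> and <code>doseq</code> run them right away. The names <code>log</code>, <code>record!</code> and <code>pending</code> are hypothetical and exist only for this example; they are not part of the question's code.</p>
<pre><code>;; Hypothetical helpers, for illustration only.
(def log (atom []))

(defn record! [x]
  (swap! log conj x))

;; for returns a lazy sequence: no side effect has run yet,
;; so @log is still [] after this line.
(def pending (for [x [1 2 3]] (record! x)))

;; dorun walks the sequence for its side effects and discards the results,
;; so it does not retain the head. @log is now [1 2 3].
(dorun pending)

;; doseq is eager: the body runs immediately for every element.
;; @log is now [1 2 3 4 5 6].
(doseq [x [4 5 6]] (record! x))
</code></pre>
<p>When the body exists only for its side effect, as with writing to a file, <code>doseq</code> is usually the simpler choice because there is no sequence left over to realize.</p>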