Processing large XML file with libxml-ruby chunk by chunk
<p>I'd like to read a large <a href="http://dblp.uni-trier.de/xml/" rel="nofollow noreferrer">XML</a> file that contains over a million small bibliographic records (like <code>&lt;article&gt;...&lt;/article&gt;</code>) using libxml in Ruby. I have tried the Reader class in combination with the <code>expand</code> method to read record by record, but I am not sure this is the right approach since my code eats up memory. Hence, I'm looking for a recipe for conveniently processing the file record by record with constant memory usage. Below is my main loop:</p>

<pre><code>File.open('dblp.xml') do |io|
  dblp = XML::Reader.io(io, :options =&gt; XML::Reader::SUBST_ENTITIES)
  pubFactory = PubFactory.new
  i = 0
  while dblp.read do
    case dblp.name
      when 'article', 'inproceedings', 'book'
        pub = pubFactory.create(dblp.expand)
        i += 1
        puts pub
        pub = nil
        $stderr.puts i if i % 10000 == 0
        dblp.next
      when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
        # ignore for now
        dblp.next
      else
        # nothing
    end
  end
end
</code></pre>

<p>The key here is that <code>dblp.expand</code> reads an entire subtree (like an <code>&lt;article&gt;</code> record) and passes it as an argument to a factory for further processing. Is this the right approach?</p>

<p>Within the factory method I then use high-level XPath-like expressions to extract the content of elements, like below. Again, is this viable?</p>

<pre><code>def first(root, node)
  x = root.find(node).first
  x ? x.content : nil
end

pub.pages = first(node, 'pages') # node contains expanded node from dblp.expand
</code></pre>
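<p>For comparison, here is a minimal sketch of the record-by-record idea using an event-based (SAX-style) listener instead of expanding subtrees. Note that it swaps in Ruby's bundled REXML stream parser rather than libxml-ruby, purely so that it runs without extra gems. The names <code>RecordListener</code> and <code>each_record</code> are hypothetical and not part of any library. The listener keeps only the record currently being read and hands it to a block as soon as its closing tag arrives. Whether overall memory stays flat also depends on how the parser buffers its input, which this sketch does not measure.</p>

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical helper: gathers the direct child fields of one record at a
# time and passes each finished record to the given block.
class RecordListener
  include REXML::StreamListener

  # Record types to keep; every other top-level element is skipped.
  RECORD_TAGS = %w[article inproceedings book].freeze

  def initialize(&on_record)
    @on_record = on_record
    @depth = 0      # 0 = outside a record, 1 = in a record, 2+ = in a field
    @record = nil   # fields of the record currently being read
    @field = nil    # name of the field element currently open
  end

  def tag_start(name, _attrs)
    if @depth.zero?
      return unless RECORD_TAGS.include?(name)
      @record = { type: name }
    elsif @depth == 1
      # Fields may repeat (e.g. several authors), so store values in arrays.
      @field = name
      (@record[name] ||= []) << String.new
    end
    @depth += 1
  end

  def text(data)
    # Text inside nested markup (e.g. <i> within a title) joins the open field.
    @record[@field].last << data if @field
  end

  def tag_end(_name)
    return if @depth.zero?
    @depth -= 1
    if @depth == 1
      @field = nil
    elsif @depth.zero?
      @on_record.call(@record)
      @record = nil   # drop the reference so the record can be collected
    end
  end
end

# Hypothetical convenience wrapper around REXML's stream parser.
def each_record(io, &block)
  REXML::Document.parse_stream(io, RecordListener.new(&block))
end

sample = StringIO.new(<<~XML)
  <dblp>
    <article><author>A. Author</author><title>First paper</title><pages>1-10</pages></article>
    <proceedings><title>Ignored for now</title></proceedings>
  </dblp>
XML

each_record(sample) do |rec|
  pages = rec['pages']&.first   # plays the role of first(node, 'pages')
  puts "#{rec[:type]}: #{rec['title']&.first} (pages #{pages})"
end
```

<p>For the real file, the same call would take the opened <code>dblp.xml</code> handle in place of the <code>StringIO</code>. The trade-off against <code>expand</code> is that no subtree is built, so XPath-style lookups are replaced by plain hash access on the collected fields.</p>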