How to parse a huge XML file (on the go) using Python
<p>I have a huge XML file (the current <a href="http://dumps.wikimedia.org/" rel="nofollow">wikipedia dump</a>). At about 45 GB, it holds the entire content of the current Wikipedia. The first few lines of the file are (output of more):</p> <pre><code>&lt;mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en"&gt;
  &lt;siteinfo&gt;
    &lt;sitename&gt;Wikipedia&lt;/sitename&gt;
    &lt;base&gt;http://en.wikipedia.org/wiki/Main_Page&lt;/base&gt;
    &lt;generator&gt;MediaWiki 1.21wmf6&lt;/generator&gt;
    &lt;case&gt;first-letter&lt;/case&gt;
    &lt;namespaces&gt;
      &lt;namespace key="-2" case="first-letter"&gt;Media&lt;/namespace&gt;
      &lt;namespace key="-1" case="first-letter"&gt;Special&lt;/namespace&gt;
      &lt;namespace key="0" case="first-letter" /&gt;
      &lt;namespace key="1" case="first-letter"&gt;Talk&lt;/namespace&gt;
      &lt;namespace key="2" case="first-letter"&gt;User&lt;/namespace&gt;
      &lt;namespace key="3" case="first-letter"&gt;User talk&lt;/namespace&gt;
      &lt;namespace key="4" case="first-letter"&gt;Wikipedia&lt;/namespace&gt;
      &lt;namespace key="5" case="first-letter"&gt;Wikipedia talk&lt;/namespace&gt;
      &lt;namespace key="6" case="first-letter"&gt;File&lt;/namespace&gt;
      &lt;namespace key="7" case="first-letter"&gt;File talk&lt;/namespace&gt;
      &lt;namespace key="8" case="first-letter"&gt;MediaWiki&lt;/namespace&gt;
      &lt;namespace key="9" case="first-letter"&gt;MediaWiki talk&lt;/namespace&gt;
      &lt;namespace key="10" case="first-letter"&gt;Template&lt;/namespace&gt;
      &lt;namespace key="11" case="first-letter"&gt;Template talk&lt;/namespace&gt;
      &lt;namespace key="12" case="first-letter"&gt;Help&lt;/namespace&gt;
      &lt;namespace key="13" case="first-letter"&gt;Help talk&lt;/namespace&gt;
      &lt;namespace key="14" case="first-letter"&gt;Category&lt;/namespace&gt;
      &lt;namespace key="15" case="first-letter"&gt;Category talk&lt;/namespace&gt;
      &lt;namespace key="100" case="first-letter"&gt;Portal&lt;/namespace&gt;
      &lt;namespace key="101" case="first-letter"&gt;Portal talk&lt;/namespace&gt;
      &lt;namespace key="108" case="first-letter"&gt;Book&lt;/namespace&gt;
      &lt;namespace key="109" case="first-letter"&gt;Book talk&lt;/namespace&gt;
      &lt;namespace key="446" case="first-letter"&gt;Education Program&lt;/namespace&gt;
      &lt;namespace key="447" case="first-letter"&gt;Education Program talk&lt;/namespace&gt;
      &lt;namespace key="710" case="first-letter"&gt;TimedText&lt;/namespace&gt;
      &lt;namespace key="711" case="first-letter"&gt;TimedText talk&lt;/namespace&gt;
    &lt;/namespaces&gt;
  &lt;/siteinfo&gt;
  &lt;page&gt;
    &lt;title&gt;AccessibleComputing&lt;/title&gt;
    &lt;ns&gt;0&lt;/ns&gt;
    &lt;id&gt;10&lt;/id&gt;
    &lt;redirect title="Computer accessibility" /&gt;
    &lt;revision&gt;
      &lt;id&gt;381202555&lt;/id&gt;
      &lt;parentid&gt;381200179&lt;/parentid&gt;
      &lt;timestamp&gt;2010-08-26T22:38:36Z&lt;/timestamp&gt;
      &lt;contributor&gt;
        &lt;username&gt;OlEnglish&lt;/username&gt;
        &lt;id&gt;7181920&lt;/id&gt;
      &lt;/contributor&gt;
      &lt;minor /&gt;
      &lt;comment&gt;[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch&lt;/comment&gt;
      &lt;text xml:space="preserve"&gt;#REDIRECT [[Computer accessibility]] {{R from CamelCase}}&lt;/text&gt;
      &lt;sha1&gt;lo15ponaybcg2sf49sstw9gdjmdetnk&lt;/sha1&gt;
      &lt;model&gt;wikitext&lt;/model&gt;
</code></pre> <p>...and so on.</p> <p>Notice the <strong>page</strong> element in the tree. Each one corresponds to a unique page in Wikipedia, and the XML as a whole consists of all Wikipedia pages in the form of page elements. I need to write a parser that extracts the value of the title element from every page and, for simplicity, prints it.</p> <p>I am trying to build this in Python (although I am open to switching languages if that offers a solution). The only way I know of is to use <a href="http://docs.python.org/2/library/xml.etree.elementtree.html" rel="nofollow">ElementTree</a>.</p> <p>However, calling parse('file.xml') requires the entire document to be parsed completely before any results can be output. I <em>know</em> that the whole XML consists of page elements, so I want the program to begin printing titles WHILE it is still parsing the rest of the file. Is that even possible? If so, how?</p> <p>EDIT Note: I use title extraction as the example here to keep the question simple. However, I do need real XML parsing features, since I will need them for similar extractions in the future.</p>
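<p>One possible direction, offered as a minimal sketch rather than a definitive answer: ElementTree also provides <code>iterparse</code>, which reports elements as they are completed instead of building the whole tree first. The function name <code>iter_titles</code> is hypothetical, and the namespace string is assumed to match the <code>xmlns</code> attribute in the header shown above; a dump with a different export version would need a different value.</p>

```python
import xml.etree.ElementTree as ET

# Assumption: namespace copied from the xmlns attribute of the dump header.
NS = "{http://www.mediawiki.org/xml/export-0.8/}"


def iter_titles(source):
    """Yield each page title as soon as its <page> element has been parsed.

    `source` may be a filename or a binary file object.
    """
    # "start" events are requested only so the root element can be captured;
    # "end" events fire once an element and all its children are complete.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of <mediawiki>

    for event, elem in context:
        if event == "end" and elem.tag == NS + "page":
            yield elem.findtext(NS + "title")
            # Drop already-processed children of the root so memory use
            # stays bounded instead of growing with the file size.
            root.clear()
```

<p>Under these assumptions, titles could be printed while parsing continues with <code>for title in iter_titles('file.xml'): print(title)</code>. Clearing the root after each page is the design choice that matters for a 45 GB file: without it, every parsed page would remain attached to the tree.</p>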
 
