Get HTML links within a specified <table> using minidom
<p>I'm looking to use Python and xml.dom.minidom to get a list of links within a particular <code>&lt;table&gt;</code> specified by the table id. Based on some <a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454">excellent advice</a>, I'm trying to use the DOM instead of pattern matching.</p>
<pre><code>import urllib
import xml.dom.minidom

url = 'http://www.batstrading.com/market_data/shortsales'
page = xml.dom.minidom.parse(urllib.urlopen(url))
</code></pre>
<p>I can get all the links by tag name with <code>page.getElementsByTagName('a')</code>, but I cannot limit the links returned to those contained within the table with ID "monthly-short-sale". Using <code>getElementById</code> returns None.</p>
<p>Is this because the "monthly-short-sale" ID is not defined within the DTD? If so, what would be the best way to extract this information?</p>
<p>Here is the code that I'm currently using, which works, but sins against god:</p>
<pre><code>import urllib
import xml.dom.minidom
import datetime

url = 'http://www.batstrading.com/market_data/shortsales'

def getDownloadLink(alink, prefix = 'BATSsh'):
    """return (datetime.date, link) for the provided link if the link target starts with the data file prefix"""
    n = len(prefix)
    href = alink.getAttribute('href')
    if href.startswith(prefix) and (len(href) == 25):
        year = int(href[n:n+4])
        month = int(href[n+4:n+6])
        day = int(href[n+6:n+8])
        date = datetime.date(year, month, day)
        return (date, url + '/' + href)

page = xml.dom.minidom.parse(urllib.urlopen(url))
link = (getDownloadLink(a) for a in page.getElementsByTagName('a'))
link = dict(i for i in link if i is not None)
</code></pre>
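One possible way to scope the search, sketched under assumptions: instead of relying on getElementById, walk the table elements, compare each one's id attribute as a plain string, and then call getElementsByTagName('a') on the matching table only. The SAMPLE markup, the file names in it, and the helper name links_in_table are hypothetical stand-ins, not taken from the real page, and the sketch is written for Python 3 rather than the Python 2 used above.

```python
import xml.dom.minidom

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE = """<html><body>
<a href="/home">Home</a>
<table id="other"><tr><td><a href="ignored.zip">x</a></td></tr></table>
<table id="monthly-short-sale">
<tr><td><a href="file-a.zip">A</a></td></tr>
<tr><td><a href="file-b.zip">B</a></td></tr>
</table>
</body></html>"""


def links_in_table(doc, table_id):
    """Return the href of every <a> inside the first <table> whose id equals table_id."""
    for table in doc.getElementsByTagName('table'):
        # getAttribute returns '' when the attribute is absent, so this is safe.
        if table.getAttribute('id') == table_id:
            # Searching from the table element limits results to its descendants.
            return [a.getAttribute('href') for a in table.getElementsByTagName('a')]
    return []


doc = xml.dom.minidom.parseString(SAMPLE)
hrefs = links_in_table(doc, 'monthly-short-sale')
print(hrefs)
```

Because the id is compared as an ordinary attribute value, this does not depend on the document declaring the attribute as an ID type. It still requires the page to parse as well-formed XML, which real-world HTML often does not.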
 
