StackOverflow2013

Parsing HTML the DOM way
<p>We've got an ancient (internal) website with static info. We're going to replace it with something better, so I need to fetch all the info. I <em>used</em> to do this via regex, but lately I stumbled upon a few articles stating that using regex to parse info from HTML is <a href="http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html" rel="nofollow">inviting cthulhu to this realm</a>.</p>
<p>So I decided to learn a few new tricks, start over and do it the DOM way. The HTML part I need looks like this:</p>
<pre><code><table id="articles">
  <tr>
    <th>
      <a href='articles/aa123.html'><img src="/iamges/aa123.jpg" alt="some article"></a>
      <br />short description
    </th>
    <td>
      <table class='details'>
        <tr><th><a href='articles/aa123.html'>Some Article</a></th></tr>
        <tr><th>Type:</th><td>article type</td></tr>
        <tr><th>Price:</th><td>€ 99</td></tr>
        <tr><th>Manufacturer:</th><td>Some Company</td></tr>
        <tr><th>Warehouse:</th><td>x</td></tr>
      </table>
    </td>
  </tr>
</table>
</code></pre>
<p>And so far I've got this:</p>
<pre><code>$dom = new DOMDocument();
@$dom->loadHTMLFile($file);
$xpath = new DOMXPath($dom);
$query = "/html/body/table[@id='articles']//th"; // catch all TH's
$data = $xpath->evaluate($query);
</code></pre>
<p>And that's about where I get stuck. I know all the content of the returned TH's is in their childNodes, but I'm having a hard time getting the values. I need the URL of the details page and the value of the Price column.</p>
<p>How do I extract those?</p>
<p>So far I have come up with the following:</p>
<pre><code>$query = '//table[@class="details"]//td';
$data = $xpath->evaluate($query);
$c = $data->length;
for ($i = 0; $i < $c; $i++) {
    echo htmlentities($data->item($i)->nodeValue);
}
</code></pre>
<p>But this only displays the text values of the TD's. When the content is a link, it only shows the link text, not the URL.</p>
<p><strong>UPDATE</strong> Thanks to Fab's suggestion I managed to make some progress. I currently have the following:</p>
<pre><code>$tables = $xpath->query('//table[@class="details"]');
foreach ($tables as $table) {
    $url = $xpath->evaluate('//th/a/@href', $table);
    $articleName = $xpath->evaluate('//th/a', $table);
    $Manufacturer = $xpath->evaluate('//th[text()="Manufacturer:"]/../td', $table);
    echo 'articleName:' . $articleName . ' <br />';
    echo 'Manufacturer:' . $Manufacturer . ' <br />';
    echo 'url:' . $url . ' <br />';
    echo '<br />';
}
</code></pre>
<p>But for some reason it always displays the data from the first article (repeated for as many articles as there are on the page), as if the <code>foreach</code> statement always returned the first table found. Any tips?</p>
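<p>A possible explanation for the repeated first article, offered as a sketch rather than a confirmed fix: in XPath, an expression that starts with <code>//</code> is evaluated from the document root even when a context node is passed as the second argument, so every pass through the loop searches the whole page again. Prefixing the expression with <code>.</code> makes it relative to the current table, and wrapping it in <code>string(...)</code> makes <code>evaluate()</code> return plain text instead of a node list. In the example below, the inline <code>$html</code> sample, the second article (bb456, "Other Company") and the keys of <code>$articles</code> are assumptions made for illustration; they do not come from the real site.</p>
<pre><code><?php
// Sketch only: $html stands in for the real file, and the second article
// (bb456 / Other Company) is invented so the loop has two tables to visit.
$html = <<<'HTML'
<html><body>
<table id="articles">
  <tr><td>
    <table class="details">
      <tr><th><a href="articles/aa123.html">Some Article</a></th></tr>
      <tr><th>Price:</th><td>&euro; 99</td></tr>
      <tr><th>Manufacturer:</th><td>Some Company</td></tr>
    </table>
  </td></tr>
  <tr><td>
    <table class="details">
      <tr><th><a href="articles/bb456.html">Other Article</a></th></tr>
      <tr><th>Price:</th><td>&euro; 45</td></tr>
      <tr><th>Manufacturer:</th><td>Other Company</td></tr>
    </table>
  </td></tr>
</table>
</body></html>
HTML;

libxml_use_internal_errors(true); // collect parse warnings instead of silencing with @
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$articles = array();
foreach ($xpath->query('//table[@class="details"]') as $table) {
    // The leading "." keeps each query inside the current table;
    // string(...) returns the text of the first matching node.
    $articles[] = array(
        'url'          => $xpath->evaluate('string(.//th/a/@href)', $table),
        'name'         => $xpath->evaluate('string(.//th/a)', $table),
        'price'        => $xpath->evaluate('string(.//tr[th="Price:"]/td)', $table),
        'manufacturer' => $xpath->evaluate('string(.//tr[th="Manufacturer:"]/td)', $table),
    );
}

foreach ($articles as $article) {
    echo $article['name'], ' | ', $article['url'], ' | ', $article['manufacturer'], "\n";
}
</code></pre>
<p>Selecting the row by its header cell (<code>tr[th="Price:"]</code>) avoids relying on row order. For the real page, <code>$dom->loadHTMLFile($file)</code> would presumably replace the <code>loadHTML($html)</code> call.</p>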
