<p>Your approach seems good enough. I would just give certain tags and attributes a weight and loop through them with XPath queries until I find something that exists and is not empty. Something like:</p>
<pre><code>i = 1 // XPath positions are 1-based
while ((//img)[i][@src])
    if ((//img)[i][@alt])
        caption[i] = alt
    else if ((//img)[i][@description])
        caption[i] = description
    else if ((//img)[i]/../p[1])
        caption[i] = p
    else
        caption[i] = (//title)
    i++
</code></pre>
<hr>
<p>A simple XPath example (function <a href="https://github.com/alixaxel/phunction/blob/c6c380e7eb5ec14d153b2942c743271364d355ce/phunction/HTML.php#L36" rel="nofollow">ported from my framework</a>):</p>
<pre><code>function ph_DOM($html, $xpath = null)
{
    // Already a SimpleXML object: optionally run the query and return
    if (is_object($html) === true)
    {
        if (isset($xpath) === true)
        {
            $html = $html-&gt;xpath($xpath);
        }

        return $html;
    }

    // Raw HTML string: parse it, then recurse with the SimpleXML object
    else if (is_string($html) === true)
    {
        $dom = new DOMDocument();

        // Collect libxml warnings internally instead of printing them
        if (libxml_use_internal_errors(true) === true)
        {
            libxml_clear_errors();
        }

        // mb_html_entities() is a helper from the framework linked above
        if ($dom-&gt;loadHTML(ph()-&gt;Text-&gt;Unicode-&gt;mb_html_entities($html)) === true)
        {
            return ph_DOM(simplexml_import_dom($dom), $xpath);
        }
    }

    return false;
}
</code></pre>
<p>And the actual usage:</p>
<pre><code>$html = file_get_contents('http://en.wikipedia.org/wiki/Photography');

print_r(ph_DOM($html, '//img')); // gets all images
print_r(ph_DOM($html, '//img[@src]')); // gets all images that have a src
print_r(ph_DOM($html, '//img[@src]/..')); // gets the parent element of each image that has a src
print_r(ph_DOM($html, '//img[@src]/../..')); // and so on...
print_r(ph_DOM($html, '//title')); // gets the title of the page
</code></pre>
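<p>The weighted fallback from the pseudocode could look roughly like the sketch below. It uses PHP's built-in <code>DOMDocument</code> and <code>DOMXPath</code> rather than the framework function, and the name <code>describeImages</code>, the candidate order, and the sample markup are assumptions made for illustration only:</p>

```php
<?php
// Hypothetical helper (not part of the framework above): maps each image src
// to the first non-empty candidate: alt, description, first sibling <p>, page title.
function describeImages(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $title = trim($xpath->evaluate('string(//title)')); // last-resort fallback

    // Queries relative to each <img>, ordered from highest to lowest weight.
    $candidates = ['string(@alt)', 'string(@description)', 'string(../p[1])'];

    $result = [];
    foreach ($xpath->query('//img[@src]') as $img) {
        $text = '';
        foreach ($candidates as $query) {
            $text = trim($xpath->evaluate($query, $img));
            if ($text !== '') {
                break; // first non-empty candidate wins
            }
        }
        $result[$img->getAttribute('src')] = ($text !== '') ? $text : $title;
    }

    return $result;
}

// Assumed sample markup, inlined so the sketch needs no network access.
$html = '<html><head><title>Photography</title></head><body>'
      . '<div><img src="a.png" alt="A camera"></div>'
      . '<div><img src="b.png"><p>A darkroom</p></div>'
      . '<div><img src="c.png" alt=""></div>'
      . '</body></html>';

print_r(describeImages($html));
```

<p>With this sample, <code>a.png</code> should resolve to its <code>alt</code> text, <code>b.png</code> to the neighbouring paragraph, and <code>c.png</code> to the page title, since its <code>alt</code> is empty. Reordering <code>$candidates</code> is one way to change the weights.</p>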