<p>Using regex to parse HTML is probably not the best way to go.</p>
<p>You might take a look at <a href="http://php.net/manual/en/domdocument.loadhtml.php" rel="nofollow noreferrer">DOMDocument::loadHTML</a>, which will allow you to work with an HTML document using DOM methods (and XPath queries, for instance, if you know those).</p>
<p>You might also want to take a look at <a href="http://framework.zend.com/manual/en/zend.dom.html" rel="nofollow noreferrer"><code>Zend_Dom</code></a> and <a href="http://framework.zend.com/manual/en/zend.dom.query.html" rel="nofollow noreferrer"><code>Zend_Dom_Query</code></a>, btw, which are quite nice if you can use some parts of Zend Framework in your application. <br> They are used to fetch data from HTML pages when doing functional testing with <a href="http://framework.zend.com/manual/en/zend.test.html" rel="nofollow noreferrer"><code>Zend_Test</code></a>, for instance -- and work quite well ;-)</p>
<p>It may seem harder at first... 
But, considering the mess some HTML pages are, it is probably a much wiser idea...</p>
<hr>
<p><strong>EDIT after the comment and the edit of the OP</strong></p>
<p>Here are a couple of thoughts about, to begin with something "simple", an input tag:</p>
<ul>
<li>it can spread across several lines</li>
<li>it can have many attributes</li>
<li>considering only name and value are of interest to you, you have to deal with the fact that those two can be in any possible order</li>
<li>attributes can have double quotes, single quotes, or even nothing around their values</li>
<li>tags / attributes can be either lower-case or upper-case</li>
<li>tags don't always have to be closed</li>
</ul>
<p>Well, some of those points are not valid HTML, but they still work in the most common web browsers, so they have to be taken into account...</p>
<p>With only those points, I wouldn't like to be the one writing the regex ^^ <br>But I suppose there might be other difficulties I didn't think about.</p>
<p><br> On the other side, you have DOM and XPath... To get the value of an input name="q" (example is <a href="http://www.google.fr/search?q=test&amp;ie=utf-8&amp;oe=utf-8&amp;aq=t&amp;rls=com.ubuntu:en-US:unofficial&amp;client=firefox-a" rel="nofollow noreferrer">this page</a>), it's a matter of something like this:</p>
<pre><code>$url = 'http://www.google.fr/search?q=test&amp;ie=utf-8&amp;oe=utf-8&amp;aq=t&amp;rls=com.ubuntu:en-US:unofficial&amp;client=firefox-a';
$html = file_get_contents($url);

$dom = new DOMDocument();
if (@$dom-&gt;loadHTML($html)) {
    // yep, not necessarily valid-html...
    $xpath = new DOMXPath($dom);
    $nodeList = $xpath-&gt;query('//input[@name="q"]');
    if ($nodeList-&gt;length &gt; 0) {
        for ($i=0 ; $i&lt;$nodeList-&gt;length ; $i++) {
            $node = $nodeList-&gt;item($i);
            var_dump($node-&gt;getAttribute('value'));
        }
    }
} else {
    // too bad...
}
</code></pre>
<p>What matters here? The XPath query, and only that... And is there anything static/constant in it? 
<br>Well, I say I want all <code>&lt;input&gt;</code> tags that have a <code>name</code> attribute equal to "<code>q</code>". <br>And it just works: I'm getting this result:</p>
<pre><code>string 'test' (length=4)
string 'test' (length=4)
</code></pre>
<p><em>(I checked: there are two input name="q" on the page ^^ )</em></p>
<p>Do I know the structure of the page? Absolutely not ;-) <br>I just know I/you/we want input tags named q ;-)</p>
<p>And that's what we get ;-)</p>
<hr>
<p><strong>EDIT 2: and a bit of fun with select and options:</strong></p>
<p>Well, just for fun, here's what I came up with for select and option:</p>
<pre><code>$url = 'http://www.google.fr/language_tools?hl=fr';
$html = file_get_contents($url);

$dom = new DOMDocument();
if (@$dom-&gt;loadHTML($html)) {
    // yep, not necessarily valid-html...
    $xpath = new DOMXPath($dom);
    $nodeListSelects = $xpath-&gt;query('//select');
    if ($nodeListSelects-&gt;length &gt; 0) {
        for ($i=0 ; $i&lt;$nodeListSelects-&gt;length ; $i++) {
            $nodeSelect = $nodeListSelects-&gt;item($i);
            $name = $nodeSelect-&gt;getAttribute('name');
            // We want options that are inside the current select
            $nodeListOptions = $xpath-&gt;query('option[@selected="selected"]', $nodeSelect);
            if ($nodeListOptions-&gt;length &gt; 0) {
                for ($j=0 ; $j&lt;$nodeListOptions-&gt;length ; $j++) {
                    $nodeOption = $nodeListOptions-&gt;item($j);
                    $value = $nodeOption-&gt;getAttribute('value');
                    var_dump("name='$name' =&gt; value='$value'");
                }
            }
        }
    }
} else {
    // too bad...
}
</code></pre>
<p>And I get as an output:</p>
<pre><code>string 'name='sl' =&gt; value='fr'' (length=23)
string 'name='tl' =&gt; value='en'' (length=23)
string 'name='sl' =&gt; value='en'' (length=23)
string 'name='tl' =&gt; value='fr'' (length=23)
string 'name='sl' =&gt; value='en'' (length=23)
string 'name='tl' =&gt; value='fr'' (length=23)
</code></pre>
<p>Which is what I expected.</p>
<p><br>Some explanations?</p>
<p>Well, first of all, I get all the select tags of the page, and keep their name in memory. <br>Then, for each one of those, I get the selected option tags that are its children (there's always only one, btw). <br>And here, I have the value.</p>
<p>A bit more complicated than the previous example... But still much easier than regex, I believe... Took me maybe 10 minutes, not more... And I still won't have the courage (madness?) to start thinking about some kind of mutant regex that would be able to do that :-D</p>
<p>Oh, and, as a sidenote: I still have no idea what the structure of the HTML document looks like: I have not even taken a single look at its source ^^</p>
<p><br> I hope this helps a bit more... <br>Who knows, maybe I'll convince you regexes are not a good idea when it comes to parsing HTML... maybe? <strong>;-)</strong></p>
<p>Still: have fun!</p>