Perl XPath statement with a conditional - is that possible?
<p>This question has been rephrased. I am using CPAN Perl modules <a href="http://search.cpan.org/~jesse/WWW-Mechanize-1.71/lib/WWW/Mechanize.pm#%24mech-%3Eback%28%29" rel="nofollow">WWW::Mechanize</a> to navigate a website, <a href="http://search.cpan.org/~mirod/HTML-TreeBuilder-XPath-0.14/lib/HTML/TreeBuilder/XPath.pm#findvalues_%28%24path%29" rel="nofollow">HTML::TreeBuilder::XPath</a> to capture the content and <a href="https://metacpan.org/module/Xacobeo" rel="nofollow">xacobeo</a> to test my XPath code on the HTML/XML. The goal is to call this Perl script from a PHP-based website and upload the scraped contents into a database. Therefore, if content is "missing" it still needs to be accounted for.</p>
<p>Below is a tested, reduced sample code depicting my challenge. Note:</p>
<ol>
<li>This page is dynamically filled and contains various <code>ITEMS</code> output for different stores; a different number of <code>Products*</code> will exist for each store. Each product listing may or may not have an itemized table underneath it.</li>
<li>The captured data has to be in arrays, and the association of any itemized list (if it exists) with its Product listing has to be maintained.</li>
</ol>
<p>The example XML below changes per store (as described above), but for brevity I only show one "type" of output. I realize that all data can be captured into one array and then regex used to decipher the content for the purpose of uploading it into a database. I am seeking a better knowledge of XPath to help streamline this (and future) solution(s).</p>
<pre><code>&lt;!DOCTYPE XHTML&gt;
&lt;table id="8jd9c_ITEMS"&gt;
  &lt;tr&gt;&lt;th style="color:red"&gt;The Products we have in stock!&lt;/th&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;td&gt;&lt;span id="Product_NUTS"&gt;We have nuts!&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;td&gt;
    &lt;!--Table may or may not exist --&gt;
    &lt;table&gt;
      &lt;tr&gt;&lt;td style="color:blue;text-indent:10px"&gt;Almonds&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td style="color:blue;text-indent:10px"&gt;Cashews&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;/tr&gt;
    &lt;/table&gt;
  &lt;/td&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;td&gt;&lt;span id="Product_VEGGIES"&gt;We have veggies!&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;td&gt;
    &lt;!--Table may or may not exist --&gt;
    &lt;table&gt;
      &lt;tr&gt;&lt;td style="color:blue;text-indent:10px"&gt;Carrots&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td style="color:blue;text-indent:10px"&gt;Celery&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;/tr&gt;
    &lt;/table&gt;
  &lt;/td&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;td&gt;&lt;span id="Product_ALCOHOL"&gt;We have booze!&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;
  &lt;!--In this case, the table does not exist --&gt;
&lt;/table&gt;
</code></pre>
<p>An XPath statement of:</p>
<pre><code>'//table[contains(@id, "ITEMS")]/tr[position() &gt;1]/td/span/text()'
</code></pre>
<p>would find:</p>
<pre><code>We have nuts!
We have veggies!
We have booze!
</code></pre>
<p>And an XPath statement of:</p>
<pre><code>'//table[contains(@id, "ITEMS")]/tr[position() &gt;1]/td/table/tr/td/text()'
</code></pre>
<p>would find:</p>
<pre><code>Almonds
Cashews
Carrots
Celery
</code></pre>
<p>The two XPath statements can be combined:</p>
<pre><code>'//table[contains(@id, "ITEMS")]/tr[position() &gt;1]/td/span/text() | //table[contains(@id, "ITEMS")]/tr[position() &gt;1]/td/table/tr/td/text()'
</code></pre>
<p>To find:</p>
<pre><code>We have nuts!
Almonds
Cashews
We have veggies!
Carrots
Celery
We have booze!
</code></pre>
<p>Again, the above array can be deciphered (in the real code) for its product-to-list association using regex. <strong>But can the array be built using XPath in a manner that would keep that association?</strong></p>
<p>For example (pseudo-speak, this does not work):</p>
<pre><code>'//table[contains(@id, "ITEMS")]/tr[position()&gt;1]/td/span/text()
 | if exists('//table[contains(@id, "ITEMS")]/tr[position() &gt;1]/td/table') then ("TableRef") else ("NoTable")
 | Save this result into @TableRef ('//table[contains(@id, "ITEMS")]/tr[position() &gt;1]/td/table/tr/td/text()')'
</code></pre>
<p>It is not possible to build multi-dimensional arrays (in the traditional sense) in Perl; see <a href="http://perldoc.perl.org/5.10.1/perlref.html" rel="nofollow">perldoc perlref</a>. But hopefully a solution similar to the above could create something like:</p>
<pre><code>@ITEMS[0] =&gt; We have nuts!
@ITEMS[1] =&gt; nutsREF      &lt;-- say, the last word of the span value + REF
@ITEMS[2] =&gt; We have veggies!
@ITEMS[3] =&gt; veggiesREF   &lt;-- say, the last word of the span value + REF
@ITEMS[4] =&gt; We have booze!
@ITEMS[5] =&gt; NoTable      &lt;-- value accounts for the missing info

@nutsREF[0] =&gt; Almonds
@nutsREF[1] =&gt; Cashews

@veggiesREF[0] =&gt; Carrots
@veggiesREF[1] =&gt; Celery
</code></pre>
<p>In the real code the Products are known, so <code>my @veggiesREF</code> and <code>my @nutsREF</code> can be defined in anticipation of the XPath output.</p>
<p>I realize the XPath if/then/else functionality is in the XPath 2.0 version. I am on an Ubuntu system and working locally, but I am still not clear on whether my apache2 server is using it or the 1.0 version. How do I check that?</p>
<p>Finally, if you can show how to call a Perl script from a PHP form submit AND how to pass back a Perl array to the calling PHP function, then that would go a long way to getting the bounty. :)</p>
<p>Thanks!</p>
<p><strong>FINAL EDIT:</strong></p>
<p>Comments immediately below this post were directed at an initial post that was too vague. The subsequent re-post (and bounty) was responded to by ikegami with a very creative use which solved the pseudo problem, but was proving difficult for me to grasp and reuse in my real application, which entails multiple uses on various HTML pages. In about the 18th comment in our dialog I finally discovered his meaning and use of ($cat), an undocumented Perl syntax that he used. For new readers, understanding that syntax makes it possible to understand (and reformat) his intelligent solution to the problem. His post certainly meets the basic requirements sought in the OP but does not use HTML::TreeBuilder::XPath to do it.</p>
<p>jpalecek uses HTML::TreeBuilder::XPath but does not place the captured data into arrays for passing back to a PHP function and uploading into a database.</p>
<p>I have learned from both responders and hope this post helps others who are new to Perl, like myself. Any final contributions would be greatly appreciated.</p>
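<p>For new readers, here is a minimal sketch of one way the product-to-list association might be kept with HTML::TreeBuilder::XPath. It is not either responder's solution. It assumes each itemized table sits in the row directly after its product's span, as in the sample above. The sub name <code>collect_items</code> and the hash keys <code>product</code> and <code>list</code> are hypothetical, chosen only for illustration. The rows are walked in document order with plain XPath 1.0 expressions, and the "does a table exist" decision is made in Perl rather than in XPath.</p>

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # CPAN module named in the question
use JSON::PP;                   # ships with Perl 5.14 and later

# Reduced sample (assumption): one product with an itemized table, one without.
my $html = <<'HTML';
<table id="8jd9c_ITEMS">
<tr><th>The Products we have in stock!</th></tr>
<tr><td><span id="Product_NUTS">We have nuts!</span></td></tr>
<tr><td><table>
<tr><td>Almonds</td></tr>
<tr><td>Cashews</td></tr>
<tr></tr>
</table></td></tr>
<tr><td><span id="Product_ALCOHOL">We have booze!</span></td></tr>
</table>
HTML

# Hypothetical helper: returns a reference to an array of hashes,
# one hash per product, each holding a reference to its item list.
sub collect_items {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);

    my @items;
    # Visit every row after the header row, in document order.
    for my $row ($tree->findnodes('//table[contains(@id, "ITEMS")]/tr[position() > 1]')) {
        if (my ($span) = $row->findnodes('./td/span')) {
            # A product row starts a new entry with an empty list.
            push @items, { product => $span->as_text, list => [] };
        }
        elsif (@items) {
            # A row holding a nested table belongs to the most recent product.
            push @{ $items[-1]{list} }, $row->findvalues('./td/table/tr/td');
        }
    }
    $tree->delete;   # release the parse tree
    return \@items;
}

# Print the nested structure as JSON so a non-Perl caller can read it.
print JSON::PP->new->canonical->encode(collect_items($html)), "\n";
```

<p>A product without a table simply keeps an empty <code>list</code>, so the missing information is still accounted for. On the PHP side, something along the lines of <code>json_decode(shell_exec('perl scrape.pl'), true)</code> could turn the printed JSON into a nested PHP array; the script name <code>scrape.pl</code> is hypothetical.</p>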
 
