<blockquote> <ul> <li><strong><a href="http://code.google.com/p/phpquery/" rel="nofollow noreferrer">phpQuery</a></strong> is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.</li> </ul> </blockquote>
<hr>
<p><strong>UPDATE 2</strong></p>
<blockquote> <ul>
<li><strong>DEMO:</strong> <a href="http://so.lucafilosofi.com/find-important-text-in-arbitrary-html-using-php/" rel="nofollow noreferrer">http://so.lucafilosofi.com/find-important-text-in-arbitrary-html-using-php/</a></li>
<li><em>tested on a random list of blogs taken from</em> <a href="http://technorati.com/blogs/top100" rel="nofollow noreferrer">Technorati Top 100</a> <em>and</em> <a href="http://www.time.com/time/specials/packages/completelist/0,29569,1999770,00.html" rel="nofollow noreferrer">Best Blogs of 2010</a></li>
</ul> </blockquote>
<ol>
<li>many blogs make use of a <a href="http://php.opensourcecms.com/" rel="nofollow noreferrer">CMS</a>;</li>
<li>blog HTML structure is the same most of the time;</li>
<li>avoid common selectors like <code>#sidebar, #header, #footer, #comments</code>, etc.;</li>
<li>avoid any widget by tag name: <code>script, iframe</code>;</li>
<li>clear out well-known content with patterns like: <ol>
<li><code>/\d+\scomment(?:[s])/im</code></li>
<li><code>/(read the rest|read more).*/im</code></li>
<li><code>/(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im</code></li>
<li><code>/[^a-z0-9]+/im</code></li>
</ol></li>
</ol>
<hr>
<p>search for well-known classes and ids:</p>
<ul>
<li>typepad.com <code>.entry-content</code></li>
<li>wordpress.org <code>.post-entry .entry .post</code></li>
<li>movabletype.com <code>.post</code></li>
<li>blogger.com <code>.post-body .entry-content</code></li>
<li>drupal.com <code>.content</code></li>
<li>tumblr.com <code>.post</code></li>
<li>squarespace.com <code>.journal-entry-text</code></li>
<li>expressionengine.com <code>.entry</code></li>
<li><p>gawker.com <code>.post-body</code></p></li>
<li><p><strong>Ref:</strong> <a href="http://royal.pingdom.com/2009/01/15/the-blog-platforms-of-choice-among-the-top-100-blogs/" rel="nofollow noreferrer">The blog platforms of choice among the top 100 blogs</a></p></li> </ul>
<hr>
<pre><code>$selectors = array('.post-body','.post','.journal-entry-text','.entry-content','.content');
// find() takes a single selector string, so join the candidates with commas
$doc = phpQuery::newDocumentFile('http://blog.com')-&gt;find(implode(',', $selectors))-&gt;children('p,div');
</code></pre>
<hr>
<p>search based on a common HTML structure that looks like this:</p>
<pre><code>&lt;div&gt;
  &lt;h1|h2|h3|h4|a /&gt;
  &lt;p|div /&gt;
&lt;/div&gt;
</code></pre>
<hr>
<pre><code>$doc = phpQuery::newDocumentFile('http://blog.com')-&gt;find('h1,h2,h3,h4')-&gt;parent()-&gt;children('p,div');
</code></pre>
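<hr>
<p>The clean-up patterns listed above could be applied to the extracted text roughly as follows. This is only a minimal sketch using plain <code>preg_replace</code>: the helper name <code>strip_boilerplate</code> and the sample string are hypothetical, and replacing non-alphanumeric runs with a single space (rather than deleting them) is an assumption about how the last pattern is meant to be used.</p>
<pre><code>// Hypothetical helper (not part of phpQuery): removes common blog boilerplate
// from already-extracted text using the patterns listed in the answer.
function strip_boilerplate($text)
{
    $patterns = array(
        '/\d+\scomment(?:[s])/im',                              // e.g. "12 comments"
        '/(read the rest|read more).*/im',                      // "Read more ..." up to end of line
        '/(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im',  // e.g. "Posted by X at 10:30 am"
    );
    // preg_replace applies the patterns one after another
    $text = preg_replace($patterns, '', $text);

    // Assumption: collapse runs of non-alphanumeric characters into one space
    // instead of deleting them, so that words stay separated.
    $text = preg_replace('/[^a-z0-9]+/im', ' ', $text);

    return trim($text);
}

$sample = "Parsing HTML with PHP\nPosted by Jane at 10:30 am\n12 comments\nRead more about this topic";
echo strip_boilerplate($sample); // prints "Parsing HTML with PHP"
</code></pre>
<p>Note that the third pattern is greedy and removes the whole matching stretch of a line, so it may also strip legitimate sentences that happen to contain "by" or "post" followed later by " at", " am" or " pm".</p>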
 
