    <p>I am currently writing a web extraction framework. I have 524 tests that get data from 250 websites using XPath. Initially the framework used an HTML parser, HtmlCleaner, but I am currently investigating <a href="http://docs.seleniumhq.org/" rel="noreferrer">Selenium</a> because I want JavaScript support. I have run the tests against the HtmlUnit, Chrome, Firefox and <a href="https://github.com/detro/ghostdriver" rel="noreferrer">PhantomJS drivers</a>. Here is a comparison of the time taken and the number of failures for each approach:</p>
    <pre><code>                 Failures    Time (secs)
HtmlCleaner             0             82
HtmlUnit              169            102
Google Chrome          38            562
Firefox                46           1159
PhantomJS              40            575
</code></pre>
    <p>Some comments:</p>
    <ul>
    <li><p>In some cases the "failures" may not be failures at all: the extractors may be failing because JavaScript is rewriting the DOM. I am in the process of analyzing the failures to find the cause.</p></li>
    <li><p>That said, HtmlUnit is the fastest Selenium driver, but it is also unreliable. The unreliability does not only concern JavaScript; there are also problems processing "messy, dirty, real-world" HTML, because something seems to be broken in the tag balancing algorithm. A couple of issues have been raised about this but they have not been fixed; see <a href="https://sourceforge.net/p/htmlunit/bugs/1423/" rel="noreferrer">HTML-UNIT 1423</a> and <a href="https://sourceforge.net/p/htmlunit/bugs/1046/" rel="noreferrer">HTML-UNIT 1046</a>.</p></li>
    <li><p>Firefox is the slowest Selenium driver, even though I am disabling image loading and stylesheets. It takes the longest to load and initialize, which makes it considerably slower than Chrome, and every time an extraction fails I need to reload the driver. (In the tests I create a pool of 5 drivers to mitigate the URL retrieval delays for all the Selenium web drivers.)</p></li>
    <li><p>PhantomJS is more accurate than Firefox (40 failures against 46) and only slightly less accurate than Chrome (38 failures), and it runs in around half the time of Firefox. What is more, I can run it on my dev box: it does not "take over my machine" by launching multiple browsers, so I can get on with work.</p></li>
    </ul>
    <p>I would highly recommend PhantomJS.</p>
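    <p>The driver pool mentioned above could look roughly like the following. This is only a minimal sketch, not the code used in the tests: the class name <code>DriverPool</code> and its methods are hypothetical, and it is written against a generic driver type so that it uses nothing but the JDK. With Selenium, the factory would presumably create a <code>WebDriver</code> and the disposer would call its <code>quit()</code> method.</p>

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool that replaces a driver whenever a task fails. */
final class DriverPool<D> {
    private final BlockingQueue<D> idle;
    private final Supplier<D> factory;   // creates a fresh driver (assumed to be slow)
    private final Consumer<D> disposer;  // shuts down a failed driver

    DriverPool(int size, Supplier<D> factory, Consumer<D> disposer) {
        this.factory = factory;
        this.disposer = disposer;
        this.idle = new ArrayBlockingQueue<>(size);
        // Pay the start-up cost once, up front, for every pooled driver.
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Borrows a driver, runs the task, and returns the driver to the pool. */
    <R> R withDriver(Function<D, R> task) {
        D driver;
        try {
            driver = idle.take(); // blocks until a driver is free
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a driver", e);
        }
        try {
            R result = task.apply(driver);
            idle.add(driver); // healthy driver goes back into the pool
            return result;
        } catch (RuntimeException e) {
            // The extraction failed: discard this driver and put a fresh one in its place.
            disposer.accept(driver);
            idle.add(factory.get());
            throw e;
        }
    }

    /** Number of drivers currently waiting in the pool. */
    int idleCount() {
        return idle.size();
    }
}
```

    <p>A pool of five would then be created with something like <code>new DriverPool&lt;&gt;(5, factory, disposer)</code>, and each test would wrap its extraction in <code>withDriver</code>. Replacing a driver only after a failure keeps the start-up cost off the common path, which matters most for the drivers that are slow to initialize.</p>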