HtmlUnit failing when it tries to open dead JavaScript links. Is there a way to tell it not to load specific URLs?
<p>I'm trying to do a little scraping on <a href="http://www.eci-polldaymonitoring.nic.in/psl/default.aspx" rel="nofollow">this site</a> to programmatically find polling info. I originally tried Python, which worked great for loading the site and navigating around the <code>aspx</code> forms, but I couldn't extract the embedded maps data (since no packages handle JavaScript as of yet). So I've opted to dust off my Java skills and break out HtmlUnit. However, I almost instantly hit a snag.</p>
<p>It appears that the site links to some JavaScript files that don't exist. When HtmlUnit tries to load them, it gets a 404 and aborts with an exception.</p>
<h3>Specific Error</h3>
<pre><code>Jul 21, 2013 9:51:22 PM com.gargoylesoftware.htmlunit.html.HtmlPage loadExternalJavaScriptFile
SEVERE: Error loading JavaScript from [http://www.eci-polldaymonitoring.nic.in/psl/GoogleMapForASPNet.ascx/jsdebug].
com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 404 Not Found for http://www.eci-polldaymonitoring.nic.in/psl/GoogleMapForASPNet.ascx/jsdebug
    at com.gargoylesoftware.htmlunit.WebClient.throwFailingHttpStatusCodeExceptionIfNecessary(WebClient.java:544)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadJavaScriptFromUrl(HtmlPage.java:1119)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1059)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:399)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:260)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:276)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:676)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:635)
    at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
    at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
    at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3074)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2041)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:892)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:241)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:187)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:268)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:156)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:434)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:309)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:374)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:359)
    at ScrapeTest$.main(ScrapeTest.scala:12)
    at ScrapeTest.main(ScrapeTest.scala)
</code></pre>
<p>Is there a way to tell it to either (a) ignore 404 errors completely, or (b) ignore specific JavaScript URLs?</p>
<h3>My Code thus far (Scala)</h3>
<pre><code>import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.html.HtmlPage

object ScrapeTest {
  def main(args: Array[String]): Unit = {
    val pageurl = "http://www.eci-polldaymonitoring.nic.in/psl/"
    val client = new WebClient(BrowserVersion.INTERNET_EXPLORER_8)
    var response: HtmlPage = client.getPage(pageurl)
    println(response.asText())
  }
}
</code></pre>
 
