HtmlUnit and XPath: DOMNode.getByXPath only works on HtmlPage?
<p>I'm trying to parse <a href="http://living.scotsman.com/sectionhome.aspx?sectionID=7063" rel="nofollow noreferrer">a page</a> with links to articles whose important content looks like this:</p>
<pre><code>&lt;div class="article"&gt;
  &lt;h1 style="float: none;"&gt;&lt;a href="performing-arts"&gt;Performing Arts&lt;/a&gt;&lt;/h1&gt;
  &lt;a href="/performing-arts/EIF-theatre-review-Sin-Sangre.6517348.jp"&gt;
    &lt;span class="mth3"&gt;
      &lt;span id="wctlMiniTemplate1_ctl00_ctl00_ctl01_WctlPremiumContentIcon1"&gt; &lt;/span&gt;
      EIF theatre review: Sin Sangre | The Man Who Fed Butterflies | Caledonia | Songs Of Ascension | Vieux Carré | The Gospel At Colonus
    &lt;/span&gt;
    &lt;span class="mtp"&gt;The EIF&amp;#39;s theatre programme wasn&amp;#39;t as far-reaching as it could have been, but did find an exoticism in the familiar, writes Mark Fisher &lt;/span&gt;
  &lt;/a&gt;
&lt;/div&gt;
</code></pre>
<p>Here is a minimal scraping case in Java using HtmlUnit and XPath (imports removed for brevity):</p>
<pre><code>public class MinimalTest {

    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        client.setJavaScriptEnabled(false);
        client.setCssEnabled(false);

        System.out.println("Fetching front page");
        HtmlPage frontPage = client.getPage("http://living.scotsman.com/sectionhome.aspx?sectionID=7063");

        List&lt;ArticleInfo&gt; articleInfos = extractArticleInfo(frontPage);
        for (ArticleInfo info : articleInfos) {
            System.out.println("Title: " + info.getTitle());
            System.out.println("Intro: " + info.getFirstPara());
            System.out.println("Link: " + info.getLink());
        }
    }

    @SuppressWarnings("unchecked") // xpath returns List&lt;?&gt;
    private static List&lt;ArticleInfo&gt; extractArticleInfo(HtmlPage frontPage) {
        System.out.println("Extracting article links");
        List&lt;HtmlDivision&gt; articleDivs = (List&lt;HtmlDivision&gt;) frontPage.getByXPath("//div[@class='article']");
        System.out.println(String.format("Found %d articles", articleDivs.size()));

        List&lt;ArticleInfo&gt; articleLinks = new ArrayList&lt;ArticleInfo&gt;(articleDivs.size());
        for (HtmlDivision div : articleDivs) {
            articleLinks.add(ArticleInfo.constructFromArticleDiv(div));
        }
        return articleLinks;
    }

    private static class ArticleInfo {
        private final String title;
        private final String link;
        private final String firstPara;

        public ArticleInfo(final String link, final String title, final String firstPara) {
            this.link = link;
            this.title = title;
            this.firstPara = firstPara;
        }

        public static ArticleInfo constructFromArticleDiv(final HtmlDivision div) {
            String link = ((DomText) div.getFirstByXPath("//a/@href/text()")).asText();
            String title = ((DomText) div.getFirstByXPath("//span[@class='mth3']/text()")).asText();
            String firstPara = ((DomText) div.getFirstByXPath("//span[@class='mtp']/text()")).asText();
            return new ArticleInfo(link, title, firstPara);
        }

        public String getTitle() { return title; }

        public String getFirstPara() { return firstPara; }

        public String getLink() { return link; }
    }
}
</code></pre>
<p>Output I expect:</p>
<pre><code>Title: EIF theatre review: Sin Sangre | The Man Who Fed Butterflies | Caledonia | Songs Of Ascension | Vieux Carré | The Gospel At Colonus
Intro: The EIF's theatre programme wasn't as far-reaching as it could have been, but did find an exoticism in the familiar, writes Mark Fisher
Link: http://living.scotsman.com/performing-arts/EIF-theatre-review-Sin-Sangre.6517348.jp
</code></pre>
<p>What I get:</p>
<pre><code>Fetching front page
Extracting article links
Found 24 articles
Exception in thread "main" java.lang.NullPointerException
    at com.allthefestivals.app.crawler.MinimalTest$ArticleInfo.constructFromArticleDiv(MinimalTest.java:68)
    at com.allthefestivals.app.crawler.MinimalTest.extractArticleInfo(MinimalTest.java:50)
    at com.allthefestivals.app.crawler.MinimalTest.main(MinimalTest.java:30)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)
</code></pre>
<p>Calling <code>getByXPath</code> works fine on a <code>HtmlPage</code> but seems to return nothing on any other <code>HtmlElement</code>. What's wrong? Is this a bug or implementation gap in HtmlUnit, or am I missing something subtle about XPath syntax?</p>
<p>Related question whose solution didn't work for me: <a href="https://stackoverflow.com/questions/2980792/xpath-relative-to-given-element-in-htmlunit-groovy">XPath _relative_ to given element in HTMLUnit/Groovy?</a></p>
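On the "something subtle about XPath syntax" part of the question, the sketch below may help narrow things down. It does not use HtmlUnit at all: it runs the same kinds of expressions through the JDK's built-in `javax.xml.xpath` engine against a small made-up document, on the assumption that HtmlUnit follows the same standard XPath 1.0 rules. The class name `RelativeXPathSketch`, the helper `evalOnSecondDiv`, and the two-article markup are hypothetical and exist only for illustration. It shows two things: a path beginning with `//` is resolved from the document root even when it is evaluated on an element, whereas `.//` starts at that element; and `@href/text()` matches nothing, because attribute nodes have no text-node children.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPathSketch {

    // Hypothetical two-article document standing in for the real page.
    static final String XML =
        "<root>"
        + "<div class='article'><a href='/first.jp'><span class='mth3'>First</span></a></div>"
        + "<div class='article'><a href='/second.jp'><span class='mth3'>Second</span></a></div>"
        + "</root>";

    /**
     * Evaluates expr with the SECOND article div as the context node and
     * returns the string value of the first match ("" when nothing matches).
     */
    public static String evalOnSecondDiv(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList divs = (NodeList) xpath.evaluate(
                "//div[@class='article']", doc, XPathConstants.NODESET);
            Node secondDiv = divs.item(1);
            return xpath.evaluate(expr, secondDiv);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "//" ignores the context node and searches from the document root.
        System.out.println(evalOnSecondDiv("//span[@class='mth3']"));  // First
        // ".//" searches only the descendants of the context node.
        System.out.println(evalOnSecondDiv(".//span[@class='mth3']")); // Second
        // Selecting the attribute node itself works.
        System.out.println(evalOnSecondDiv(".//a/@href"));             // /second.jp
        // An attribute has no text() children, so this matches nothing.
        System.out.println(evalOnSecondDiv(".//a/@href/text()").isEmpty()); // true
    }
}
```

If HtmlUnit behaves the same way, the `NullPointerException` would come from `//a/@href/text()` matching no node, so `getFirstByXPath` returns `null` before `asText()` is called. Selecting `.//a/@href` and reading the attribute's value would be one thing to try; the exact node type HtmlUnit returns for an attribute match should be checked against its Javadoc.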