StackOverflow2013

XPath query to grab text between different HTML tags
Body
<p>I am using R to screen scrape. I've grabbed a page and managed to find all the links that appear in a certain place on it (anchor tags nested within anchor tags that have a name attribute) using:</p>
<pre><code>links &lt;- xpathSApply(doc, "//a[@name]//a/@href") </code></pre>
<p>Now I have fetched the documents from those links with Curl, and I want to scrape a certain portion of text from each. The text always seems to start after a <code>&lt;p&gt;</code> tag (although there are other <code>&lt;p&gt;</code> tags within the text) and to end before the following markup:</p>
<pre><code>&lt;/pre&gt;&lt;hr&gt;Back to: &lt;a href="#TOP"&gt; </code></pre>
<p>I decided to grab all the text between <code>&lt;p&gt;</code> and <code>&lt;a href="#TOP"&gt;</code>, but I can't seem to nail the XPath query. So far I have got:</p>
<pre><code>text &lt;- xpathSApply(doc, '"/ //text()[preceding:://a/@href="#TOP"] and following::*//p') </code></pre>
<p>Could anyone point me in the right direction? There are quite a few XPath answers on Stack Overflow, but they don't always explain the answer, which makes it hard to adapt them for my own use.</p>
<p>Sample HTML:</p>
<pre><code>&lt;span ID="MSGHDR-CONTENT-TYPE-H-PRE"&gt;Content-type:&lt;/b&gt;&lt;/span&gt; &lt;span ID="MSGHDR-CONTENT- TYPE-PRE"&gt;text/plain; charset=us-ascii&lt;/span&gt; &lt;/span&gt;&lt;p&gt; lots and lots of text here that I want &lt;/pre&gt;&lt;hr&gt;Back to: &lt;a href="#TOP"&gt;Top of message&lt;/a&gt; &amp;#124; &lt;a href="/cgi-bin/wa?A1=ind9709&amp;L=cybcom&amp;D=0"&gt;Previous page&lt;/a&gt; &amp;#124; &lt;a href="/cgi-bin/wa?A0=cybcom&amp;D=0"&gt;Main CYBCOM page&lt;/a&gt;&lt;p&gt; </code></pre>
