In R, how to parse a specific frame within a webpage?
<p>Greetings all,</p>

<p>Is there a way to read only the HTML code from a specific frame within a webpage?</p>

<p>For example, if I submit a URL to Google Translate, is there a way to parse only the translated page frame? Whenever I try, I can access only the top frame on the page, not the translated frame. Here is my self-contained sample code:</p>

<pre><code>library(XML)
url &lt;- "http://www.baidu.com/s?wd=r+project"
url.google.translate &lt;- URLencode(paste("http://translate.google.com/translate?js=y&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=1&amp;eotf=1&amp;sl=zh-CN&amp;tl=en&amp;u=", url, sep=""))
htmlTreeParse(url.google.translate, useInternalNodes = FALSE)
</code></pre>

<p>The above code refers to this URL:</p>

<pre><code>$file
[1] "http://translate.google.com/translate?js=y&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=1&amp;eotf=1&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s?wd=r+project"
</code></pre>

<p>The output, however, only accesses the top frame of the page and not the main frame, which is the one I am interested in.</p>

<p>Hope that made sense, and thanks in advance for any help.</p>

<p>Tony</p>

<p><strong>UPDATE - Thanks to the answer from @kwantam below (accepted), I was able to use it to get my solution as follows (self-contained):</strong></p>

<pre><code>&gt; # Load R packages
&gt; library(RCurl)
&gt; library(XML)
&gt;
&gt; # STAGE 1 - find forward url in relevant frame
&gt; ( url &lt;- "http://www.baidu.com/s?wd=r+project" )
[1] "http://www.baidu.com/s?wd=r+project"
&gt; gt.url &lt;- URLencode(paste("http://translate.google.com/translate?js=y&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=1&amp;eotf=1&amp;sl=zh-CN&amp;tl=en&amp;u=", url, sep=""))
&gt; gt.doc &lt;- getURL(gt.url)
&gt; gt.html &lt;- htmlTreeParse(gt.doc, useInternalNodes = TRUE, error=function(...){})
&gt; nodes &lt;- getNodeSet(gt.html, '//frameset//frame[@name="c"]')
&gt; gt.parameters &lt;- sapply(nodes, function(x) x &lt;- xmlAttrs(x)[[1]])
&gt; gt.url &lt;- paste("http://translate.google.com", gt.parameters, sep = "")
&gt;
&gt; # STAGE 2 - find forward url to translated page
&gt; doc &lt;- getURL(gt.url, followlocation = TRUE)
&gt; html &lt;- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
&gt; url.trans &lt;- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
&gt; url.trans &lt;- strsplit(url.trans, "URL=", fixed = TRUE)[[1]][2]
&gt; url.trans &lt;- gsub("\"/&gt;", "", url.trans, fixed = TRUE)
&gt; url.trans &lt;- xmlValue(getNodeSet(htmlParse(url.trans, asText = TRUE), "//p")[[1]])
&gt;
&gt; # STAGE 3 - load translated page
&gt; url.trans
[1] "http://translate.googleusercontent.com/translate_c?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;rurl=translate.google.com&amp;usg=ALkJrhiCMu1mKv-czCmEaB7PO925TJCa-A "
&gt; #getURL(url.trans)
</code></pre>

<p><em><strong>If anyone knows of a simpler solution than the one I've given above, please feel free to let me know! :)</strong></em></p>
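<p>For anyone who wants to try the frame-selection step without a network connection, here is a minimal sketch. The frameset markup and the <code>src</code> paths are made up for illustration and are not what Google Translate actually returns. The only real point is that <code>xmlGetAttr</code> can read the <code>src</code> attribute by name, which may be a little more robust than taking the first attribute with <code>xmlAttrs(x)[[1]]</code>.</p>

```r
library(XML)

# Hypothetical frameset page, standing in for the text returned by getURL().
# The frame names mirror the question; the src values are invented.
frameset.html <- paste0(
  '<html><frameset rows="100,*">',
  '<frame name="top" src="/translate_n?u=example">',
  '<frame name="c" src="/translate_p?u=example">',
  '</frameset></html>'
)

# Parse the string itself rather than treating it as a file name or URL
doc <- htmlParse(frameset.html, asText = TRUE)

# Select the frame by its name and read its src attribute by name,
# instead of relying on src being the first attribute of the node
frame.src <- xpathSApply(doc, '//frame[@name="c"]', xmlGetAttr, "src")

# Build the absolute URL of the frame, as in STAGE 1 of the transcript
frame.url <- paste("http://translate.google.com", frame.src, sep = "")
frame.url
```

<p>With a real page, <code>frameset.html</code> would be replaced by the result of <code>getURL(gt.url)</code>, and the rest of the steps would stay the same, assuming the frame of interest is still named <code>c</code>.</p>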
 
