Search for links in webpage, follow them, and return pages with no links in R
I am writing an update to my [rNOMADS package](http://cran.r-project.org/web/packages/rNOMADS/index.html) to include all the models on the NOMADS web site. To do this, I must search the HTML directory tree for each model. I do not know beforehand how deep this tree is or how many branches it contains, so I am writing a simple web crawler that recursively searches the top page for links, follows each link, and returns the URLs of pages that have no more links. Such a page is the download page for model data. [Here is an example](http://nomads.ncep.noaa.gov/cgi-bin/filter_cmcens.pl) of a URL that must be searched.

I want to get the addresses of all web pages below this one. I have attempted this code:

```r
library(XML)

url <- "http://nomads.ncep.noaa.gov/cgi-bin/filter_cmcens.pl"

WebCrawler <- function(url) {
  doc   <- htmlParse(url)
  links <- xpathSApply(doc, "//a/@href")
  free(doc)
  if (is.null(links)) {
    # If there are no links, this is the page we want -- return it!
    return(url)
  } else {
    for (link in links) {
      # Call recursively on each link found
      print(link)
      return(WebCrawler(link))
    }
  }
}
```

However, I have not figured out a good way to return a list of all the "dead end" pages. As written, this code returns only one model page, not the whole list, because the `return()` inside the `for` loop exits the function after the first link. I could declare a global variable and save the URLs to it, but I am wondering if there is a better way to go about this. How should I construct this function so that it gives me a list of *every single page*?
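One common pattern for this is to let each recursive call return its own vector of dead-end URLs and concatenate the results, rather than returning from inside the loop. Below is a minimal sketch of that accumulation pattern. To keep it self-contained it uses a mock link structure (`site` and `GetLinks` are stand-ins invented for this example); in the real crawler, `GetLinks` would be replaced by the `htmlParse()`/`xpathSApply()` calls from the question.

```r
# Mock site: each page maps to the links it contains.
# A page with zero links is a "dead end" (a model download page).
site <- list(
  top   = c("a", "b"),
  a     = c("leaf1", "leaf2"),
  b     = c("leaf3"),
  leaf1 = character(0),
  leaf2 = character(0),
  leaf3 = character(0)
)

# Stand-in for the XPath query; swap in htmlParse()/xpathSApply() here.
GetLinks <- function(url) site[[url]]

WebCrawler <- function(url) {
  links <- GetLinks(url)
  if (length(links) == 0) {
    return(url)  # Dead end: return this page's URL
  }
  # Recurse on every link and flatten all results into one vector
  unlist(lapply(links, WebCrawler), use.names = FALSE)
}

WebCrawler("top")
# c("leaf1", "leaf2", "leaf3")
```

Because `lapply()` visits every link and `unlist()` flattens the nested results, nothing is lost between levels of recursion and no global variable is needed. Note that on a real site you would also want to track visited URLs to avoid cycles; this sketch assumes the link structure is a tree.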