
Webcrawl with R
I have a problem I would like some help with. I need to write a piece of R code that can load a CSV file. The CSV file contains one column, named "Link", and each row holds a link from which the code needs to download the content and write it to a separate file. So far I have managed to find and modify the piece of code shown below (thanks to Christopher Gandrud and co-authors).

```r
library(foreign)
library(RCurl)

addresses <- read.csv(">>PATH TO CSV FILE<<")
for (i in addresses) full.text <- getURL(i)
text <- data.frame(full.text)
outpath <- ">>PATH TO SPECIFIED FOLDER<<"
x <- 1:nrow(text)
for (i in x) {
  write(as.character(text[i, 1]), file = paste(outpath, "/", i, ".txt", sep = ""))
}
```

The code actually works perfectly, BUT the problem is that I am overloading the server with requests, so after downloading the correct content from 100-150 links, the remaining files are just empty. I know for a fact that this is the problem, since I have tested it many times with a decreasing number of links: if I download only 100 links at a time there is no problem, but above 100 it starts failing. Nonetheless, I need to add a couple of things to this piece of code for it to become a good crawler for this particular task.

I have divided my problem in two, because solving the first part should fix the issue temporarily:

1. I want to use `Sys.sleep` after every 100 downloads, so the code fires requests for the first 100 links and then pauses for x seconds before it fires the next 100 requests, and so on (a sketch of this follows at the end of this post).

2. Having done that for all rows/links in my dataset/CSV file, I need to check each output file against two conditions: it cannot be empty, and it cannot contain a certain error message the server returns in some special cases. If either condition is true, the file name (link number) should be saved into a vector I can work with from there (a sketch of this also follows below).

Wow, this question suddenly got pretty long. I realize it is a big question and I am asking a lot. It is for my master's thesis, which is not really about R programming, but I need to download the content from a lot of websites to which I have been given access. After that I have to analyze the content, which is what my thesis is actually about. Any suggestions/comments are welcome.

---

Update: here is what I have tried:

```r
library(foreign)
library(RCurl)

addresses <- read.csv("~/Dropbox/Speciale/Mining/Input/Extract post - Dear Lego n(250).csv")
for (i in addresses) {
  if (i == 50) {
    print("Why wont this work?")
    Sys.sleep(10)
    print(i)
  } else {
    print(i)
  }
}
```

This just printed the whole list of links that were loaded in; the "Why wont this work?" message never appeared at i == 50. Instead I got this warning:

```
Warning message:
In if (i == 100) { :
  the condition has length > 1 and only the first element will be used
```

The rest of the script is the same as before:

```r
full.text <- getURL(i)
text <- data.frame(full.text)
outpath <- "~/Dropbox/Speciale/Mining/Output"
x <- 1:nrow(text)
for (i in x) {
  write(as.character(text[i, 1]), file = paste(outpath, "/", i, ".txt", sep = ""))
}
```

Are you able to help me more?
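About the warning: `addresses` is a data frame, and `for (i in addresses)` iterates over its *columns*, not its rows. On the first (and only) pass, `i` is therefore the entire Link column, so `i == 50` produces a vector of 250 logicals, and `if()` warns that only the first element will be used. That is why the sleep never fires and the whole list is printed in one go. Looping over row indices instead behaves as intended; a minimal sketch (assuming the column is named "Link", as described in the question):

```r
addresses <- read.csv("~/Dropbox/Speciale/Mining/Input/Extract post - Dear Lego n(250).csv",
                      stringsAsFactors = FALSE)
links <- addresses$Link      # the "Link" column as a plain character vector

for (i in seq_along(links)) {
  if (i %% 50 == 0) {        # true at i = 50, 100, 150, ... not just i = 50
    print("Pausing for 10 seconds...")
    Sys.sleep(10)
  }
  print(links[i])
}
```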
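Building on that, a sketch of point 1, the throttled downloader: fetch one link at a time and pause after every 100th request. The 30-second pause is an arbitrary choice standing in for the "x seconds" in the question, and the placeholder paths are kept from the original code:

```r
library(RCurl)

addresses <- read.csv(">>PATH TO CSV FILE<<", stringsAsFactors = FALSE)
links <- addresses$Link                    # column named "Link", per the question

full.text <- character(length(links))
for (i in seq_along(links)) {
  full.text[i] <- getURL(links[i])         # one request per link
  if (i %% 100 == 0) {
    cat("Fetched", i, "of", length(links), "links; pausing...\n")
    Sys.sleep(30)                          # x seconds; 30 is an arbitrary value
  }
}

outpath <- ">>PATH TO SPECIFIED FOLDER<<"
for (i in seq_along(full.text)) {
  write(full.text[i], file = paste(outpath, "/", i, ".txt", sep = ""))
}
```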
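For point 2, a sketch that scans the output folder afterwards and collects the link numbers of files that are empty or contain the server's error message. The error string below is a placeholder, since the question does not say what the actual message is:

```r
outpath <- ">>PATH TO SPECIFIED FOLDER<<"
error.string <- "ERROR MESSAGE FROM SERVER"   # placeholder: put the real message here

files <- list.files(outpath, pattern = "\\.txt$", full.names = TRUE)
bad <- character(0)

for (f in files) {
  content <- paste(readLines(f, warn = FALSE), collapse = "\n")
  if (nchar(content) == 0 || grepl(error.string, content, fixed = TRUE)) {
    bad <- c(bad, sub("\\.txt$", "", basename(f)))   # keep the link number
  }
}

bad   # vector of link numbers whose downloads need to be repeated
```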
 
