<p>Tal,</p>
<p>You could use R and the <code>XML</code> package to do this, but (damn) that is some poorly formed HTML you are trying to parse. In fact, in most cases you would want to be using the <code>readHTMLTable()</code> function, <a href="https://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package">which is covered in this previous thread</a>.</p>
<p>Given this ugly HTML, however, we will have to use the <code>RCurl</code> package to pull the raw HTML and create some custom functions to parse it. This problem has two components:</p>
<ol>
<li>Get all of the genome URLs from the base webpage (<a href="http://gtrnadb.ucsc.edu/" rel="nofollow noreferrer">http://gtrnadb.ucsc.edu/</a>) using the <code>getURLContent()</code> function in the <code>RCurl</code> package and some regex magic :-)</li>
<li>Then take that list of URLs and scrape the data you are looking for, and then stick it into a <code>data.frame</code>.</li>
</ol>
<p>So, here goes...</p>
<pre><code>library(RCurl)

### 1) First task is to get all of the web links we will need ##
base_url&lt;-"http://gtrnadb.ucsc.edu/"
base_html&lt;-getURLContent(base_url)[[1]]
links&lt;-strsplit(base_html,"a href=")[[1]]

get_data_url&lt;-function(s) {
    u_split1&lt;-strsplit(s,"/")[[1]][1]
    u_split2&lt;-strsplit(u_split1,'\\"')[[1]][2]
    # Keep only names that contain an upper-case letter and no "#" anchor
    if (!is.na(u_split2) &amp;&amp; grepl("[[:upper:]]",u_split2) &amp;&amp; length(strsplit(u_split2,"#")[[1]])&lt;2) {
        return(u_split2)
    }
    return(NA)
}

# Extract only those elements that are relevant
genomes&lt;-unlist(lapply(links,get_data_url))
genomes&lt;-genomes[!is.na(genomes)]

### 2) Now, scrape the genome data from all of those URLs ###

# This requires two complementary functions that are designed specifically
# for the UCSC website. The first parses the data from a -structs.html page
# and the second collects that data in to a multi-dimensional list
parse_genomes&lt;-function(g) {
    g_split1&lt;-strsplit(g,"\n")[[1]]
    g_split1&lt;-g_split1[2:5]
    # Pull all of the data and stick it in a list
    g_split2&lt;-strsplit(g_split1[1],"\t")[[1]]
    ID&lt;-g_split2[1]                             # Sequence ID
    LEN&lt;-strsplit(g_split2[2],": ")[[1]][2]     # Length
    g_split3&lt;-strsplit(g_split1[2],"\t")[[1]]
    TYPE&lt;-strsplit(g_split3[1],": ")[[1]][2]    # Type
    AC&lt;-strsplit(g_split3[2],": ")[[1]][2]      # Anticodon
    SEQ&lt;-strsplit(g_split1[3],": ")[[1]][2]     # Sequence
    STR&lt;-strsplit(g_split1[4],": ")[[1]][2]     # String
    return(c(ID,LEN,TYPE,AC,SEQ,STR))
}

# This will be a high dimensional list with all of the data, you can then manipulate as you like
get_structs&lt;-function(u) {
    struct_url&lt;-paste(base_url,u,"/",u,"-structs.html",sep="")
    raw_data&lt;-getURLContent(struct_url)
    s_split1&lt;-strsplit(raw_data,"&lt;PRE&gt;")[[1]]
    all_data&lt;-s_split1[seq(3,length(s_split1))]
    data_list&lt;-lapply(all_data,parse_genomes)
    # Tag every parsed record with the genome name it came from
    for (d in seq_along(data_list)) {data_list[[d]]&lt;-append(data_list[[d]],u)}
    return(data_list)
}

# Collect data, manipulate, and create data frame (with slight cleaning)
# Limit to the first two genomes (Bdist &amp; Spurp), a full scrape will take a LONG time
genomes_list&lt;-lapply(genomes[1:2],get_structs)
# The recursive=FALSE saves a lot of work, now we can just do a straightforward manipulation
genomes_rows&lt;-unlist(genomes_list,recursive=FALSE)
genome_data&lt;-t(sapply(genomes_rows,rbind))
colnames(genome_data)&lt;-c("ID","LEN","TYPE","AC","SEQ","STR","NAME")
genome_data&lt;-as.data.frame(genome_data)
# Some malformed web pages produce bad rows, but we can remove them
genome_data&lt;-subset(genome_data,ID!="&lt;/PRE&gt;")
head(genome_data)
</code></pre>
<p>The resulting data frame contains seven columns related to each genome entry: ID, length, type, anticodon, sequence, string, and name. 
The name column contains the base genome, which was my best guess for data organization. Here is what it looks like:</p>
<pre><code>head(genome_data)
                                   ID   LEN TYPE                           AC                                                                       SEQ
1     Scaffold17302.trna1 (1426-1498) 73 bp  Ala     AGC at 34-36 (1459-1461) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTTTCCA
2   Scaffold20851.trna5 (43038-43110) 73 bp  Ala   AGC at 34-36 (43071-43073) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
3   Scaffold20851.trna8 (45975-46047) 73 bp  Ala   AGC at 34-36 (46008-46010) TGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
4     Scaffold17302.trna2 (2514-2586) 73 bp  Ala     AGC at 34-36 (2547-2549) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACAGGGATCGATGCCCGGGTTCTCCA
5 Scaffold51754.trna5 (253637-253565) 73 bp  Ala AGC at 34-36 (253604-253602) CGGGGGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTCCTCCA
6     Scaffold17302.trna4 (6027-6099) 73 bp  Ala     AGC at 34-36 (6060-6062) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGAGTTCTCCA
                                                                        STR  NAME
1 .&gt;&gt;&gt;&gt;&gt;&gt;..&gt;&gt;&gt;&gt;........&lt;&lt;&lt;&lt;.&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;.....&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;.. Spurp
2 .&gt;&gt;&gt;&gt;&gt;&gt;..&gt;&gt;&gt;&gt;........&lt;&lt;&lt;&lt;.&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;.....&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;.. Spurp
3 .&gt;&gt;&gt;&gt;&gt;&gt;..&gt;&gt;&gt;&gt;........&lt;&lt;&lt;&lt;.&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;.....&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;.. Spurp
4 &gt;&gt;&gt;&gt;&gt;&gt;&gt;..&gt;&gt;&gt;&gt;........&lt;&lt;&lt;&lt;.&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;.....&gt;.&gt;&gt;&gt;.......&lt;&lt;&lt;.&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;. Spurp
5 .&gt;&gt;&gt;&gt;&gt;&gt;..&gt;&gt;&gt;&gt;........&lt;&lt;&lt;&lt;.&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;.....&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;.. Spurp
6 &gt;&gt;&gt;&gt;&gt;&gt;&gt;..&gt;&gt;&gt;&gt;........&lt;&lt;&lt;&lt;.&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;......&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;.&lt;&lt;&lt;&lt;&lt;&lt;&lt;. Spurp
</code></pre>
<p>I hope this helps, and thanks for the fun little Sunday afternoon R challenge!</p>
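For anyone who wants to try the link-extraction step without hitting the live site, here is a minimal sketch of the same idea in base R. It is an assumption-laden variant, not the function used above: the HTML fragment is made up for illustration (only the names Bdist and Spurp come from the scrape above), `extract_genome_dirs()` is a hypothetical helper, and the pattern is stricter than the original test because it requires the link target to start with an upper-case letter.

```r
# Hypothetical HTML fragment standing in for the real index page.
# Its shape is an assumption; only the names Bdist and Spurp appear above.
sample_html <- paste0(
  '<a href="Bdist/">genome one</a> ',
  '<a href="Spurp/">genome two</a> ',
  '<a href="#top">Top</a> ',
  '<a href="http://example.org/About/">About</a>'
)

# Hypothetical helper: return the directory names of relative links whose
# target starts with an upper-case letter and ends with a slash.
extract_genome_dirs <- function(html) {
  pattern <- 'href="[[:upper:]][[:alnum:]_]*/"'
  hits <- regmatches(html, gregexpr(pattern, html))[[1]]
  # Strip the leading href=" and the trailing /" from each match
  dirs <- sub('^href="', "", hits)
  dirs <- sub('/"$', "", dirs)
  unique(dirs)
}

genome_dirs <- extract_genome_dirs(sample_html)
print(genome_dirs)
```

Anchored links such as `#top` and absolute URLs are skipped because neither starts with an upper-case letter right after the opening quote. Whether that rule fits the real page would need checking against its actual markup.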