StackOverflow2013

<p>Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speakers' names.</p>
<p>If so, you might say x is your text, then use <code>strsplit(x, "STATEMENT OF")</code> to split on the words STATEMENT OF, then <code>grep()</code> or <code>str_extract()</code> to return the 2 or 3 words after SENATOR (do they always have only two names as in your example?).</p>
<p>Have a look here for more on the use of these functions, and text manipulation in general in <code>R</code>: <a href="http://en.wikibooks.org/wiki/R_Programming/Text_Processing" rel="nofollow">http://en.wikibooks.org/wiki/R_Programming/Text_Processing</a></p>
<p><strong>UPDATE</strong> Here's a more complete answer...</p>
<pre><code># create an object containing all the text
x <- c("OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN The Chairman. Good afternoon to everybody, and thank you very much for coming to this hearing this afternoon. In today's tough economic climate, millions of seniors have lost a big part of their retirement and investments in only a matter of months. Unlike younger Americans, they do not have time to wait for the markets to rebound in order to recoup a lifetime of savings. STATEMENT OF SENATOR BIG APPLE KOHL, CHAIRMAN I am trying to identify the most frequently used words in the congress speeches, and have to separate them by the congressperson. I am just starting to learn about R and the tm package. I have a code that can find the most frequent words, but what kind of a code can I use to automatically identify and store the speaker of the speech STATEMENT OF SENATOR LITTLE ORANGE, CHAIRMAN Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.")

# split the text on the phrase "STATEMENT OF"
y <- unlist(strsplit(x, "STATEMENT OF"))

# load the library containing a handy function
library(stringr)

# use word() to return the words in positions 3 to 4 of each string,
# which is where the first and last names are (each piece starts with
# a space, so the empty string before it counts as word 1)
z <- word(y[2:4], 3, 4)
# note that the first element of the character vector y has only one word,
# and this function gives an error if there are not enough words in the string

z # have a look at the result...
# [1] "HERB KOHL," "BIG APPLE" "LITTLE ORANGE,"
</code></pre>
<p>Note that a three-word name such as BIG APPLE KOHL is cut to its first two words, and the trailing commas are kept. No doubt a regular expressions wizard could come up with something to do it quicker and neater!</p>
<p>Anyway, from here you can run a function to calculate word frequencies on each element of the vector <code>y</code> (i.e. each speaker's speech) and then make another object that combines the word frequency results with the names for further analysis.</p>
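<p>For what it's worth, here is a minimal sketch of that last step using base R only. It assumes every header has the form <code>SENATOR NAME, TITLE</code> with a single all-caps title word, as in the example above. The sample text, the <code>word_freq()</code> helper and the variable names are made up for illustration.</p>
<pre><code># hypothetical sample text with two speakers
x <- "OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN Good afternoon to everybody. Good afternoon again. STATEMENT OF SENATOR BIG APPLE KOHL, CHAIRMAN Thank you, thank you all for coming."

# one chunk per speaker; drop the piece before the first header
y <- unlist(strsplit(x, "STATEMENT OF", fixed = TRUE))[-1]

# take everything between "SENATOR" and the first comma as the name,
# so names of two or three words are both handled
m <- regmatches(y, regexpr("SENATOR\\s+[^,]+", y, perl = TRUE))
speaker <- sub("^SENATOR\\s+", "", m, perl = TRUE)
# speaker now holds HERB KOHL and BIG APPLE KOHL

# strip the header (name, comma, one all-caps title word) from each chunk
# assumption: the title is a single word such as CHAIRMAN
speech <- sub("^\\s*SENATOR\\s+[^,]+,\\s*[A-Z]+\\s+", "", y, perl = TRUE)

# hypothetical helper: lower-case the text, split on anything that is
# not a letter or apostrophe, and count the words
word_freq <- function(txt) {
  words <- tolower(unlist(strsplit(txt, "[^A-Za-z']+")))
  words <- words[nzchar(words)]
  sort(table(words), decreasing = TRUE)
}

# a list of frequency tables, one per speaker, named by speaker
freqs <- lapply(speech, word_freq)
names(freqs) <- speaker
</code></pre>
<p>After this, <code>freqs[["HERB KOHL"]]</code> gives the word counts for that speaker. If some chunks have no <code>SENATOR</code> header, <code>regmatches()</code> silently drops them, so check that <code>speaker</code> and <code>speech</code> have the same length before naming the list.</p>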
