I just re-read the original question and I realize the answers, mine included, got off base. I think the asker just wanted to solve a simple programming problem, not look for datasets.

If you list all distinct word pairs and count them, you can answer the question with simple math on that list.

Of course you have to do a lot of processing to generate the list. While it's true that if the total number of distinct words is as high as 30,000 there are nearly a billion possible pairs, I doubt that in practice there are that many. So you can probably make a program with a huge hash table in memory (or on disk) and just count them all. If you don't need the insignificant pairs, you could write a program that flushes out the less important ones periodically while scanning (see the first sketch below). You can also segment the word list and generate pairs of a hundred words versus the rest, then the next hundred, and so on, calculating in passes.

My original answer follows; I'm leaving it because it describes my own related problem:

I'm interested in something similar (I'm writing an entry system that suggests word completions and punctuation, and I would like it to be multilingual).

I found a download page for Google's ngram files, but they're not that good; they're full of scanning errors: 'i's become '1's, words run together, and so on. Hopefully Google has improved its scanning technology since then.

The just-download-Wikipedia-unpack-it-and-strip-the-XML idea is a bust for me; I don't have a fast computer (heh, my choice is between an Atom netbook and an Android device). Imagine how long it would take me to unpack a 3 GB bz2 file into what, 100 GB of XML, then process it with Beautiful Soup and filters that, he admits, crash partway through each file and need to be restarted.

For your purpose (previous and following words) you could create a dictionary of real words and use it to filter the ngram lists, excluding the mis-scanned entries (see the second sketch below). One might hope the scanning was good enough that you could exclude mis-scans by taking only the most popular words, but I saw signs of consistent mistakes.

The ngram datasets are here, by the way: http://books.google.com/ngrams/datasets

This site may have what you want: http://www.wordfrequency.info/
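To make the counting idea concrete, here is a minimal sketch in Python of the count-everything-and-flush approach. The corpus file name, the flush interval, and the cutoff are all illustrative assumptions, not anything from the question:

```python
from collections import Counter
import re

FLUSH_EVERY = 10_000_000  # pairs ingested between flushes (illustrative)
MIN_COUNT = 2             # pairs rarer than this get dropped at each flush (illustrative)

def count_pairs(lines):
    """Count adjacent word pairs with a big hash table, periodically
    flushing out the insignificant ones so memory stays bounded.
    Note: a flushed pair restarts from zero if seen again, so counts
    for rare pairs are approximate lower bounds."""
    counts = Counter()
    seen = 0
    for line in lines:
        words = re.findall(r"[a-z']+", line.lower())
        for pair in zip(words, words[1:]):
            counts[pair] += 1
            seen += 1
            if seen % FLUSH_EVERY == 0:
                counts = Counter({p: c for p, c in counts.items() if c >= MIN_COUNT})
    return counts

# With the table built, "simple math on that list" answers questions like
# "what most often follows this word?" (corpus.txt and "strong" are stand-ins).
with open("corpus.txt", encoding="utf-8") as f:
    pairs = count_pairs(f)

followers = Counter({b: c for (a, b), c in pairs.items() if a == "strong"})
print(followers.most_common(10))
```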
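And a second minimal sketch, for the dictionary-filtering idea. It assumes a tab-separated ngram layout (the ngram first, year and counts after, as in the Google Books files) and a local wordlist such as /usr/share/dict/words; the file names and layout are assumptions to verify against what you actually download:

```python
import gzip

# Load a dictionary of real words; /usr/share/dict/words is a stand-in
# for whatever wordlist you trust for your language(s).
with open("/usr/share/dict/words", encoding="utf-8") as f:
    real_words = {line.strip().lower() for line in f}

def looks_real(ngram_line):
    """Keep an ngram entry only if every token is a dictionary word,
    discarding mis-scanned entries like '1' for 'i' or run-together words."""
    ngram = ngram_line.split("\t", 1)[0]
    return all(token.lower() in real_words for token in ngram.split())

with gzip.open("googlebooks-2grams.gz", "rt", encoding="utf-8") as src, \
     open("filtered-2grams.tsv", "w", encoding="utf-8") as dst:
    for line in src:
        if looks_real(line):
            dst.write(line)
```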
 
