Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>Answer to this question totally depend on the size of unique words, If the unique word count is small then you can use any string->number mapping data-structure (e.g. Trie tree) to count word frequency. Complexity will be <code>n log(m)</code> (m is the length of individual words), easy to implement. But the way problem was described, most probably unique word count is big enough to be able to store in the memory. In that case this following approach can be used:</p> <p>1 TB data means there are about <code>1.0*10^12</code> bytes of data in the input file. 1 byte is one character and lets say on average a single word has 4 characters then we have about <code>2.5*10^11</code> words. We will divide this word list in to <code>50k</code> different word lists. So, every time we will read about <code>5m</code> unread words from the input file, sort this <code>5m</code> word list and write this sorted list in a file. We will use <code>50k</code> array of numbers (lets call it <code>Parray</code>) to store the starting location of all sorted list in the file (initially <code>Parray</code> will have numbers like: 0, 5m+1, 10m+1 etc). Now read the top <code>50k</code> words from all list, put them in a min heap, you will get the smallest word on top of the heap. After getting the current smallest word (lets call it <code>cur_small</code>) from all sorted list read out that word from each list (after this operation your <code>Parray</code> will point to the next smallest word in each list). Here you will get the count of <code>cur_small</code> - so take the decision based on <code>K</code> then remove all entries of <code>cur_small</code> from the heap and lastly add a new word from each list to the heap from where at-least one word was <code>cur_small</code>. Continue this process until you read out all the sorted lists. Over all complexity is <code>n log(n)</code> </p>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload