
Theoretical interview questions like this always deal in small numbers (say, 10 words). The number itself means nothing; it is there to separate out the candidates who can really reason about the problem in its general form from those who simply regurgitate fixed answers to fixed interview questions they find on the internet.

The best software houses will only favour solutions that are scalable. You will gain top marks in an interview if your answer is simple but also scales to any size of problem (or, in this case, document). So forget sorting, loops inside loops, and O(n^2) complexity; if you presented any solution like those to a leading-edge software company at interview, you would fail.

Your particular question is checking whether you know about hash tables. The most efficient solution to this problem can be written in pseudo-code as follows:

    1. Initialise a new hash table.
    For each word in the document:
      2. Generate a hash key for the word.
      3. Look up the word in the hash table using the key.
      4. If it is found, increment the count for the word.
      5. Otherwise, store the new word in the table and set its count to one.

The most important benefit of this solution is that only a single scan of the document is required: no reading words into memory and then processing them (two scans), no loops inside loops (many scans), no sorting (even more passes). After exactly one pass of the document, reading out the keys of the hash table tells you exactly how many times each word appeared. Any word with a count greater than one is a duplicate.

The secret to this solution is its use of hash tables. Generating the hash key (step 2), key lookup (step 3), and key storage (step 5) can all be implemented as near constant-time operations. This means the time these steps take hardly changes as the size of the input (the number of words) grows: whether it is the 10th word in a document or the 10 millionth, inserting that word into the hash table (or looking it up) takes roughly the same very small amount of time. In this case, we additionally keep a count of each word's frequency in step 5; incrementing a value is likewise a very efficient constant-time operation.

Any solution to this problem must scan every word in the document at least once. As our solution processes each word exactly once, with each word taking approximately the same very small constant time, it performs optimally and scales linearly, yielding O(n) performance (put simply, processing 1,000,000 words takes around 1,000 times longer than processing 1,000 words). In all, a scalable and efficient solution to the problem.
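The pseudo-code above can be sketched in Python, where the built-in `dict` serves as the hash table (the function name and sample text here are illustrative, not from the original answer):

```python
def count_words(document: str) -> dict[str, int]:
    """Single-pass word frequency count using a hash table (Python dict)."""
    counts: dict[str, int] = {}             # Step 1: initialise a new hash table.
    for word in document.lower().split():
        # Steps 2-3: hashing the word and looking it up both happen
        # inside the dict operations, in near constant time.
        if word in counts:
            counts[word] += 1               # Step 4: increment the existing count.
        else:
            counts[word] = 1                # Step 5: store the new word, count = 1.
    return counts

counts = count_words("the quick fox jumped over the lazy dog")
duplicates = [w for w, c in counts.items() if c > 1]
print(duplicates)  # ['the']
```

Each word is touched exactly once, so the running time grows linearly with the document size, matching the O(n) analysis above.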