Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>Your solution 2: using hash value doesn't force a memory problem. You just have to partition the hash space into slices that fits into memory. More precisely:</p> <p>Consider a hash table storing the set of records, each record is only represented by its index in the table. Say for example that such a hash table will be 4GB. Then you split your hash space in k=4 slice. Depending on the two last digits of the hash value, each record goes into one of the slice. So the algorithm would go roughly as follows:</p> <pre><code>let k = 2^M for i from 0 to k-1: t = new table for each record r on the disk: h = hashvalue(r) if (the M last bit of h == i) { insert r into t with respect to hash value h &gt;&gt; M } search t for duplicate and remove them delete t from memory </code></pre> <p>The drawback is that you have to hash each record k times. The advantage is that is it can trivially be distributed.</p> <p>Here is a prototype in Python:</p> <pre><code># Fake huge database on the disks records = ["askdjlsd", "kalsjdld", "alkjdslad", "askdjlsd"]*100 M = 2 mask = 2**(M+1)-1 class HashLink(object): def __init__(self, idx): self._idx = idx self._hash = hash(records[idx]) # file access def __hash__(self): return self._hash &gt;&gt; M # hashlink are equal if they link to equal objects def __eq__(self, other): return records[self._idx] == records[other._idx] # file access def __repr__(self): return str(records[self._idx]) to_be_deleted = list() for i in range(2**M): t = set() for idx, rec in enumerate(records): h = hash(rec) if (h &amp; mask == i): if HashLink(idx) in t: to_be_deleted.append(idx) else: t.add(HashLink(idx)) </code></pre> <p>The result is:</p> <pre><code>&gt;&gt;&gt; [records[idx] for idx in range(len(records)) if idx not in to_be_deleted] ['askdjlsd', 'kalsjdld', 'alkjdslad'] </code></pre>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload