Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>Do you really need to precompute all the possible pairs? What if you were to do it lazily, i.e. on an on-demand basis?</p> <p>This can be represented as a 2D matrix. The rows correspond to the customers and the columns correspond to the products.</p> <p>Each entry is either 0 or 1, saying whether the product corresponding to the column was bought by the customer corresponding to the row.</p> <p>If you look at each column as a vector of (around 5000) 0s and 1s, then the number of times two products were bought together is just the dot product of the corresponding vectors!</p> <p>Thus you can just compute those vectors first, and then compute the dot product lazily, on demand.</p> <p>To compute the dot-product:</p> <p>Now, a good representation of a vector with only 0s and 1s is an array of integers, which is basically a bitmap.</p> <p>For 5000 entries, you will need an array of 79 64-bit integers.</p> <p>So given two such arrays, you need to count the number of 1s that are common. </p> <p>To count the number of bits common to two integers, first you can do the bitwise AND and then count the numbers of 1s that are set in the resulting number.</p> <p>For this either you can use lookup tables or some bitcounting methods (not sure if python will support them), like here: <a href="http://graphics.stanford.edu/~seander/bithacks.html" rel="nofollow noreferrer">http://graphics.stanford.edu/~seander/bithacks.html</a></p> <p>So your algorithm will be something like this:</p> <ul> <li><p>Initialize an array of 79 64 bit integers for each product.</p></li> <li><p>For each customer, look at the products bought and set the appropriate bit for that customer in that corresponding products.</p></li> <li><p>Now given a query of two products for which you need to know the number of customer who bought them together, just take the dot-product as describe above.</p></li> </ul> <p>This should be reasonably fast. </p> <p>As a further optimization, you can possibly consider grouping the customers.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload