Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>With two points of the requirement being about inexact solution, I'm guessing they're looking for a fast shortcut approximation instead of an exhaustive search. So here's my idea:</p> <p>Suppose that there is absolutely no correlation between a fan's choices for favorite artists. This is, of course, surely false. Someone who likes Rembrandt is far more likely to also like Rubens then he is to also like Pollock. (You did say we were picking favorite artists, right?) I'll get back to this in a moment.</p> <p>Then make a pass through the data, counting the number of distinct artists, the number of fans, and how often each artist shows up as a favorite. When we're done making this pass: (1) Eliminate any artists who don't individually show up the required number of "pair times". If an arist only shows up 40 times, he can't possibly be included in more than 40 pairs. (2) For the remaining artists, convert each "like count" to a percentage, i.e. this artist was liked by, say, 10% of the fans. Then for each pair of artists, multiple their like percentages together and then multiply by the total number of fans. This is the estimated number of times they'd show up as a pair.</p> <p>For example, suppose of 1000 fans, 200 say they like Rembrandt and 100 say they like Michaelangelo. That means 20% for Rembrandt and 10% for Michaelangelo. So if there's no correlation, we'd estimate that 20% * 10% * 1000 = 20 like both. This is below the threshold so we wouldn't include this pair.</p> <p>The catch to this is that there almost surely is a correlation between "likes". My first thought would be to study some real data and see how much of a correlation there is, that is, how the real pair counts differs from the estimate. If we find that, say, the real count is rarely more than twice the estimated count, then we could just say that any pair that gives an estimate over 1/2 of the threshold we declare a "candidate". Then we do an exhaustive count on the candidates to see how many really meet the condition. This would allow us to eliminate all the pairs that fall well below the threshold as "unlikely" and thus not worth the cost of investigating.</p> <p>This could miss pairs when the artists almost always occur together. If, say, 100 fans like Picasso, 60 like Van Gogh, and of the 60 who like Van Gogh 50 also like Picasso, their estimate will be MUCH lower than their actual. If this happens rarely enough it may fall into the acceptable "exact answer not required" category. If it happens all the time this approach won't work.</p>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload