StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>There's really no easy answer to this. You can certainly store the Jaccard index (actually I'd just store the ones that meet the criteria, and throw out the rest), but the real problem is calculating it (effectively have to scan all of your existing order each time a new order was entered in to the system to calculate the new index).</p> <p>That can be quite expensive depending on your volume of orders that's you're maintaining. Maybe you only compare it to the last year of orders, or something.</p> <p>If you're doing it on the fly, it gets more interesting, but still expensive.</p> <p>You can readily get a list of all orders that have the same product items. One list per item. This, in fact, is not necessarily a lot of data (if you have a lot of orders for a single popular item, then it can be a long list). The individual queries aren't particularly insane either (again depending on your data). If you have a vast vast amount of data, the query can be readily map/reduced and even work with sharded data stores. Bitmap indexes (if your DB support this) are particularly good for getting lists like this quite quickly.</p> <p>Then you can simply count the times that an order number occurs in all of the lists, and then drop those off that don't meet the threshold. That's a straight forward merge operation.</p> <p>But you'd have to do this calculation every single time you'd want the information, since you can't really store it.</p> <p>So, it really does boil down to what you need the information for, how often you need it, your items <-> order distribution, how long you can wait for it, etc.</p> <p>Addenda:</p> <p>Thinking about it a little more, this is a simple query, but it may take some time to run. Likely not much with modern hardware, you don't really have that much data. For a single screen viewing an order you wouldn't notice it. If you were running report across all orders, then you would definitely notice it -- and would need a different approach.</p> <p>Lets consider an order with 20 line items.</p> <p>And you want an 85% match. That means orders that have 17 or more items in common.</p> <p>Here is a query that will give you the orders you're interested in:</p> <pre><code>SELECT orderId, count(*) FROM OrderItem WHERE itemId in ('list', 'of', 'items', 'in', 'order', 123, 456, 789) GROUP BY orderId HAVING count(*) >= 17 </code></pre> <p>So, this gives you a collection of all the line items with the same items as your order. Then you simply sum them up by orderId, and those that are equal to or greater than your threshold (17 in this case), are the candidate orders.</p> <p>Now, you don't say how many items you have in your catalog. If you have 1000 items, perfectly distributed, this query will chew on 1600 rows of data -- which is no big deal. With proper indexes this should go quite quickly. However, if you have items that are "really popular", then you're going to chew through a lot more rows of data.</p> <p>But, again, you don't have that much data. Most of this query can be done within the indexes on a proper database and not even hit the actual tables. So, as I said, you'll likely not notice the impact of this query on a interactive system.</p> <p>So, give it a try and see how it goes for you.</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload