Categorizing data based on the data's signature
    <p>Let us say I have a large collection of rows of data, where each element in a row is a (key, value) pair:</p>
<pre><code>1) [(bird, "eagle"), (fish, "cod"), ... , (soda, "coke")]
2) [(bird, "lark"), (fish, "bass"), ..., (soda, "pepsi")]
n) ....
n+1) [(bird, "robin"), (fish, "flounder"), ..., (soda, "fanta")]
</code></pre>
    <p>Given a new row, I would like to be able to compute which existing row is "most similar" to it.</p>
    <p>The most direct way I can think of to find the "most similar" row is to compare the new row against every other row. With a large collection, this is computationally very expensive.</p>
    <p>I am looking for a solution of the following form.</p>
    <ul>
    <li><p>A function that takes a row and generates a derived integer for it. The returned integer would be a sort of "signature" of the row. The important property of this signature is that two very "similar" rows generate integers that are close together, while very "different" rows generate integers that are far apart. Identical rows would, of course, generate the same signature.</p></li>
    <li><p>I could then take these generated signatures, each paired with the index of the row it points to, and sort them by signature. I would keep this data structure around so that I can do fast lookups. Call it database B.</p></li>
    <li><p>When I have a new row and want to know which existing row in database B is most similar, I would:</p>
    <ol>
    <li>Generate a signature for the new row</li>
    <li>Binary search through the sorted list of (signature, index) pairs in database B for the closest match</li>
    <li>Return the closest matching row in database B (which could be a perfect match).</li>
    </ol></li>
    </ul>
    <p>I know there is a lot of hand waving in this question. My problem is that I do not know what function would generate this signature. I have looked at Levenshtein distance, but that represents the cost of transforming one row into another rather than a signature of a single row. I could also try lossy compression: two rows might be "bucketable" if they compress to the same thing. I am looking for other ideas on how to do this.</p>
    <p>Thank you.</p>
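    <p>To make the lookup procedure above concrete, here is a minimal Python sketch of the "database B" idea: build a sorted list of (signature, index) pairs, then binary search it for the nearest signature. The names <code>toy_signature</code>, <code>build_index</code>, and <code>closest_row</code> are hypothetical, and <code>toy_signature</code> is only a placeholder (it sums the lengths of the values). It is not the similarity-preserving function being asked for, which remains the open part of the question.</p>

```python
import bisect


def toy_signature(row):
    """Hypothetical placeholder, NOT a real similarity-preserving signature.

    It sums the lengths of the values, so identical rows map to the same
    integer, but nearby integers do not imply similar rows in general.
    """
    return sum(len(value) for _key, value in row)


def build_index(rows, signature):
    """Return (signature, row_index) pairs sorted by signature ('database B')."""
    return sorted((signature(row), i) for i, row in enumerate(rows))


def closest_row(index, new_row, signature):
    """Binary search the sorted index for the signature nearest to new_row's."""
    if not index:
        raise ValueError("index is empty")
    target = signature(new_row)
    # First position whose signature is >= target.
    pos = bisect.bisect_left(index, (target,))
    # Only the entries on either side of that position can be the closest.
    candidates = index[max(pos - 1, 0):pos + 1]
    _sig, row_idx = min(candidates, key=lambda pair: abs(pair[0] - target))
    return row_idx


# Shortened versions of the example rows from the question.
rows = [
    [("bird", "eagle"), ("fish", "cod"), ("soda", "coke")],
    [("bird", "lark"), ("fish", "bass"), ("soda", "pepsi")],
    [("bird", "robin"), ("fish", "flounder"), ("soda", "fanta")],
]
index = build_index(rows, toy_signature)
```

    <p>With this layout, a lookup costs one signature computation plus a logarithmic-time search, for example <code>closest_row(index, new_row, toy_signature)</code>. How meaningful the returned row is depends entirely on the signature function that replaces the placeholder.</p>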