Note that there are some explanatory texts on larger screens.

plurals
  1. POFastest way to find relevant results in array from an input array
    primarykey
    data
    text
    <p>As a mostly front-end developer, this is in the realm of computer science that I don't often delve into, but here's my scenario:</p> <p>I've got an input of a string, split on spaces, say <code>"pinto beans"</code></p> <p>I've got a array of results to search, that contains results like: <code>["beans, mung","beans, pinto","beans, yellow","beans, fava"]</code></p> <p>what might be the quickest way (preferably in javascript or php) to find the most "relevant" results, aka most matches, for instance, in the above case, I would like to sort the return array so that <code>"beans, pinto"</code> is put at the top, and the rest come below, and any other results would go below those.</p> <p>My first attempt at this would be to do something like matching each result item against each input item, and incrementing matches on each one, then sorting by most matches to least.</p> <p>This approach would require me to iterate through the entire result array a ton of times though, and I feel that my lack of CS knowledge is leaving me without the best solution here.</p> <p>/* EDIT: Here's how I ended up dealing with the problem: */</p> <p>Based on crazedfred's suggestion and the blog post he mentioned (which was VERY helpful), I wrote some php that basically uses a combination of the trie method and the boyer-moore method, except searching from the beginning of the string (as I don't want to match "bean" in "superbean").</p> <p>I chose php for the ranking based on the fact that I'm using js libraries, and getting real benchmarks while using convenience functions and library overhead wouldn't produce the testable results I'm after, and I can't guarantee that it won't explode in one browser or another.</p> <p>Here's the test data:</p> <p>Search String: <code>lima beans</code></p> <p>Result array (from db): <code>["Beans, kidney","Beans, lima","Beans, navy","Beans, pinto","Beans, shellie","Beans, snap","Beans, mung","Beans, fava","Beans, adzuki","Beans, baked","Beans, black","Beans, black turtle soup","Beans, cranberry (roman)","Beans, french","Beans, great northern","Beans, pink","Beans, small white","Beans, yellow","Beans, white","Beans, chili","Beans, liquid from stewed kidney beans","Stew, pinto bean and hominy"]</code></p> <p>First, I drop both the search string and the result array into php variables, after <code>explode()</code>ing the string on spaces.</p> <p>then, I precompile my patterns to compare the results to:</p> <pre><code>$max = max(array_map('strlen',$input)); $reg = array(); for($m = 0; $m &lt; $max; $m++) { $reg[$m] = ""; for($ia = 0; $ia &lt; count($input); $ia++) { $reg[$m]. = $input[$ia][$m]; } } </code></pre> <p>this gives me something like : <code>["lb","ie","ma","an","s"]</code></p> <p>then, I basically take each result string (split on spaces), and match a case insensitive character class with the corresponding character number to it. If at any point during that comparison process I don't get any matches, I skip the word. This means if only 1 result starts with "b" or "l", I'll only run one comparison per WORD, which is really fast. Basically I'm taking the part of trie that compiles the searches together, and the constant speedup of the Boyer-Moore stuff.</p> <p>Here's the php - I tried <code>while</code>s, but got SIGNIFICANTLY better results with <code>foreach</code>es:</p> <pre><code>$sort = array(); foreach($results as $result) { $matches = 0; $resultStrs = explode(' ', $result); foreach($resultStrs as $r) { $strlen = strlen($r); for($p = 0; $p &lt; $strlen; $p++) { if($reg[$p]) preg_match('/^['.$reg[$p].']/i',$r[$p],$match); if($match==true) { $matches++; } else { break 2; } } } $sort[$result] = $matches; } </code></pre> <p>That outputs an array with the results on the keys, and how many character matches we got in total on the values.</p> <p>The reason I put it that way is to avoid key collisions that would ruin my data, and more importantly, so I can do a quick <code>asort</code> and get my results in order.</p> <p>That order is in reverse, and on the keys, so after the above code block, I run:</p> <pre><code>asort($sort); $sort = array_reverse(array_keys($sort)); </code></pre> <p>That gives me a properly indexed array of results, sorted most to least relevant. I can now just drop that in my autocomplete box.</p> <p>Because speed is the whole point of this experiment, here's my results - obviously, they depend partially on my computer.</p> <p>2 input words, 40 results: ~5ms 2 input words, (one single character, one whole) 126 results: ~9ms</p> <p>Obviously there's too many variables at stake for these results to mean much to YOU, but as an example, I think it's pretty impressive.</p> <p>If anyone sees something wrong with the above example, or can think of a better way than that, I'd love to hear about it. The only thing I can think of maybe causing problems right now, is if I were to search for the term <code>lean bimas</code>, I would get the same result score as <code>lima beans</code>, because the pattern isn't conditional based on the previous matches. Because the results I'm looking for and the input strings I'm expecting shouldn't make this happen very often, I've decided to leave it how it is for now, to avoid adding any overhead to this quick little script. However, if I end up feeling like my results are being skewed by it, I'll come back here and post about how I sorted that part.</p>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload