<p>This is in response to your original question, and your later <a href="https://stackoverflow.com/questions/3315910/php-word-index-performance-and-reasonable-results#answer-3316529">answer/question</a>.</p> <p>I've used the <a href="http://www.sphinxsearch.com/" rel="nofollow noreferrer">Sphinx</a> search engine before (quite a while ago, so I'm a bit rusty), and found it to be very good, even if the documentation is sometimes a bit lacking.</p> <p>I'm sure there are other ways to do this, either with your own custom code or with other search engines; Sphinx just happens to be the one I've used. I'm not suggesting that it will do everything you want, just the way you want, but I am reasonably certain that it will do most of it quite easily, and a lot faster than anything written in PHP/MySQL alone.</p> <p>I recommend reading <a href="http://www.ibm.com/developerworks/library/os-php-sphinxsearch/" rel="nofollow noreferrer">Build a custom search engine with PHP</a> before digging into the <a href="http://www.sphinxsearch.com/docs/current.html" rel="nofollow noreferrer">Sphinx documentation</a>. If you don't think it's suitable after reading that, fair enough.</p> <p>In answer to your specific questions, I've put together some links from the documentation, along with some relevant quotes:</p> <p><strong>filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)</strong></p> <p><a href="http://www.sphinxsearch.com/docs/current.html#conf-stopwords" rel="nofollow noreferrer">11.2.8. stopwords</a> </p> <blockquote> <p>Stopwords are the words that will not be indexed. 
Typically you'd put most frequent words in the stopwords list because they do not add much value to search results but consume a lot of resources to process.</p> </blockquote> <p><strong>With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different from "cpu")?</strong></p> <p><a href="http://www.sphinxsearch.com/docs/current.html#conf-wordforms" rel="nofollow noreferrer">11.2.9. wordforms</a> </p> <blockquote> <p>Word forms are applied after tokenizing the incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (eg. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.</p> </blockquote> <p><strong>Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)</strong></p> <p>Sphinx supports the <a href="http://tartarus.org/~martin/PorterStemmer/index.html" rel="nofollow noreferrer">Porter Stemming Algorithm</a> </p> <blockquote> <p>The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.</p> </blockquote> <p><strong>Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?</strong></p> <p><a href="http://www.sphinxsearch.com/docs/current.html#attributes" rel="nofollow noreferrer">3.2. Attributes</a> </p> <blockquote> <p>A good example for attributes would be a forum posts table. 
Assume that only title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (ie. search only those rows that have some specific values of author_id or forum_id columns in the SQL table); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.</p> <p>This can be achieved by specifying all the mentioned columns (excluding title and content, that are full-text fields) as attributes, indexing them, and then using API calls to setup filtering, sorting, and grouping.</p> </blockquote> <p>You can also use the <a href="http://www.sphinxsearch.com/docs/current.html#extended-syntax" rel="nofollow noreferrer">5.3. Extended query syntax</a> to search specific fields (as opposed to filtering results by attributes):</p> <blockquote> <p>field search operator: @vendor intel</p> </blockquote> <p><strong>How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?</strong></p> <p><a href="http://www.sphinxsearch.com/docs/current.html#api-func-query" rel="nofollow noreferrer">8.6.1. Query</a> </p> <blockquote> <p>On success, Query() returns a result set that contains some of the found matches (as requested by SetLimits()) and additional general per-query statistics.</p> <p>The result set is a hash (PHP specific; other languages might utilize other structures instead of hash) with the following keys and values:</p> <p>"matches":<br> Hash which maps found document IDs to another small hash containing document weight and attribute values (or an array of the similar small hashes if SetArrayResult() was enabled).</p> <p>"total":<br> Total amount of matches retrieved on server (ie. to the server side result set) by this query. 
You can retrieve up to this amount of matches from server for this query text with current query settings.</p> <p>"total_found":<br> Total amount of matching documents in index (that were found and processed on server).</p> <p>"words":<br> Hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statistics ("docs", "hits").</p> <p>"error":<br> Query error message reported by searchd (string, human readable). Empty if there were no errors.</p> <p>"warning":<br> Query warning message reported by searchd (string, human readable). Empty if there were no warnings.</p> </blockquote> <p>Also see <a href="http://www.ibm.com/developerworks/library/os-php-sphinxsearch/#list11" rel="nofollow noreferrer">Listing 11</a> and <a href="http://www.ibm.com/developerworks/library/os-php-sphinxsearch/#list13" rel="nofollow noreferrer">Listing 13</a> from <a href="http://www.ibm.com/developerworks/library/os-php-sphinxsearch/" rel="nofollow noreferrer">Build a custom search engine with PHP</a>.</p>
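<p>To make the result-set shape quoted above a little more concrete, here is a rough sketch of walking such a structure. It is written in Python purely for illustration and does not talk to searchd. The sample data is invented, the <code>summarize</code> helper is hypothetical, and the inner key names <code>weight</code> and <code>attrs</code> are my assumption about how the per-document hash is laid out, so check them against what your client actually returns.</p>

```python
# Hypothetical sample shaped like the result set described in the quoted docs.
# All IDs, weights, and attribute values below are invented for illustration.
sample_result = {
    "error": "",
    "warning": "",
    "total": 2,
    "total_found": 2,
    "matches": {
        # document ID -> small hash; "weight"/"attrs" key names are assumed
        101: {"weight": 2, "attrs": {"vendor_id": 7}},
        205: {"weight": 1, "attrs": {"vendor_id": 7}},
    },
    "words": {
        # processed keyword -> per-keyword statistics
        "intel": {"docs": 2, "hits": 3},
        "cpu": {"docs": 1, "hits": 1},
    },
}


def summarize(result):
    """Return (document IDs ranked by weight, documents matched per keyword)."""
    if result.get("error"):
        raise RuntimeError(result["error"])
    matches = result.get("matches", {})
    # Highest weight first; ties are broken by document ID for a stable order.
    ranked = sorted(matches, key=lambda doc_id: (-matches[doc_id]["weight"], doc_id))
    docs_per_word = {
        word: stats["docs"] for word, stats in result.get("words", {}).items()
    }
    return ranked, docs_per_word


ranked_ids, docs_per_word = summarize(sample_result)
print(ranked_ids)
print(docs_per_word)
```

<p>The point is that the engine hands back document IDs plus attribute values rather than the rows themselves, so you would typically use the ranked IDs to fetch the full records from MySQL afterwards.</p>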
 
