StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POIdeal functional language for implementing a full text search with .NET
text
Body
copied!<p>During my studies of computer science I came accross some functional languages like Prolog but now I have only been doing imperative stuff like C#, Ruby JavaScript and Java for the past 10 years. Currently I am creating a full text search engine for an online shop and I have come quite far the "imperative way" already. But having stumbled accross some functional languages like Haskell of Clojure it became clear that the functional paradigm fits so much better and that the imperative way is just not the right tool for this job. </p> <p>So we have a full text index of some 10 million records. Each record basically contains a word occurrence, along with the id and text position from the record of which it originates. </p> <p>When the user enters a search string it is parsed into an expression tree. For example the search string "transformer 100 W" results in something like </p> <pre><code>AND('transformer%', OR(NEAR('100', 'W'), NEAR('100', 'watts'), '100W', '0.1kW')) </code></pre> <p>There is some extra "intelligence" going on here, but that is of no concern for this question.</p> <p>The expression tree is then evaluated recursively and results in a couple of sql queries that could return up to 100,000 rows in the form of .NET-DataTables. These are then read into sets or dictionaries and, depending on the predicates, intersections and unions are applied in order to find all results that match the entire search expression. For the NEAR evaluation the position indexes of the found occurrences are compared as well. But this is all done imperatively, with a lot of for-loops.</p> <p>Additionally there is a ranking function that adds up scores of the found word occurrences. Words that are only found as prefixes or with fuzzy matching (done by the database server) get lower scores than precise matches.</p> <p>For each resulting item I also need to get a list of all word occurrences that were matched, in order to highlight these words in the result pages.</p> <p>So roughly the evaluation algorithm is a function like</p> <pre><code>expression tree, full text index -> resulting items that match the expressin tree, each with a ranking sum and a list of all found word occurrences for this item </code></pre> <p>I am just giving a rough overview here, but I hope you get enough of a picture.</p> <p>Now the "real world" constraints:</p> <ul> <li>The whole application (up to now) is written in C#, so an easy integration with .NET is paramount.</li> <li>Loads of data is read into .NET-DataTables and will then need to be evaluated and transformed. The results should be contained in .NET types (Dictionaries, Sets, Arrays, whatever...).</li> <li>Performance is of great importance. At present my algorithm often takes two seconds for a search (not counting the sql), which is kind of ok, but should be improved. Our server has 16 processors, so parallel processing would be welcome. Since we get about one search request per second and the current implementation is single threaded, processor time is still available.</li> <li>The language (and the compiler) should be mature.</li> </ul> <p>Since I need to stay with .NET, I was looking into Clojure-CLR, F# and Scala for .NET. </p> <p>I like the concepts of Clojure a lot, but right now I can not assess, whether it would be up to the job. Reading about F# gave me mixed feelings, since it seems to want to be able to do just about everything, whereas I would tend to a more "pure" mathematical approach for the given task. But maybe that is possible with F# as well and I am not yet aware of it. I haven't delved into Scala much yet, but it seems to be well established.</p> <p>Any insights would be welcome!</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload