StackOverflow2013

<p>For a real language, a lexer's the way to go - <a href="https://stackoverflow.com/questions/713559/how-do-i-tokenize-this-string-in-ruby/713608#713608">like Guss said</a>. But if the full language is only as complicated as your example, you can use this quick hack:</p>
<pre><code>irb> text = %{Children^10 Health "sanitation management"^5}
irb> text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/).map do |word,phrase,boost|
  { :keywords => (word || phrase).downcase,
    :boost => (boost.nil? ? nil : boost.to_i) }
end
#=> [{:boost=>10, :keywords=>"children"}, {:boost=>nil, :keywords=>"health"}, {:boost=>5, :keywords=>"sanitation management"}]
</code></pre>
<p>If you're trying to parse a regular language then this method will suffice - though it wouldn't take many more complications to make the language non-regular.</p>
<p>A quick breakdown of the regex:</p>
<ul>
<li><code>\w+</code> matches any single-term keyword.</li>
<li><code>(?:\\.|[^\\"])*</code> uses non-capturing parentheses (<code>(?:...)</code>) to match the contents of a double-quoted string with escapes - either an escaped symbol (<code>\n</code>, <code>\"</code>, <code>\\</code>, etc.) or any single character that's not an escape symbol or an end quote.</li>
<li><code>"((?:\\.|[^\\"])*)"</code> captures only the contents of a quoted keyword phrase.</li>
<li><code>(?:(\w+)|"((?:\\.|[^\\"])*)")</code> matches any keyword - single term or phrase, capturing single terms into <code>$1</code> and phrase contents into <code>$2</code>.</li>
<li><code>\d+</code> matches a number.</li>
<li><code>\^(\d+)</code> captures a number following a caret (<code>^</code>).
Since this is the third set of capturing parentheses, it will be captured into <code>$3</code>.</li>
<li><code>(?:\^(\d+))?</code> captures a number following a caret if it's there, matches the empty string otherwise.</li>
</ul>
<p><code>String#scan(regex)</code> matches the regex against the string as many times as possible, outputting an array of "matches". If the regex contains capturing parens, a "match" is an array of items captured - so <code>$1</code> becomes <code>match[0]</code>, <code>$2</code> becomes <code>match[1]</code>, etc. Any capturing parenthesis that doesn't get matched against part of the string maps to a <code>nil</code> entry in the resulting "match".</p>
<p>The <code>#map</code> then takes these matches, uses some block magic to break each captured term into different variables (we could have done <code>do |match| ; word,phrase,boost = *match</code>), and then creates your desired hashes. Exactly one of <code>word</code> or <code>phrase</code> will be <code>nil</code>, since both can't be matched against the input, so <code>(word || phrase)</code> will return the non-<code>nil</code> one, and <code>#downcase</code> will convert it to all lowercase. <code>boost.to_i</code> will convert a string to an integer while <code>(boost.nil? ? nil : boost.to_i)</code> will ensure that <code>nil</code> boosts stay <code>nil</code>.</p>
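<p>Outside of irb, the same idea could be wrapped in a small method. This is only a sketch: the names <code>TOKEN</code> and <code>parse_query</code> are made up for illustration, the regex is the one from the answer, and the explicit <code>word, phrase, boost = *match</code> spells out what the block arguments were doing implicitly.</p>

```ruby
# Hypothetical names (TOKEN, parse_query); the regex is the one from the answer.
TOKEN = /(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/

def parse_query(text)
  # scan returns one [word, phrase, boost] array per match;
  # groups that did not take part in the match come back as nil.
  text.scan(TOKEN).map do |match|
    word, phrase, boost = *match
    {
      :keywords => (word || phrase).downcase,
      :boost    => (boost.nil? ? nil : boost.to_i)
    }
  end
end

result = parse_query(%{Children^10 Health "sanitation management"^5})

# A phrase with escaped quotes: the regex accepts the escapes but does not
# unescape them, so the backslashes stay in the captured keywords.
escaped = parse_query('"a \"quoted\" word"^2')

p result
p escaped
```

<p>Note that anything the regex can't match (stray punctuation, an unterminated quote mark) is silently skipped by <code>scan</code> rather than reported, which is one more reason to reach for a lexer once the language grows.</p>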
