Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>Can a human do it?</p> <pre> farsidebag far sidebag farside bag far side bag </pre> <p>Not only do you have to use a dictionary, you might have to use a statistical approach to figure out what's most likely (or, god forbid, an actual HMM for your human language of choice...)</p> <p>For how to do statistics that might be helpful, I turn you to Dr. Peter Norvig, who addresses a different, but related problem of spell-checking <strong><em>in 21 lines of code</em></strong>: <a href="http://norvig.com/spell-correct.html" rel="noreferrer">http://norvig.com/spell-correct.html</a></p> <p>(he does cheat a bit by folding every for loop into a single line.. but still).</p> <p><strong><em>Update</em></strong> This got stuck in my head, so I had to birth it today. This code does a similar split to the one described by Robert Gamble, but then it orders the results based on word frequency in the provided dictionary file (which is now expected to be some text representative of your domain or English in general. I used big.txt from Norvig, linked above, and catted a dictionary to it, to cover missing words).</p> <p>A combination of two words will most of the time beat a combination of 3 words, unless the frequency difference is enormous.</p> <hr> <p><strong>I posted this code with some minor changes on my blog</strong></p> <p><a href="http://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/" rel="noreferrer">http://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/</a> and also wrote a little about the underflow bug in this code.. I was tempted to just quietly fix it, but figured this may help some folks who haven't seen the log trick before: <a href="http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/" rel="noreferrer">http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/</a></p> <hr> <p>Output on your words, plus a few of my own -- notice what happens with "orcore":</p> <pre> perl splitwords.pl big.txt words answerveal: 2 possibilities - answer veal - answer ve al wickedweather: 4 possibilities - wicked weather - wicked we at her - wick ed weather - wick ed we at her liquidweather: 6 possibilities - liquid weather - liquid we at her - li quid weather - li quid we at her - li qu id weather - li qu id we at her driveourtrucks: 1 possibilities - drive our trucks gocompact: 1 possibilities - go compact slimprojector: 2 possibilities - slim projector - slim project or orcore: 3 possibilities - or core - or co re - orc ore </pre> <p>Code:</p> <pre><code>#!/usr/bin/env perl use strict; use warnings; sub find_matches($); sub find_matches_rec($\@\@); sub find_word_seq_score(@); sub get_word_stats($); sub print_results($@); sub Usage(); our(%DICT,$TOTAL); { my( $dict_file, $word_file ) = @ARGV; ($dict_file &amp;&amp; $word_file) or die(Usage); { my $DICT; ($DICT, $TOTAL) = get_word_stats($dict_file); %DICT = %$DICT; } { open( my $WORDS, '&lt;', $word_file ) or die "unable to open $word_file\n"; foreach my $word (&lt;$WORDS&gt;) { chomp $word; my $arr = find_matches($word); local $_; # Schwartzian Transform my @sorted_arr = map { $_-&gt;[0] } sort { $b-&gt;[1] &lt;=&gt; $a-&gt;[1] } map { [ $_, find_word_seq_score(@$_) ] } @$arr; print_results( $word, @sorted_arr ); } close $WORDS; } } sub find_matches($){ my( $string ) = @_; my @found_parses; my @words; find_matches_rec( $string, @words, @found_parses ); return @found_parses if wantarray; return \@found_parses; } sub find_matches_rec($\@\@){ my( $string, $words_sofar, $found_parses ) = @_; my $length = length $string; unless( $length ){ push @$found_parses, $words_sofar; return @$found_parses if wantarray; return $found_parses; } foreach my $i ( 2..$length ){ my $prefix = substr($string, 0, $i); my $suffix = substr($string, $i, $length-$i); if( exists $DICT{$prefix} ){ my @words = ( @$words_sofar, $prefix ); find_matches_rec( $suffix, @words, @$found_parses ); } } return @$found_parses if wantarray; return $found_parses; } ## Just a simple joint probability ## assumes independence between words, which is obviously untrue ## that's why this is broken out -- feel free to add better brains sub find_word_seq_score(@){ my( @words ) = @_; local $_; my $score = 1; foreach ( @words ){ $score = $score * $DICT{$_} / $TOTAL; } return $score; } sub get_word_stats($){ my ($filename) = @_; open(my $DICT, '&lt;', $filename) or die "unable to open $filename\n"; local $/= undef; local $_; my %dict; my $total = 0; while ( &lt;$DICT&gt; ){ foreach ( split(/\b/, $_) ) { $dict{$_} += 1; $total++; } } close $DICT; return (\%dict, $total); } sub print_results($@){ #( 'word', [qw'test one'], [qw'test two'], ... ) my ($word, @combos) = @_; local $_; my $possible = scalar @combos; print "$word: $possible possibilities\n"; foreach (@combos) { print ' - ', join(' ', @$_), "\n"; } print "\n"; } sub Usage(){ return "$0 /path/to/dictionary /path/to/your_words"; } </code></pre>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload