Note that there are some explanatory texts on larger screens.

plurals
  1. POTagging similar sentences with lower time complexity than n^2
    primarykey
    data
    text
    <p>This is my first post, have been a lurker for a long time, so will try my best to explain myself here.</p> <p>I have been using lowest common substring method along with basic word match and substring match(regexp) for clustering similar stories on the net. But the problem is its time complexity is n^2 (I compare each title to all the others). I've done very basic optimizations like storing and skipping all the matched titles.</p> <p>What I want is some kind of preprocessing of the chunk of text so that for each iteration i reduce number of posts to match to. Any further optimizations are also welcome.</p> <p>Here are the functions i use for the same. the main function which calls them first calls word_match, if more than 70% of the word matches i further go down and call 'substring_match' and LCSubstr_len. The code is in Python, I can use C as well</p> <pre><code>import re def substring_match(a,b): try: c = re.match(a,b) return c if c else True if re.match(b,a) else False except: return False def LCSubstr_len(S, T): m = len(S); n = len(T) L = [[0] * (n+1) for i in xrange(m+1)] lcs = 0 for i in xrange(m): for j in xrange(n): if S[i] == T[j]: L[i+1][j+1] = L[i][j] + 1 lcs = max(lcs, L[i+1][j+1]) else: L[i+1][j+1] = max(L[i+1][j], L[i][j+1]) return lcs/((float(m+n)/2)) def word_match(str1,str2): matched = 0 try: str1,str2 = str(str1),str(str2) assert isinstance(str1,str) except: return 0.0 words1 = str1.split(None) words2 = str2.split(None) for i in words1: for j in words2: if i.strip() ==j.strip(): matched +=1 len1 = len(words1) len2 = len(words2) perc_match = float(matched)/float((len1+len2)/2) return perc_match </code></pre>
    singulars
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload