  1. PHP Extract Similar Parts from Multiple Strings
<p>I'm trying to extract the parts that are similar across multiple strings.</p>
<p>The purpose is to extract the title of a book from multiple OCR passes over its title page.</p>
<p>This applies only to the beginning of each string; the ends don't need to be trimmed and can stay as they are.</p>
<p>For example, my strings might be:</p>
<pre><code>$title[0]='the history of the internet, expanded and revised';
$title[1]='the history of the internet';
$title[2]='published by xyz publisher the historv of the internot, expanded and';
$title[3]='history of the internet';
</code></pre>
<p>So basically I want to trim each string so that it starts at the most probable starting point. Considering that there may be OCR errors (e.g. "historv", "internot"), I thought it might be best to take the number of characters in each word, which would give me an array for each string (so a multi-dimensional array) holding the length of each word. These can then be used to find running matches and trim the beginning of each string to the most likely starting point.</p>
<p>The strings should be cut to:</p>
<pre><code>$title[0]='the history of the internet, expanded and revised';
$title[1]='the history of the internet';
$title[2]='the historv of the internot, expanded and';
$title[3]='XXX history of the internet';
</code></pre>
<p>So I need to be able to recognize that "history of the internet" (7 2 3 8) is the run that matches all strings, and that the preceding "the" is most probably correct because it occurs in more than 50% of the strings. The beginning of each string is therefore trimmed to "the", and a placeholder of the same length is added to the string that is missing "the".</p>
<p>So far I have got:</p>
<pre><code>function CompareSimilarStrings($array) {
    $n = count($array);
    $len = [];
    // Get the length of each word in each string
    for ($run = 0; $run &lt; $n; $run++) {
        $temp = explode(' ', $array[$run]);
        foreach ($temp as $key =&gt; $val) {
            $len[$run][$key] = strlen($val);
        }
    }
    // TODO: find the running matches across $len
    for ($run = 0; $run &lt; $n; $run++) {
    }
}
</code></pre>
<p>As you can see, I'm stuck on finding the running matches.</p>
<p>Any ideas?</p>
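One possible way to find the running matches, as a minimal sketch rather than a definitive solution: search for the longest run of word lengths that occurs in every string, then walk backwards from that run and keep a preceding word only while more than half of the strings agree on its length. The function names (`wordLengths`, `findRun`, `alignTitles`) are hypothetical. The sketch also assumes that punctuation is stripped before counting, because a raw `strlen` would count "internet," as 9 and break the "7 2 3 8" run. It takes candidate runs from the first string only and needs PHP 7.4 or later.

```php
<?php
// Word lengths with surrounding punctuation stripped (an assumption, so that
// "internet," counts as 8, the same as "internet").
function wordLengths(array $words): array
{
    return array_map(fn($w) => strlen(trim($w, ",.;:!?\"'()")), $words);
}

// Offset of the first contiguous occurrence of $needle in $haystack, or -1.
function findRun(array $haystack, array $needle): int
{
    $size = count($needle);
    for ($i = 0; $i + $size <= count($haystack); $i++) {
        if (array_slice($haystack, $i, $size) === $needle) {
            return $i;
        }
    }
    return -1;
}

function alignTitles(array $titles): array
{
    $titles = array_values($titles);
    $n      = count($titles);
    $words  = array_map(fn($t) => preg_split('/\s+/', trim($t)), $titles);
    $lens   = array_map('wordLengths', $words);

    // 1. Longest run of word lengths found in every string, trying the
    //    longest candidates from the first string before shorter ones.
    $starts = [];
    $base   = $lens[0] ?? [];
    for ($size = count($base); $size > 0 && !$starts; $size--) {
        for ($from = 0; $from + $size <= count($base) && !$starts; $from++) {
            $run   = array_slice($base, $from, $size);
            $found = array_map(fn($l) => findRun($l, $run), $lens);
            if (!in_array(-1, $found, true)) {
                $starts = $found;
            }
        }
    }
    if (!$starts) {
        return $titles; // no common run, leave the input untouched
    }

    // 2. Walk left from the run. A preceding word is kept only if more than
    //    half of the strings have a word of that length in that position.
    $pos    = $starts;
    $prefix = array_fill(0, $n, []);
    while (true) {
        $votes = [];
        foreach ($pos as $i => $p) {
            if ($p > 0) {
                $len = $lens[$i][$p - 1];
                $votes[$len] = ($votes[$len] ?? 0) + 1;
            }
        }
        if (!$votes) {
            break;
        }
        arsort($votes);
        $best = array_key_first($votes);
        if ($votes[$best] * 2 <= $n) {
            break; // no majority, stop extending
        }
        foreach ($pos as $i => $p) {
            if ($p > 0 && $lens[$i][$p - 1] === $best) {
                array_unshift($prefix[$i], $words[$i][$p - 1]);
                $pos[$i]--;
            } else {
                // Missing or disagreeing word: pad with a placeholder.
                array_unshift($prefix[$i], str_repeat('X', $best));
            }
        }
    }

    // 3. Rebuild each string: accepted prefix words, then the run onwards.
    $out = [];
    foreach ($words as $i => $w) {
        $out[] = implode(' ', array_merge($prefix[$i], array_slice($w, $starts[$i])));
    }
    return $out;
}
```

Calling `alignTitles($title)` on the four sample strings should give the trimmed strings listed in the question, including the `XXX` placeholder for the last one. Comparing only word lengths tolerates substitution errors such as "historv", but not OCR errors that change a word's length or split or merge words; a per-word edit distance (for example PHP's `levenshtein`) might be a more forgiving comparison.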
 
