<p>A long time ago I wrote a script that downloads a PDF and converts it into text. This function does the conversion; it expects the raw PDF bytes as a string and only decodes zlib-compressed streams:</p>
<pre><code>function pdf2string($sourcefile) {
    // Despite its name, $sourcefile holds the raw PDF bytes here, not a path.
    $content = $sourcefile;
    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfText = '';
    $pos = 0;
    $pos2 = 0;
    $startpos = 0;

    while ($pos !== false &amp;&amp; $pos2 !== false) {
        $pos = strpos($content, $searchstart, $startpos);
        $pos2 = strpos($content, $searchend, $startpos + 1);

        if ($pos !== false &amp;&amp; $pos2 !== false) {
            // The stream data starts after the keyword and its
            // end-of-line marker (CRLF or LF).
            $start = $pos + strlen($searchstart);
            if (substr($content, $start, 2) === "\r\n") {
                $start += 2;
            } else if (substr($content, $start, 1) === "\n") {
                $start++;
            }

            // Drop the end-of-line marker that precedes 'endstream'.
            $end = $pos2;
            if (substr($content, $end - 2, 2) === "\r\n") {
                $end -= 2;
            } else if (substr($content, $end - 1, 1) === "\n") {
                $end--;
            }

            $textsection = substr($content, $start, $end - $start);

            // gzuncompress() returns false for streams that are not
            // zlib-compressed; pdfExtractText() then returns ''.
            $data = @gzuncompress($textsection);
            $pdfText .= pdfExtractText($data);

            $startpos = $pos2 + strlen($searchend) - 1;
        }
    }

    // Collapse runs of whitespace into single spaces.
    return preg_replace('/(\s)+/', ' ', $pdfText);
}
</code></pre>
<p>EDIT: <code>pdf2string()</code> calls <code>pdfExtractText()</code>, which is defined here:</p>
<pre><code>function pdfExtractText($psData) {
    if (!is_string($psData)) {
        return '';
    }

    $text = '';

    // Handle brackets in the text stream that could be mistaken for
    // the end of a text field. I'm sure you can do this as part of the
    // regular expression, but my skills aren't good enough yet.
    $psData = str_replace('\)', '##ENDBRACKET##', $psData);
    $psData = str_replace('\]', '##ENDSBRACKET##', $psData);

    preg_match_all(
        '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
        $psData,
        $matches
    );

    for ($i = 0; $i &lt; count($matches[0]); $i++) {
        if ($matches[3][$i] != '') {
            // Run another match over the contents.
            preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
            foreach ($subMatches[1] as $subMatch) {
                $text .= $subMatch;
            }
        } else if ($matches[4][$i] != '') {
            $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
        }
    }

    // Translate special characters and put back brackets.
    $trans = array(
        '...' =&gt; '…',
        '\205' =&gt; '…',
        '\221' =&gt; chr(145),
        '\222' =&gt; chr(146),
        '\223' =&gt; chr(147),
        '\224' =&gt; chr(148),
        '\226' =&gt; '-',
        '\267' =&gt; '•',
        '\374' =&gt; 'ü',
        '\344' =&gt; 'ä',
        '\247' =&gt; '§',
        '\366' =&gt; 'ö',
        '\337' =&gt; 'ß',
        '\334' =&gt; 'Ü',
        '\326' =&gt; 'Ö',
        '\304' =&gt; 'Ä',
        '\(' =&gt; '(',
        '\[' =&gt; '[',
        '##ENDBRACKET##' =&gt; ')',
        '##ENDSBRACKET##' =&gt; ']',
        chr(133) =&gt; '-',
        chr(141) =&gt; chr(147),
        chr(142) =&gt; chr(148),
        chr(143) =&gt; chr(145),
        chr(144) =&gt; chr(146),
    );
    $text = strtr($text, $trans);

    return $text;
}
</code></pre>
<p>EDIT2: To get the content from a local file whose path is in <code>$sourcefile</code>, use:</p>
<pre><code>$fp = fopen($sourcefile, 'rb');
$content = fread($fp, filesize($sourcefile));
fclose($fp);
</code></pre>
<p>EDIT3: Before saving the data to the database, I run it through an escape function:</p>
<pre><code>function escape($str) {
    $search = array("\\", "\0", "\n", "\r", "\x1a", "'", '"');
    $replace = array("\\\\", "\\0", "\\n", "\\r", "\Z", "\'", '\"');
    return str_replace($search, $replace, $str);
}
</code></pre>