  1. How to read or parse MHTML (.mht) files in Java
    <p>I need to mine the <strong>content</strong> of most well-known document file formats, such as:</p>
    <ol> <li>pdf</li> <li>html</li> <li>doc/docx etc.</li> </ol>
    <p>For most of these file formats I am planning to use:</p>
    <p><a href="http://tika.apache.org/" rel="noreferrer">http://tika.apache.org/</a></p>
    <p>But as of now <code>Tika</code> does not support MHTML (*.mht) files ( <a href="http://en.wikipedia.org/wiki/MHTML" rel="noreferrer">http://en.wikipedia.org/wiki/MHTML</a> ). There are a few examples in C# ( <a href="http://www.codeproject.com/KB/files/MhtBuilder.aspx" rel="noreferrer">http://www.codeproject.com/KB/files/MhtBuilder.aspx</a> ), but I found none in Java.</p>
    <p>I tried opening the *.mht file in 7Zip and it failed, although WinZip was able to decompress the file into images and text (CSS, HTML, script) as text and binary files.</p>
    <p>According to the MSDN page ( <a href="http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content" rel="noreferrer">http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content</a> ) and the <code>code project</code> page I mentioned earlier, mht files use GZip compression.</p>
    <p>Attempting to decompress the file in Java results in the following exceptions. With <code>java.util.zip.GZIPInputStream</code>:</p>
    <pre><code>java.io.IOException: Not in GZIP format
    at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
    at java.util.zip.GZIPInputStream.&lt;init&gt;(Unknown Source)
    at java.util.zip.GZIPInputStream.&lt;init&gt;(Unknown Source)
    at GZipTest.main(GZipTest.java:16)
</code></pre>
    <p>And with <code>java.util.zip.ZipFile</code>:</p>
    <pre><code>java.util.zip.ZipException: error in opening zip file
    at java.util.zip.ZipFile.open(Native Method)
    at java.util.zip.ZipFile.&lt;init&gt;(Unknown Source)
    at java.util.zip.ZipFile.&lt;init&gt;(Unknown Source)
    at GZipTest.main(GZipTest.java:21)
</code></pre>
    <p>Kindly suggest how to decompress it.</p>
    <p>Thanks.</p>
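<p>For illustration only: an .mht file is normally a plain-text MIME <code>multipart/related</code> message (RFC 2557) rather than a gzip or zip archive, which would be consistent with the two exceptions above. The sketch below splits such a file into its parts using only the JDK and decodes base64 and quoted-printable bodies. The names <code>MhtSketch</code>, <code>Part</code> and <code>parse</code> are hypothetical, and the code assumes a single, non-nested boundary; a full MIME library would likely be more robust for real-world files.</p>

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or any library: a minimal MHTML splitter.
public class MhtSketch {

    /** One MIME part: lower-cased header names mapped to values, plus the decoded body bytes. */
    public static final class Part {
        public final Map<String, String> headers;
        public final byte[] body;
        Part(Map<String, String> headers, byte[] body) { this.headers = headers; this.body = body; }
    }

    private static final Pattern BOUNDARY =
            Pattern.compile("boundary=\"?([^\";\\r\\n]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Splits MHTML text (read as ISO-8859-1 so bytes survive) into decoded parts. */
    public static List<Part> parse(String mht) {
        String text = mht.replace("\r\n", "\n");           // normalise line endings
        Matcher m = BOUNDARY.matcher(text);
        if (!m.find()) throw new IllegalArgumentException("no multipart boundary found");
        String delimiter = "\n--" + m.group(1);            // each part starts after this line
        List<Part> parts = new ArrayList<>();
        int pos = text.indexOf(delimiter);
        while (pos >= 0) {
            int start = pos + delimiter.length();
            if (text.startsWith("--", start)) break;       // closing delimiter "--boundary--"
            int next = text.indexOf(delimiter, start);
            parts.add(toPart(text.substring(start, next < 0 ? text.length() : next)));
            pos = next;
        }
        return parts;
    }

    private static Part toPart(String segment) {
        String s = segment.substring(segment.indexOf('\n') + 1);  // drop rest of delimiter line
        int split = s.indexOf("\n\n");                             // blank line ends the headers
        String head = split < 0 ? s : s.substring(0, split);
        String body = split < 0 ? "" : s.substring(split + 2);
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : head.replaceAll("\n[ \t]+", " ").split("\n")) {  // unfold, then split
            int colon = line.indexOf(':');
            if (colon > 0) {
                headers.put(line.substring(0, colon).trim().toLowerCase(Locale.ROOT),
                            line.substring(colon + 1).trim());
            }
        }
        String enc = headers.getOrDefault("content-transfer-encoding", "7bit").toLowerCase(Locale.ROOT);
        byte[] bytes;
        switch (enc) {
            case "base64":           bytes = Base64.getMimeDecoder().decode(body); break;
            case "quoted-printable": bytes = decodeQuotedPrintable(body); break;
            default:                 bytes = body.getBytes(StandardCharsets.ISO_8859_1);
        }
        return new Part(headers, bytes);
    }

    private static byte[] decodeQuotedPrintable(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '=') { out.write(c); continue; }
            if (i + 1 < s.length() && s.charAt(i + 1) == '\n') { i += 1; continue; }  // soft break
            if (i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) { out.write((hi << 4) | lo); i += 2; continue; }
            }
            out.write('=');                                // malformed escape: keep it literally
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));      // path to an .mht file
        for (Part p : parse(new String(raw, StandardCharsets.ISO_8859_1))) {
            System.out.println(p.headers.get("content-type") + " @ "
                    + p.headers.get("content-location") + " (" + p.body.length + " bytes)");
        }
    }
}
```

<p>Running <code>java MhtSketch page.mht</code> (with <code>page.mht</code> standing in for a real file) would list each part's content type, location and decoded size; the part whose content type is <code>text/html</code> could then be passed on to an HTML parser for text extraction.</p>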