StackOverflow2013


plurals

PO
primarykey
Id
3239442
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
3
CommunityOwnedDate
CreationDate
2010-07-13T16:51:29.053
FavoriteCount
0
LastActivityDate
2010-07-13T16:51:29.053
LastEditDate
LastEditorUserId
0
OwnerUserId
373861
ParentId
3230305
PostTypeId
2
Score
13
ViewCount
0
LastEditorDisplayName
text
Body
Frankly, I wasn't expecting a solution in the near future and was about to give up, but somehow I stumbled on these pages: <a href="http://en.wikipedia.org/wiki/MIME#Multipart_messages" rel="noreferrer">http://en.wikipedia.org/wiki/MIME#Multipart_messages</a> <a href="http://msdn.microsoft.com/en-us/library/ms527355%28EXCHG.10%29.aspx" rel="noreferrer">http://msdn.microsoft.com/en-us/library/ms527355%28EXCHG.10%29.aspx</a>

They are not very catchy at first look, but if you read them carefully you will get the clue. After reading them I fired up IE and started saving random pages as <code>*.mht</code> files.

Let me go line by line... But let me explain beforehand that my ultimate goal was to separate/extract the <code>html</code> content and parse it. The solution is not complete in itself, as it depends on the <code>character set</code> or <code>encoding</code> chosen while saving, but even so it will extract the individual files with minor hitches. I hope this will be useful for anyone who is trying to parse/decompress <code>*.mht/MHTML</code> files :)

======= Explanation ========

** Taken from an mht file **

<pre><code>From: "Saved by Windows Internet Explorer 7"
</code></pre>

This is the software used for saving the file.

<pre><code>Subject: Google
Date: Tue, 13 Jul 2010 21:23:03 +0530
MIME-Version: 1.0
</code></pre>

Subject, date and MIME version … much like the mail format.

<pre><code>Content-Type: multipart/related; type="text/html";
</code></pre>

This is the part which tells us that it is a <code>multipart</code> document. A multipart document has one or more different sets of data combined in a single body, and a <code>multipart</code> Content-Type field must appear in the entity's header. Here, we can also see the type as <code>"text/html"</code>.

<pre><code>boundary="----=_NextPart_000_0007_01CB22D1.93BBD1A0"
</code></pre>

Of all these header lines, this is the most important one.
This is the unique delimiter which separates the different parts (html, images, css, scripts etc.). Once you get hold of this, everything gets easy... Now I just have to iterate through the document, find the different sections and save them according to their <code>Content-Transfer-Encoding</code> (base64, quoted-printable etc.).

SAMPLE

<pre><code>------=_NextPart_000_0007_01CB22D1.93BBD1A0
Content-Type: text/html;
    charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Location: http://www.google.com/webhp?sourceid=navclient&ie=UTF-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" =
.
.
.
</code></pre>

** JAVA CODE **

An interface for defining constants.

<pre><code>public interface IConstants {
    public String BOUNDARY = "boundary";
    public String CHAR_SET = "charset";
    public String CONTENT_TYPE = "Content-Type";
    public String CONTENT_TRANSFER_ENCODING = "Content-Transfer-Encoding";
    public String CONTENT_LOCATION = "Content-Location";

    public String UTF8_BOM = "=EF=BB=BF";
    public String UTF16_BOM1 = "=FF=FE";
    public String UTF16_BOM2 = "=FE=FF";
}
</code></pre>

The main parser class...

<pre><code>/**
 * This program and the accompanying materials are made available under the terms of the Eclipse Public License v1.0
 * which accompanies this distribution, and is available at
 * http://www.eclipse.org/legal/epl-v10.html
 */
package com.test.mht.core;

import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Parses a *.mht file and decomposes it into its constituent parts.
 * @author Manish Shukla
 */
public class MHTParser implements IConstants {
    private File mhtFile;
    private File outputFolder;

    public MHTParser(File mhtFile, File outputFolder) {
        this.mhtFile = mhtFile;
        this.outputFolder = outputFolder;
    }

    /**
     * Splits the file at every boundary line and writes each part to the output folder.
     * Header attributes are matched per line, so 'charset' and 'boundary' are expected
     * on their own (folded) lines.
     * @throws Exception
     */
    public void decompress() throws Exception {
        BufferedReader reader = null;
        String type = "";
        String encoding = "";
        String location = "";
        String filename = "";
        String charset = "utf-8";
        StringBuilder buffer = null;

        try {
            reader = new BufferedReader(new FileReader(mhtFile));

            final String boundary = getBoundary(reader);
            if (boundary == null)
                throw new Exception("Failed to find document 'boundary'... Aborting");

            String line = null;
            while ((line = reader.readLine()) != null) {
                String temp = line.trim();
                if (temp.contains(boundary)) {
                    // A boundary line closes the previous part (if any) and opens a new one
                    if (buffer != null) {
                        writeBufferContentToFile(buffer, encoding, filename, charset);
                    }
                    buffer = new StringBuilder();
                } else if (temp.startsWith(CONTENT_TYPE)) {
                    type = getType(temp);
                } else if (temp.startsWith(CHAR_SET)) {
                    charset = getCharSet(temp);
                } else if (temp.startsWith(CONTENT_TRANSFER_ENCODING)) {
                    encoding = getEncoding(temp);
                } else if (temp.startsWith(CONTENT_LOCATION)) {
                    location = temp.substring(temp.indexOf(":") + 1).trim();
                    filename = getFileName(location, type);
                } else {
                    // Anything that is not a recognised header belongs to the current part's body
                    if (buffer != null) {
                        buffer.append(line + "\n");
                    }
                }
            }
        } finally {
            if (null != reader)
                reader.close();
        }
    }

    /** Extracts the value from a line such as charset="utf-8" */
    private String getCharSet(String temp) {
        String t = temp.split("=")[1].trim();
        return t.substring(1, t.length() - 1);
    }

    /**
     * Save the file as per character set and encoding
     */
    private void writeBufferContentToFile(StringBuilder buffer, String encoding,
            String filename, String charset) throws Exception {

        if (!outputFolder.exists())
            outputFolder.mkdirs();

        byte[] content = null;
        boolean text = true;

        if (encoding.equalsIgnoreCase("base64")) {
            content = getBase64EncodedString(buffer);
            text = false;
        } else if (encoding.equalsIgnoreCase("quoted-printable")) {
            content = getQuotedPrintableString(buffer);
        } else
            content = buffer.toString().getBytes();

        // try-with-resources closes the stream even if opening or writing fails
        if (!text) {
            try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(filename))) {
                bos.write(content);
            }
        } else {
            try (BufferedWriter bw = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(filename), charset))) {
                bw.write(new String(content));
            }
        }
    }

    /**
     * When the *.mht file is saved with 'utf-8' encoding, the content starts with '=EF=BB=BF'
     * (the byte order mark in quoted-printable form). This method strips that marker and joins
     * soft line breaks ("=" at the end of a line); other "=XX" sequences are left undecoded.
     * @see http://en.wikipedia.org/wiki/Byte_order_mark
     */
    private byte[] getQuotedPrintableString(StringBuilder buffer) {
        //Set<String> uniqueHex = new HashSet<String>();
        //final Pattern p = Pattern.compile("(=\\p{XDigit}{2})*");

        String temp = buffer.toString().replaceAll(UTF8_BOM, "").replaceAll("=\n", "");

        //Matcher m = p.matcher(temp);
        //while(m.find()) {
        //    uniqueHex.add(m.group());
        //}

        //System.out.println(uniqueHex);

        //for (String hex : uniqueHex) {
        //    temp = temp.replaceAll(hex, getASCIIValue(hex.substring(1)));
        //}

        return temp.getBytes();
    }

    /*private String getASCIIValue(String hex) {
        return ""+(char)Integer.parseInt(hex, 16);
    }*/

    /**
     * Decodes a base64 body. java.util.Base64 (Java 8+) is used instead of the
     * JDK-internal sun.misc.BASE64Decoder, which is not available on newer JDKs.
     * The MIME decoder ignores the line breaks inside the body.
     */
    private byte[] getBase64EncodedString(StringBuilder buffer) throws Exception {
        return Base64.getMimeDecoder().decode(buffer.toString());
    }

    /**
     * Tries to get a qualified file name. If the name is not apparent it tries to guess it from the URL.
     * Otherwise it returns 'unknown.<type>'
     */
    private String getFileName(String location, String type) {
        final Pattern p = Pattern.compile("(\\w|_|-)+\\.\\w+");
        String ext = "";
        String name = "";
        if (type.toLowerCase().endsWith("jpeg"))
            ext = "jpg";
        else
            ext = type.split("/")[1];

        if (location.endsWith("/")) {
            name = "main";
        } else {
            name = location.substring(location.lastIndexOf("/") + 1);

            // Keep the last "name.ext" looking token found in the URL's final segment
            Matcher m = p.matcher(name);
            String fname = "";
            while (m.find()) {
                fname = m.group();
            }

            if (fname.trim().length() == 0)
                name = "unknown";
            else
                return getUniqueName(fname.substring(0, fname.indexOf(".")),
                        fname.substring(fname.indexOf(".") + 1, fname.length()));
        }
        return getUniqueName(name, ext);
    }

    /**
     * Returns a qualified unique output file path for the parsed path.
     * In case the file already exists it appends a numerical value and continues
     */
    private String getUniqueName(String name, String ext) {
        int i = 1;
        File file = new File(outputFolder, name + "." + ext);
        if (file.exists()) {
            while (true) {
                file = new File(outputFolder, name + i + "." + ext);
                if (!file.exists())
                    return file.getAbsolutePath();
                i++;
            }
        }

        return file.getAbsolutePath();
    }

    private String getType(String line) {
        return splitUsingColonSpace(line);
    }

    private String getEncoding(String line) {
        return splitUsingColonSpace(line);
    }

    private String splitUsingColonSpace(String line) {
        return line.split(":\\s*")[1].replaceAll(";", "");
    }

    /**
     * Gives you the boundary string
     */
    private String getBoundary(BufferedReader reader) throws Exception {
        String line = null;

        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.startsWith(BOUNDARY)) {
                return line.substring(line.indexOf("\"") + 1, line.lastIndexOf("\""));
            }
        }

        return null;
    }
}
</code></pre>

Regards,
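The quoted-printable handling in the parser only strips the BOM marker and the soft line breaks, so sequences such as <code>=3D</code> stay in the output. Below is a minimal sketch of how the boundary lookup and the two transfer encodings could be handled with the standard library only. The class <code>MhtDecodeSketch</code> and its method names are hypothetical helpers made up for illustration, not part of the parser above, and the sketch assumes Java 8+ (for <code>java.util.Base64</code>) and an ASCII part body, which is what both transfer encodings are meant to produce.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of MHTParser. */
class MhtDecodeSketch {

    // Matches boundary="..." anywhere in the header text
    private static final Pattern BOUNDARY = Pattern.compile("boundary=\"([^\"]+)\"");

    /** Returns the boundary token from the header text, or null if there is none. */
    static String findBoundary(String header) {
        Matcher m = BOUNDARY.matcher(header);
        return m.find() ? m.group(1) : null;
    }

    /** Decodes a base64 body; the MIME decoder skips line breaks. */
    static byte[] decodeBase64(String body) {
        return Base64.getMimeDecoder().decode(body);
    }

    private static boolean isHex(char c) {
        return Character.digit(c, 16) >= 0;
    }

    /**
     * Decodes quoted-printable text: "=XX" becomes the byte 0xXX and
     * "=" directly before a line break is a soft break that is dropped.
     * Assumes the input is ASCII.
     */
    static byte[] decodeQuotedPrintable(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '=' && i + 1 < body.length()) {
                char next = body.charAt(i + 1);
                if (next == '\n') {                       // soft break, LF
                    i += 2;
                    continue;
                }
                if (next == '\r' && i + 2 < body.length()
                        && body.charAt(i + 2) == '\n') {  // soft break, CRLF
                    i += 3;
                    continue;
                }
                if (i + 2 < body.length() && isHex(next) && isHex(body.charAt(i + 2))) {
                    out.write(Integer.parseInt(body.substring(i + 1, i + 3), 16));
                    i += 3;
                    continue;
                }
            }
            out.write(c);                                 // literal character
            i++;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String header = "Content-Type: multipart/related; type=\"text/html\";\n"
                + "\tboundary=\"----=_NextPart_000_0007_01CB22D1.93BBD1A0\"";
        String boundary = findBoundary(header);
        // In the file body each part starts with "--" followed by the boundary token
        System.out.println("--" + boundary);
        System.out.println(new String(decodeQuotedPrintable("a=3Db=\nc"), StandardCharsets.UTF_8));
        System.out.println(new String(decodeBase64("aGVs\r\nbG8="), StandardCharsets.UTF_8));
    }
}
```

Decoding to bytes first and applying the part's charset afterwards keeps multi-byte sequences (for example the UTF-8 BOM <code>=EF=BB=BF</code>) intact, which a character-by-character replacement would not.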
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POHow to read or parse MHTML (.mht) files in java
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USFavonius
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POHow to read or parse MHTML (.mht) files in java
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTAcceptedByOriginator
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. COThank you for sharing this. I have a question though: How do you determine the end of a section HEADER? I mean, the section starts with the boundary defined somewhere above, then there come header stuff like type/encoding/location etc, and after this it seems to me that every header is delimited by ASCII 0D0A0D0A (two times CR+LF). Am I right about this, or is there any other guaranteed delimiter?
 singulars
 PostPostId
 PO
 UserUserId
 USHasib. A. Samad
