Decode a file stream using UTF-8
<p>I have an XML document that is very large (about 120 MB), and I do not want to load it into memory all at once. My goal is to check whether the file is valid UTF-8.</p>
<p>Any ideas for a quick check that does not read the whole file into memory as a <code>byte[]</code>?</p>
<p>I am using VSTS 2008 and C#.</p>
<p>When I load an XML document that contains invalid byte sequences with <code>XmlDocument</code>, an exception is thrown. But when I read the whole content into a byte array and then decode it as UTF-8, there is no exception. Any ideas why?</p>
<p>Here is a screenshot showing the content of my XML file; you can also download a copy of the file from <a href="http://www.filefactory.com/file/ag00da3/n/a_xml" rel="nofollow noreferrer">here</a>.</p>
<p><a href="https://i.stack.imgur.com/cPlhO.jpg" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/cPlhO.jpg" alt="enter image description here"></a></p>
<p><strong>EDIT 1:</strong></p>
<pre><code>using System;
using System.IO;
using System.Text;
using System.Xml;

class Program
{
    // Reads the entire file into a byte array.
    public static byte[] RawReadingTest(string fileName)
    {
        byte[] buff = null;
        try
        {
            using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
            using (BinaryReader br = new BinaryReader(fs))
            {
                long numBytes = new FileInfo(fileName).Length;
                buff = br.ReadBytes((int)numBytes);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
        return buff;
    }

    // Loading the file with XmlDocument throws on the invalid byte sequences.
    static void XMLTest()
    {
        try
        {
            XmlDocument xDoc = new XmlDocument();
            xDoc.Load("c:\\abc.xml");
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }

    static void Main()
    {
        try
        {
            XMLTest();

            // Decoding the raw bytes as UTF-8 does not throw.
            Encoding ae = Encoding.GetEncoding("utf-8");
            string filename = "c:\\abc.xml";
            ae.GetString(RawReadingTest(filename));
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
</code></pre>
<p><strong>EDIT 2:</strong> With <code>new UTF8Encoding(true, true)</code> an exception is thrown, but with <code>new UTF8Encoding(false, true)</code> no exception is thrown. I am confused: the second parameter should be the one that controls whether an exception is thrown on invalid byte sequences, so why does the first parameter matter?</p>
<pre><code>public static void TestTextReader2()
{
    try
    {
        // Create an instance of StreamReader to read from a file.
        // The using statement also closes the StreamReader.
        using (StreamReader sr = new StreamReader(
            "c:\\a.xml",
            new UTF8Encoding(true, true)))
        {
            int bufferSize = 10 * 1024 * 1024; // could be anything
            char[] buffer = new char[bufferSize];

            // Read from the file until the end of the file is reached.
            int actualsize = sr.Read(buffer, 0, bufferSize);
            while (actualsize &gt; 0)
            {
                actualsize = sr.Read(buffer, 0, bufferSize);
            }
        }
    }
    catch (Exception e)
    {
        // Let the user know what went wrong.
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}
</code></pre>