StackOverflow2013

<p>I realize this is really belated, but I just stumbled onto this incredible flaw in <code>StreamReader</code> myself: you can't reliably seek when using <code>StreamReader</code>. Personally, my specific need is to read characters but then "back up" if a certain condition is met; it's a side effect of one of the file formats I'm parsing.</p>
<p>Using <code>ReadLine()</code> isn't an option because it's only useful in really trivial parsing jobs. I have to support configurable record/line delimiter sequences and escape delimiter sequences. Also, I don't want to implement my own buffer just to support "backing up" and escape sequences; that should be the <code>StreamReader</code>'s job.</p>
<p>This method calculates the actual position in the underlying stream of bytes on demand. It works for UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, and any single-byte encoding (e.g. code pages 1252, 437, 28591, etc.), regardless of the presence of a preamble/BOM. This version will not work for UTF-7, Shift-JIS, or other variable-byte encodings.</p>
<p>When I need to seek to an arbitrary position in the underlying stream, I directly set <code>BaseStream.Position</code> and then call <code>DiscardBufferedData()</code> to get <code>StreamReader</code> back in sync for the next <code>Read()</code>/<code>Peek()</code> call.</p>
<p>And a friendly reminder: don't arbitrarily set <code>BaseStream.Position</code>.
If you bisect a character, you'll invalidate the next <code>Read()</code> and, for UTF-16/-32, you'll also invalidate the result of this method.</p>
<pre><code>// Requires: using System.IO; (System.Reflection is referenced by its full name below.)
public static long GetActualPosition(StreamReader reader)
{
    System.Reflection.BindingFlags flags = System.Reflection.BindingFlags.DeclaredOnly | System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.GetField;

    // The current buffer of decoded characters
    char[] charBuffer = (char[])reader.GetType().InvokeMember("charBuffer", flags, null, reader, null);

    // The index of the next char to be read from charBuffer
    int charPos = (int)reader.GetType().InvokeMember("charPos", flags, null, reader, null);

    // The number of decoded chars presently used in charBuffer
    int charLen = (int)reader.GetType().InvokeMember("charLen", flags, null, reader, null);

    // The current buffer of read bytes (byteBuffer.Length = 1024; this is critical).
    byte[] byteBuffer = (byte[])reader.GetType().InvokeMember("byteBuffer", flags, null, reader, null);

    // The number of bytes read while advancing reader.BaseStream.Position to (re)fill charBuffer
    int byteLen = (int)reader.GetType().InvokeMember("byteLen", flags, null, reader, null);

    // The number of bytes the remaining chars use in the original encoding.
    int numBytesLeft = reader.CurrentEncoding.GetByteCount(charBuffer, charPos, charLen - charPos);

    // For variable-byte encodings, deal with partial chars at the end of the buffer
    int numFragments = 0;
    if (byteLen > 0 && !reader.CurrentEncoding.IsSingleByte)
    {
        if (reader.CurrentEncoding.CodePage == 65001) // UTF-8
        {
            byte byteCountMask = 0;
            while ((byteBuffer[byteLen - numFragments - 1] >> 6) == 2) // if the byte is "10xx xxxx", it's a continuation-byte
                byteCountMask |= (byte)(1 << ++numFragments); // count bytes & build the "complete char" mask
            if ((byteBuffer[byteLen - numFragments - 1] >> 6) == 3) // if the byte is "11xx xxxx", it starts a multi-byte char.
                byteCountMask |= (byte)(1 << ++numFragments); // count bytes & build the "complete char" mask
            // see if we found as many bytes as the leading-byte says to expect
            if (numFragments > 1 && ((byteBuffer[byteLen - numFragments] >> 7 - numFragments) == byteCountMask))
                numFragments = 0; // no partial-char in the byte-buffer to account for
        }
        else if (reader.CurrentEncoding.CodePage == 1200) // UTF-16LE
        {
            // High surrogates are 0xD800-0xDBFF, so the high byte must be 0xD8-0xDB.
            // (A plain ">= 0xd8" would also match low surrogates and 0xE000-0xFFFF.)
            if ((byteBuffer[byteLen - 1] & 0xFC) == 0xD8) // high-surrogate
                numFragments = 2; // account for the partial character
        }
        else if (reader.CurrentEncoding.CodePage == 1201) // UTF-16BE
        {
            if (byteLen >= 2 && (byteBuffer[byteLen - 2] & 0xFC) == 0xD8) // high-surrogate
                numFragments = 2; // account for the partial character
        }
    }
    return reader.BaseStream.Position - numBytesLeft - numFragments;
}
</code></pre>
<p>Of course, this uses Reflection to get at private variables, so there is risk involved. However, this method works with .NET 2.0, 3.0, 3.5, 4.0, 4.0.3, 4.5, 4.5.1, 4.5.2, 4.6, and 4.6.1. Beyond that risk, the only other critical assumption is that the underlying byte-buffer is a <code>byte[1024]</code>; if Microsoft changes it the wrong way, the method breaks for UTF-16/-32.</p>
<p>This has been tested against a UTF-8 file filled with <code>Ažテ𣘺</code> (10 bytes: <code>0x41 C5 BE E3 83 86 F0 A3 98 BA</code>) and a UTF-16 file filled with <code>A𐐷</code> (6 bytes: <code>0x41 00 01 D8 37 DC</code>). The point is to force characters to fragment along the <code>byte[1024]</code> boundaries in all the different ways they can.</p>
<p><strong>UPDATE (2013-07-03)</strong>: I fixed the method, which originally used the broken code from that other answer. This version has been tested against data containing characters that require surrogate pairs. The data was put into 3 files, each with a different encoding: one UTF-8, one UTF-16LE, and one UTF-16BE.</p>
<p><strong>UPDATE (2016-02)</strong>: The only correct way to handle bisected characters is to directly interpret the underlying bytes.
UTF-8 is properly handled, and UTF-16/-32 work (given the length of byteBuffer).</p>
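<p>For illustration, here is a minimal sketch of the seek-and-resync step described above (set <code>BaseStream.Position</code>, then call <code>DiscardBufferedData()</code>). It assumes a seekable in-memory stream and ASCII-only sample data, so the byte offset of the second record is known to be a character boundary; the class name <code>SeekSketch</code> and the sample records are hypothetical and not part of the method above.</p>
<pre><code>using System;
using System.IO;
using System.Text;

public static class SeekSketch
{
    public static void Main()
    {
        // Hypothetical sample data: two newline-terminated records.
        string firstRecord = "key=value\n";
        byte[] data = Encoding.UTF8.GetBytes(firstRecord + "next=line\n");

        // Byte offset where the second record starts; this is a character boundary.
        long mark = Encoding.UTF8.GetByteCount(firstRecord);

        using (var reader = new StreamReader(new MemoryStream(data), Encoding.UTF8))
        {
            // Read both records; the reader has now buffered the whole stream.
            reader.ReadLine();
            reader.ReadLine();

            // "Back up": move the underlying stream, then drop the stale buffers
            // so the next read decodes from the new position.
            reader.BaseStream.Position = mark;
            reader.DiscardBufferedData();

            string again = reader.ReadLine();
            Console.WriteLine(again); // expected to print "next=line"
        }
    }
}
</code></pre>
<p>In real parsing code the offset would typically come from a position captured earlier (for example with <code>GetActualPosition</code>) rather than from a known string length; skipping <code>DiscardBufferedData()</code> would leave the reader serving characters from its old buffer.</p>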
