<h1>RegEx to parse or validate Base64 data</h1>
<p>Is it possible to use a RegEx to validate or sanitize Base64 data? That's the simple question, but the factors that drive it are what make it difficult.</p>
<p>I have a Base64 decoder that cannot fully rely on the input data to follow the RFC specs. So the issues I face are things like Base64 data that may not be broken up into lines of at most 76 characters (the limit the MIME RFCs set for encoded lines), or lines that may not end in CRLF: they may end in only a CR, or an LF, or maybe neither.</p>
<p>So, I've had a hell of a time parsing Base64 data formatted as such. Because of this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.</p>
<pre><code>Content-Transfer-Encoding: base64

VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
</code></pre>
<p>OK, so parsing that is no problem, and the result is exactly what we would expect. In 99% of cases, any code that at least verifies that each char in the buffer is a valid Base64 char works perfectly. But the next example throws a wrench into the mix.</p>
<pre><code>Content-Transfer-Encoding: base64

http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
</code></pre>
<p>This is a version of Base64 encoding that I have seen in some viruses and other things that attempt to take advantage of some mail readers' desire to parse MIME at all costs, versus ones that go strictly by the book, or rather by the RFC.</p>
<p>My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!</p>
<pre><code>[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8
</code></pre>
<p>Does anyone have a good way to solve both problems at once?
I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied and comparing the results. However, if you took that approach, which output do you trust? It seems that ASCII heuristics are about the <em>best</em> solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64 and what isn't?</p>
<hr>
<h2><strong>UPDATE:</strong></h2>
<p>Due to the number of views this question continues to get, I've decided to post the simple RegEx that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by <a href="https://stackoverflow.com/users/53114/gumbo/" title="Gumbo">Gumbo</a> the best, which is why I picked it as the selected answer. But for anyone using C# and looking for a very quick way to at least detect whether a string or byte[] contains valid Base64 data or not, I've found the following to work very well for me.</p>
<pre><code>[^-A-Za-z0-9+/=]|=[^=]|={3,}$
</code></pre>
<p>And yes, this is just for a <em>STRING</em> of Base64 data, NOT a properly formatted <a href="http://tools.ietf.org/html/rfc1341" rel="noreferrer" title="RFC1341">RFC1341</a> message. So, if you are dealing with data of this type, please take that into account before attempting to use the above RegEx.
If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML Encoding, etc.), then it is <em>highly</em> recommended that you read <a href="http://tools.ietf.org/html/rfc4648" rel="noreferrer" title="RFC4648">RFC4648</a>, which <a href="https://stackoverflow.com/users/53114/gumbo/" title="Gumbo">Gumbo</a> mentioned in his answer, as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.</p>
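<p>For readers who want to try the pattern from the update, here is a minimal sketch in Python rather than the C# the update mentions. The helper name <code>looks_like_base64</code> and the sample strings are illustrative assumptions, not part of the original code. The pattern describes what should <em>not</em> appear in the string, so a match means the input is rejected.</p>

```python
import re

# Pattern copied verbatim from the update. It matches problems:
#   [^-A-Za-z0-9+/=]  any character outside the accepted set
#   =[^=]             a '=' followed by something other than '='
#   ={3,}$            three or more '=' at the end of the string
INVALID_BASE64 = re.compile(r"[^-A-Za-z0-9+/=]|=[^=]|={3,}$")


def looks_like_base64(text: str) -> bool:
    """Return True when the pattern finds nothing to object to.

    Hypothetical helper for illustration only. It checks a bare string,
    not a MIME message with headers and line breaks.
    """
    return INVALID_BASE64.search(text) is None


if __name__ == "__main__":
    samples = [
        # Base64 body from the first example in the question
        "VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu",
        # The injected line from the second example (':' and '.' are rejected)
        "http://www.stackoverflow.com",
        # Padding cases
        "QQ==",
        "QQ===",
        "Q=Q=",
    ]
    for sample in samples:
        verdict = "accepted" if looks_like_base64(sample) else "rejected"
        print(f"{verdict}: {sample}")
```

<p>A few things follow from the pattern as written and are worth checking against your own data: it accepts a hyphen, it does not check that the length is a multiple of four, and it rejects CR and LF, so a line-wrapped body would need its line breaks stripped before the test.</p>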