<p>First of all, note that there are no functions in Lua's <code>string</code> library that know anything about Unicode or multibyte encodings (source: Programming in Lua, 3rd edition). As far as Lua is concerned, strings are simply sequences of bytes. If you are using UTF-8 encoded strings, it's up to you to figure out which bytes make up a character. Therefore, <code>string.len</code> gives you the number of <em>bytes</em>, not the number of <em>characters</em>, and <code>string.sub</code> gives you a substring of <em>bytes</em>, not a substring of <em>characters</em>.</p>
<p><strong>Some UTF-8 basics:</strong></p>
<p>If you need a refresher on the conceptual basics of Unicode, you should check out <a href="http://www.joelonsoftware.com/articles/Unicode.html">this article</a>.</p>
<p>UTF-8 is one possible (and very important) encoding of Unicode, and probably the one you are dealing with. Unlike UTF-32, which always uses 4 bytes per character, it uses a variable number of bytes (from 1 to 4) to encode each character. In particular, the ASCII characters 0 to 127 are represented with a single byte, so that ASCII strings can be correctly interpreted as UTF-8 (and vice versa, if you only use those 128 characters). All other characters start with a byte in the range from 194 to 244 (which signals that more bytes follow to encode a full character). This range is further subdivided, so that you can tell from this byte whether 1, 2 or 3 more bytes follow. Those additional bytes are called continuation bytes and are guaranteed to be taken only from the range from 128 to 191.
Therefore, by looking at a single byte we know where it stands in a character:</p>
<ul>
<li>If it's in <code>[0,127]</code>, it's a single-byte (ASCII) character</li>
<li>If it's in <code>[128,191]</code>, it's part of a longer character and meaningless on its own</li>
<li>If it's in <code>[194,244]</code>, it marks the beginning of a longer character (and tells us how long that character is)</li>
</ul>
<p>The remaining byte values (192, 193 and 245 to 255) never occur in valid UTF-8. This information is enough to count characters, split a UTF-8 string into characters and do all sorts of other UTF-8-sensitive manipulations.</p>
<p><strong>Some pattern matching basics:</strong></p>
<p>For the task at hand we need a few of Lua's pattern matching constructs:</p>
<p><code>[...]</code> is a character class that matches a single character (or rather <em>byte</em>) out of those inside the class. E.g. <code>[abc]</code> matches either <code>a</code>, <code>b</code> or <code>c</code>. You can define ranges using a hyphen. Therefore <code>[\33-\127]</code>, for example, matches any single one of the bytes from <code>33</code> to <code>127</code>. Note that <code>\127</code> is an escape sequence you can use in <em>any</em> Lua string (not just patterns) to specify a byte by its numerical value instead of the corresponding ASCII character. For instance, <code>"a"</code> is the same as <code>"\97"</code>.</p>
<p>You can negate a character class by starting it with <code>^</code> (so that it matches any single byte that is <em>not</em> part of the class).</p>
<p><code>*</code> repeats the previous token 0 or more times (arbitrarily many times, as often as possible).</p>
<p><code>$</code> is an anchor.
If it's the last character of the pattern, the pattern will only match at the end of the string.</p>
<p><strong>Combining all of that...</strong></p>
<p>...your problem reduces to a one-liner:</p>
<pre><code>local function lastChar(s)
    -- one non-continuation byte, then its continuation bytes, at the end of s
    return string.match(s, "[^\128-\191][\128-\191]*$")
end
</code></pre>
<p>This will match a byte that is not a UTF-8 continuation byte (i.e., one that is either a single-byte character or a byte that marks the beginning of a longer character). Then it matches an arbitrary number of continuation bytes (this cannot go past the current character, due to the range chosen), followed by the end of the string (<code>$</code>). Therefore, this will give you all the bytes that make up the last character in the string. It produces the desired output for all 4 of your examples.</p>
<p>Equivalently, you can use <code>gsub</code> to remove that last character from your string:</p>
<pre><code>function deleteLastCharacter(s)
    -- the parentheses discard gsub's second return value (the substitution count)
    return (string.gsub(s, "[^\128-\191][\128-\191]*$", ""))
end
</code></pre>
<p>The match is the same, but instead of returning the matched substring, we replace it with <code>""</code> (i.e. remove it) and return the modified string.</p>
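<p>The same byte ranges are also enough to count and split characters, as mentioned in the UTF-8 basics. Here is a minimal sketch, assuming the input is valid UTF-8; the helper names <code>utf8Length</code> and <code>utf8Chars</code> are made up for illustration and are not part of Lua's standard library:</p>
<pre><code>-- Count characters by counting the bytes that are NOT continuation bytes:
-- every character contains exactly one such byte (its first one).
local function utf8Length(s)
    local _, count = string.gsub(s, "[^\128-\191]", "")
    return count
end

-- Split a string into a table of characters: each match is one ASCII or
-- lead byte followed by all of its continuation bytes.
local function utf8Chars(s)
    local chars = {}
    for ch in string.gmatch(s, "[^\128-\191][\128-\191]*") do
        chars[#chars + 1] = ch
    end
    return chars
end

local s = "h\195\169llo"   -- "hello" with an e-acute: 6 bytes, 5 characters
print(#s)                  -- 6
print(utf8Length(s))       -- 5
print(#utf8Chars(s))       -- 5
</code></pre>
<p>Both helpers reuse the character class from <code>lastChar</code>, so they share its limitation: they do not validate the input, and malformed byte sequences will give misleading results.</p>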