Some observations on your solution:

1. `u` at the end of your pattern means that the **pattern**, and not the text it is matching, will be interpreted as UTF-8 (I presume you assumed the latter?).
2. `\w` matches the underscore character. You explicitly include the underscore for filenames, which suggests you don't want it in URLs, yet the URL pattern you have will still permit underscores.
3. The inclusion of "foreign UTF-8" appears to be locale-dependent, and it isn't clear whether that is the locale of the server or of the client. From the PHP docs:

> A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

### Creating the slug

You probably shouldn't include accented and similar characters in your post slug: technically they should be percent-encoded (per URL encoding rules), so you'll end up with ugly-looking URLs.

So, if I were you, after lowercasing I'd convert any 'special' characters to their closest equivalent (e.g. é -> e) and replace non-`[a-z]` characters with `-`, limiting runs to a single `-` as you've done; a rough sketch of this approach follows at the end of this answer. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug

### Sanitization in general

OWASP have a PHP implementation of their Enterprise Security API which, among other things, includes methods for safely encoding and decoding input and output in your application.

The Encoder interface provides:

```
canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)
```

https://github.com/OWASP/PHP-ESAPI
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API
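To make the slug suggestion concrete, here is a minimal sketch of that approach. It assumes ext/intl is available for the transliteration step, and `slugify()` is just an illustrative helper name, not something from your code or from ESAPI.

```php
<?php

// Minimal slug sketch (illustrative only; assumes ext/intl is installed).
function slugify(string $title): string
{
    // Transliterate accented/foreign characters to their closest ASCII
    // equivalents (é -> e); transliterator_transliterate() returns false
    // on failure, so fall back to the raw title in that case.
    $ascii = transliterator_transliterate('Any-Latin; Latin-ASCII', $title);
    if ($ascii === false) {
        $ascii = $title;
    }

    // Lowercase, then replace every run of characters outside [a-z0-9]
    // with a single '-'. Using an explicit class rather than \w also
    // avoids accidentally keeping underscores (observation 2 above).
    $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower($ascii));

    // Trim separators left over from leading/trailing punctuation.
    return trim($slug, '-');
}

echo slugify('Él está listo, ¿de acuerdo?'), PHP_EOL; // el-esta-listo-de-acuerdo
```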
**Comment 1:** You are correct about my assumption regarding the `u` modifier: I thought it applied to the text being matched. I had also forgotten that `\w` matches the underscore. I would normally convert all accented characters to ASCII, but I want this to work for other languages as well. I was assuming there would be some UTF-8-safe way to use any character of a language in a URL slug or filename, so that even Arabic titles would work. After all, Linux supports UTF-8 filenames and browsers *should* encode HTML links as needed. Big thanks for your input here.
**Comment 2:** On second thought, you're actually right, but it's not just an issue of the browser encoding the links correctly. The easiest way to get close to what you want is to map non-ASCII characters to their closest ASCII equivalents and then URL-encode your link in the HTML body. The hard way is to ensure consistent UTF-8 encoding (or UTF-16, I think, for some Chinese dialects) from your data store, through your web server, application layer (PHP), page content and web browser, and **not** urlencode your URLs (while still stripping 'undesirable' characters). That will give you nice, non-encoded links and URLs.
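For illustration, a small sketch of those two approaches using plain PHP built-ins; the title string, the `/posts/` path and the reliance on ext/intl and ext/mbstring are assumptions made for this example only.

```php
<?php
$title = 'Überraschung für alle';

// Easy way: map to closest ASCII, then percent-encode the path segment.
$ascii = strtolower(transliterator_transliterate('Any-Latin; Latin-ASCII', $title));
$ascii = trim(preg_replace('/[^a-z0-9]+/', '-', $ascii), '-');
echo '<a href="/posts/' . rawurlencode($ascii) . '">ASCII, encoded</a>', PHP_EOL;

// Hard way: keep the slug as raw UTF-8 (strip only 'undesirable' runs)
// and do not percent-encode it; the whole pipeline must then stay UTF-8,
// and the browser encodes the request itself when the link is followed.
$utf8 = trim(preg_replace('/[^\p{L}\p{N}]+/u', '-', mb_strtolower($title, 'UTF-8')), '-');
echo '<a href="/posts/' . htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8') . '">UTF-8, raw</a>', PHP_EOL;
```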
**Comment 3:** Good advice. I'm going to try to create a pure UTF-8 environment. Then, taking several strings from non-ASCII languages, I'll remove dangerous characters (`.` `/` `;` `:` etc.), create files, and then make HTML links to those files to see whether I can click them and whether all of this works. If not, I'll probably have to fall back to `rawurlencode()`/`urlencode()` to allow UTF-8. I'll post the results back here.
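If it helps, here is a rough sketch of that test; the sample titles, the `test/` directory and the exact character list being stripped are assumptions for the sketch, not the actual setup.

```php
<?php
// Write files with raw UTF-8 names and emit links to them, so the round
// trip (filesystem -> HTML -> browser -> request) can be clicked through.
$titles = ['مرحبا بالعالم', '日本語のタイトル', 'Überraschung'];

foreach ($titles as $title) {
    // Strip characters that are risky in paths (., /, ;, :, \ and controls).
    $name = preg_replace('/[.\/;:\\\\\x00-\x1F]+/u', '', $title);

    // Create the file with its raw UTF-8 name (assumes a UTF-8 filesystem
    // and that the test/ directory already exists).
    file_put_contents(
        __DIR__ . "/test/{$name}.html",
        '<p>' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '</p>'
    );

    // Link to it both raw and percent-encoded to compare browser handling.
    echo '<a href="test/' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '.html">raw</a> ';
    echo '<a href="test/' . rawurlencode($name) . '.html">encoded</a><br>', PHP_EOL;
}
```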