sanitize upper vs lower case

Is there a reason that when sanitizing a string, the characters are converted to lowercase as opposed to uppercase?

I've seen this convention in many languages, but in terms of my current environment, we'll say Rails and/or JavaScript.

Answer

No specific reason to my knowledge, but neither uppercasing nor lowercasing is the whole story in the Unicode world.

For example, the German letter ß is treated as equivalent to ss for case-insensitive matching; they're both lowercase (ß uppercases to SS), and a word spelled with ß can also be spelled with ss.

Conversely, in Turkish, ı (dotless i) is distinct from i (dotted i), but unless your locale is Turkish, uppercasing either one produces I (the plain ASCII capital I); in a Turkish locale, i uppercases to İ (dotted capital I) instead. Collapsing the two changes meaning. You don't want to use the wrong one; they aren't equivalent.
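
A quick Python sketch of that collision. Python's str.upper() and str.lower() use the locale-independent Unicode mappings, so this shows the non-Turkish behaviour; the variable names are just for illustration.

```python
dotless = "\u0131"  # ı, LATIN SMALL LETTER DOTLESS I
dotted = "i"        # plain ASCII i

# Outside a Turkish locale, both letters uppercase to the same ASCII "I".
upper_dotless = dotless.upper()
upper_dotted = dotted.upper()
collide_when_uppercased = upper_dotless == upper_dotted

# Lowercasing leaves the two letters distinct.
distinct_when_lowercased = dotless.lower() != dotted.lower()

print(upper_dotless, upper_dotted, collide_when_uppercased, distinct_when_lowercased)
```

So for this particular pair, normalizing to uppercase loses information that normalizing to lowercase keeps.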

Because of this, some programming languages offer more specific "case normalizing" conversions per the case folding rules in section 3.13 of the Unicode standard; Python 3.3 introduced str.casefold for that reason. It's much like .lower(), but will also normalize stuff like ß to ss because they're logically equivalent (if you're uniquifying, you wouldn't want two strings that differ only in ß vs. ss to be treated as different).
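
As a rough illustration of the uniquifying case, here is a minimal Python sketch comparing .lower() with .casefold(). The helper name unique_caseless and the sample words are made up for this example; it only folds case and does not do any Unicode normalization of combining characters.

```python
def unique_caseless(words):
    """Drop case-insensitive duplicates, keeping the first spelling seen."""
    seen = set()
    result = []
    for word in words:
        key = word.casefold()  # folds ß to ss, unlike .lower()
        if key not in seen:
            seen.add(key)
            result.append(word)
    return result


names = ["Straße", "STRASSE", "strasse", "Strand"]

# .lower() leaves ß alone, so "straße" and "strasse" stay separate keys.
keys_by_lower = {n.lower() for n in names}

# .casefold() maps all three spellings to "strasse".
deduped = unique_caseless(names)

print(sorted(keys_by_lower))
print(deduped)
```

With .lower() the three spellings of the same word end up as two different keys; with .casefold() they collapse to one.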

If you don't have case folding available in your language, then the distinction between normalizing as upper vs. lower case is mostly by convention.

source: stackoverflow.com