Regex In C To Restrict Extended ASCII Character Set

I need a regular expression in C that matches everything except the first 32 characters of extended ASCII, with length greater than 0. I thought the easiest way to do that would be a pattern like "^[^\\x00-\\x20]+$", but it's not working as I expected. For some reason it won't match any character from 48 to 92. Any ideas what's wrong with this pattern and how I can fix it?

Link to Extended ASCII character set table

Answer

The POSIX regex library (i.e. the functions in regex.h, including regcomp and regexec) does not interpret standard C backslash sequences. It doesn't need to, since the C compiler performs those expansions when it compiles the string literal. (This is something you have to think about if you accept regular expressions from user input.) The only use of \ in a regex is to escape a special character (in REG_EXTENDED mode), or to make a character special (in basic regex mode, which should be avoided).
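To make the compile-time expansion concrete, here is a small sketch. The array names are made up for illustration, and static_assert assumes a C11 (or later) compiler:

```c
#include <assert.h>

/* The compiler expands \x41 in a literal to one byte before any library runs. */
const char expanded[] = "\x41";

/* The same four characters as a user would type them: backslash, x, 4, 1.
   The regex library would receive these bytes unchanged. */
const char as_typed[] = "\\x41";

/* Checked at compile time: sizeof counts the terminating NUL as well. */
static_assert(sizeof expanded == 2, "one byte plus the terminating NUL");
static_assert(sizeof as_typed == 5, "four bytes plus the terminating NUL");
```

A pattern read from user input looks like as_typed, so a \x41 typed by a user reaches regcomp as four separate characters rather than as one byte.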

So if you want to exclude characters from \x01 to \x20, you would write:

 "^[^\x01-\x20]+$"

Note that you must supply the REG_EXTENDED flag to regcomp for that to work.
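As a rough sketch of how that pattern might be used, the helper below wraps regcomp and regexec. The function name and its return convention are assumptions made for this example, not part of regex.h, and it assumes the default "C" locale:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (the name is ours, not part of regex.h).
   Returns 1 if text is non-empty and contains no byte in the range 0x01-0x20,
   0 if it does not match, and -1 if the pattern fails to compile. */
int has_no_control_or_space(const char *text)
{
    regex_t re;
    int rc;

    assert(text != NULL);

    /* REG_EXTENDED makes + a repetition operator; REG_NOSUB skips the
       match offsets, which are not needed for a yes/no answer. */
    if (regcomp(&re, "^[^\x01-\x20]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Under these assumptions, has_no_control_or_space("Hello") returns 1, while "Hello world" returns 0 because the space is \x20, and the empty string returns 0 because + requires at least one character.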

As you might note, that does not exclude NUL (\x00). There's no way to insert a NUL into a regex pattern because NUL is not a valid character inside a C character string; it will terminate the string. For the same reason, it's pointless to try to exclude NUL characters from a C string, because there cannot be any. However, if it made you feel better, you could use:

"^[\x21-\xFF]+$"

Semantically, those two regex patterns are identical (at least, in the default "C" locale and assuming char is 8 bits).

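The character class as you wrote it, [^\\x00-\\x20], reaches regcomp as [^\x00-\x20] with a real backslash in it. Inside a bracket expression a backslash is an ordinary character, so the class matches everything but the character x and the range from 0 (48) to \ (92). (That range overlaps with the characters 0, 2 and \, which are named explicitly, some of them twice.)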
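The behaviour described in the question can be reproduced with a short sketch like the one below. The helper name is invented for this example, and the results assume an ASCII system and the default "C" locale:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: tests text against the pattern from the question.
   Returns 1 on a match, 0 on no match, -1 if the pattern fails to compile. */
int question_pattern_matches(const char *text)
{
    regex_t re;
    int rc;

    assert(text != NULL);

    /* The doubled backslashes survive compilation as single backslash
       characters, so the bracket expression lists '\', 'x', '0', the
       range '0' to '\', and then 'x', '2', '0' again. */
    if (regcomp(&re, "^[^\\x00-\\x20]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this pattern, "A" (65), "0" (48), "\\" (92) and "x" are all rejected, while "a" and "y" match. A single space also matches, which shows that the pattern does not exclude the low characters it was meant to exclude.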

source: stackoverflow.com