How Do I Compare Single Multibyte Character Constants Cross-platform In C?

In my previous post I found a solution to do this using C++ strings, but I wonder if there would be a solution using char's in C as well.

My current solution uses str.compare() and size() of a character string as seen in my previous post.

Now, since I only use one (multibyte) character in the std::string, would it be possible to achieve the same using a char?

For example, if( str[i] == '¶' )? How do I achieve that using char's?

(edit: made a typo on SO for the comparison operator, as pointed out in the comments)

Answer

How do I compare single multibyte character constants cross-platform in C?

You seem to mean an integer character constant expressed using a single multibyte character. The first thing to recognize, then, is that in C, integer character constants (examples: 'c', '¶') have type int, not char. The primary relevant section of C17 is paragraph 6.4.4.4/10:

An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined. If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.

(Emphasis added.)
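To make the type point concrete, here is a minimal sketch. The function names are hypothetical, chosen only for illustration. It relies on nothing beyond the quoted paragraph: in C, sizeof applied to an integer character constant yields sizeof(int), and a constant for a single-byte character compares equal to a char holding that character.

```c
#include <stddef.h>

/* Size of an integer character constant. In C the constant has type int,
   so this is sizeof(int), not 1. */
size_t char_constant_size(void) {
    return sizeof('c');
}

/* Returns 1 when a char holding a single-byte character compares equal to
   the corresponding integer character constant. */
int single_byte_constant_matches_char(void) {
    char c = 'c';
    return c == 'c';   /* c is promoted to int before the comparison */
}
```

Note that this is one of the places where C and C++ differ: in C++ a character literal such as 'c' has type char, so the same sizeof expression yields 1 there.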

Note well that "implementation-defined" implies limited portability from the get-go. Even if we rule out implementations defining perverse behavior, we still have alternatives such as

  • the implementation rejects integer character constants containing multibyte source characters; or
  • the implementation rejects integer character constants that do not map to a single-byte execution character; or
  • the implementation maps source multibyte characters via a bytewise identity mapping, regardless of the byte sequence's significance in the execution character set.

That is not an exhaustive list.

You can certainly compare integer character constants with each other, but if they map to multibyte execution characters then you cannot usefully compare them to individual chars.
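As an illustration of why a lone char cannot stand in for such a character, consider the following sketch. It assumes the in-memory encoding is UTF-8, where ¶ (U+00B6) occupies the two bytes 0xC2 0xB6. The names PILCROW_UTF8 and starts_with_mbchar are hypothetical. The bytes are spelled with escapes so that the example does not depend on how the compiler maps source characters.

```c
#include <string.h>

/* UTF-8 encoding of U+00B6, written as explicit bytes (an assumption:
   other execution encodings represent this character differently). */
const char PILCROW_UTF8[] = "\xC2\xB6";

/* Returns 1 if the bytes at s begin with the multibyte character mbchar.
   This is the closest C analog of `str[i] == X` when X needs more than one
   byte: compare a short run of bytes rather than a single char. */
int starts_with_mbchar(const char *s, const char *mbchar) {
    /* strncmp stops at the terminator of s, so a short s is safe */
    return strncmp(s, mbchar, strlen(mbchar)) == 0;
}
```

With this, the test in the question would be written along the lines of starts_with_mbchar(&str[i], PILCROW_UTF8).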

Inasmuch as your intended application appears to be locating individual multibyte characters in a C string, the most natural thing to do is to implement a C analog of your C++ approach, using the standard strstr() function. Example:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char str[] = "Some string ¶ some text ¶ to see";
        char char_to_compare[] = "¶";
        size_t char_size = sizeof(char_to_compare) - 1;  // don't count the string terminator

        // Each strstr() call resumes just past the previous match
        for (char *location = strstr(str, char_to_compare);
                location;
                location = strstr(location + char_size, char_to_compare)) {
            puts("Found!");
        }
        return 0;
    }

That will do the right thing in many cases, but it still might be wrong for some characters in some execution character encodings, such as those encodings featuring multiple shift states.
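One way to reduce false matches is to test only at character boundaries, rather than at every byte offset as strstr() does. The following is a sketch under stated assumptions, not a complete solution: count_at_boundaries is a hypothetical name, the boundaries are those reported by the standard mbrlen() function for the current locale, and the byte comparison itself still ignores shift state, so it only helps with stateless encodings in which one character's trailing bytes can look like another character.

```c
#include <string.h>
#include <wchar.h>

/* Counts occurrences of needle in haystack, testing only at character
   boundaries as determined by mbrlen() in the current locale.
   Returns -1 if haystack contains an invalid or incomplete sequence. */
long count_at_boundaries(const char *haystack, const char *needle) {
    size_t needle_len = strlen(needle);
    size_t remaining = strlen(haystack);
    mbstate_t state;
    long count = 0;

    if (needle_len == 0) {
        return 0;
    }
    memset(&state, 0, sizeof state);   /* start in the initial conversion state */

    while (remaining > 0) {
        if (remaining >= needle_len &&
                memcmp(haystack, needle, needle_len) == 0) {
            count++;
        }
        /* Length in bytes of the character starting at haystack */
        size_t n = mbrlen(haystack, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2) {
            return -1;                 /* invalid or truncated character */
        }
        if (n == 0) {
            break;                     /* reached a null character */
        }
        haystack += n;
        remaining -= n;
    }
    return count;
}
```

The program must select a suitable locale first, for example with setlocale(LC_ALL, ""), for mbrlen() to interpret anything beyond the default "C" locale's encoding.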

If you want robust handling for characters outside the basic execution character set, then you would be well advised to take control of the in-memory encoding, and to perform appropriate conversions to, operations on, and conversions from that encoding. This is largely what ICU does, for example.
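As a rough sketch of what "taking control of the encoding" can look like, suppose you decide that all strings are held in memory as UTF-8. You can then decode them to code points and compare integers, which sidesteps character constants entirely. The names utf8_decode and count_code_point are hypothetical, and this is not ICU's API. The decoder is deliberately simplified: it checks the structure of each sequence but does not reject overlong forms or surrogates, which a production decoder should.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 sequence starting at s into *cp.
   Returns the number of bytes consumed, or 0 if s is at the terminator
   or the sequence is malformed. */
size_t utf8_decode(const char *s, uint32_t *cp) {
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;

    if (p[0] == 0) {
        return 0;
    }
    if (p[0] < 0x80) {                     /* single byte (ASCII) */
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {    /* two-byte lead */
        *cp = p[0] & 0x1F;
        len = 2;
    } else if ((p[0] & 0xF0) == 0xE0) {    /* three-byte lead */
        *cp = p[0] & 0x0F;
        len = 3;
    } else if ((p[0] & 0xF8) == 0xF0) {    /* four-byte lead */
        *cp = p[0] & 0x07;
        len = 4;
    } else {
        return 0;                          /* stray continuation or invalid byte */
    }
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) {
            return 0;                      /* also catches an early terminator */
        }
        *cp = (*cp << 6) | (uint32_t)(p[i] & 0x3F);
    }
    return len;
}

/* Counts occurrences of code point target in the UTF-8 string s.
   Returns -1 if s is not well-formed according to utf8_decode(). */
long count_code_point(const char *s, uint32_t target) {
    long count = 0;
    uint32_t cp;
    size_t n;

    while (*s != '\0') {
        n = utf8_decode(s, &cp);
        if (n == 0) {
            return -1;
        }
        if (cp == target) {
            count++;
        }
        s += n;
    }
    return count;
}
```

For the string in the example above, count_code_point(str, 0x00B6) would report the number of ¶ characters, provided str really is UTF-8 in memory.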

source: stackoverflow.com