Why Is UTF-8 Encoding Needed For Things Such As Socket Communication?

I am using Node.js as my backend, and sockets for the messaging part of my application. While researching how to use sockets, I came across an interesting fact.

The data being transferred by a socket (in my case a string) has to be UTF-8 encoded. What is this UTF-8 encoding used for, and why is it needed?



"... The data being transferred by a socket ... has to be utf-8 encoded ..."

This is not fully true.

A socket can only transfer bytes, so it must be given bytes. A string is not a sequence of bytes but a sequence of characters. To transfer a string over a socket, it must first be encoded as a sequence of bytes and then decoded back into a string on the receiving side. If you already have bytes (like the binary representation of an image), no additional encoding or decoding is needed.
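As a rough illustration of that encode/decode round trip in Node.js, here is a minimal sketch using `Buffer`. The sample text and variable names are made up for the example, and no real socket is involved; the buffer simply stands in for the bytes that would travel over the wire.

```javascript
// Hypothetical example text; \u00e9 is the character "é".
const message = 'h\u00e9llo';

// Encode: characters -> bytes. This is what would be handed to the socket.
const encoded = Buffer.from(message, 'utf8');
console.log(encoded); // <Buffer 68 c3 a9 6c 6c 6f>

// Decode: bytes -> characters. This is what the receiver does with the bytes.
const decoded = encoded.toString('utf8');
console.log(decoded === message); // true
```

Note that when you pass a string to `socket.write()` in Node's `net` module, the encoding step happens for you, with UTF-8 as the default encoding. That is probably where the "has to be UTF-8" impression comes from.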

There are various ways to represent characters as bytes; each such scheme is called a "character encoding". UTF-8 is one of these encodings: ASCII characters (which cover English text) take only a single byte, most other characters from Western languages take 2 bytes, and so on, up to a maximum of 4 bytes per character. There are other encodings, like UTF-32, where all characters take 4 bytes, or ISO-8859-15, where all characters take only one byte but which can only represent the characters found in Western European languages.
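To make the size differences concrete, the sketch below counts the bytes a few sample characters need in different encodings. One caveat: Node's `Buffer` has no built-in UTF-32 or ISO-8859-15 support, so `utf16le` and `latin1` (ISO-8859-1) stand in here as the "other" encodings. The sample characters are my own choice.

```javascript
// Sample characters: a, é, €, and an emoji (U+1F600).
const samples = ['a', '\u00e9', '\u20ac', '\u{1F600}'];

// Count how many bytes each character needs in UTF-8 and UTF-16.
const sizes = samples.map((ch) => ({
  char: ch,
  utf8: Buffer.byteLength(ch, 'utf8'),
  utf16le: Buffer.byteLength(ch, 'utf16le'),
}));
console.table(sizes);
// utf8:    1, 2, 3, 4 bytes
// utf16le: 2, 2, 2, 4 bytes

// A single-byte encoding stores "é" in one byte.
const latin1Size = Buffer.from('\u00e9', 'latin1').length;
console.log(latin1Size); // 1
```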

Because of its small overhead for Western languages, UTF-8 has established itself as the most common character encoding. But you can also use UTF-32 or others, as long as you use the same encoding for both sending (encoding) and receiving (decoding).
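Here is a small sketch of what "use the same encoding on both sides" means in practice. It again uses `Buffer` in place of a real socket, with a made-up sample string, and uses `utf16le` as the alternative encoding since that is one Node supports out of the box.

```javascript
// Hypothetical example text: "café".
const original = 'caf\u00e9';

// The sender encodes as UTF-8.
const wire = Buffer.from(original, 'utf8');

// A receiver that decodes with the same encoding gets the text back.
const matched = wire.toString('utf8');
console.log(matched === original); // true

// A receiver that decodes with a different encoding gets garbage:
// the two UTF-8 bytes of "é" are read as two separate characters.
const mismatched = wire.toString('latin1');
console.log(mismatched); // cafÃ©

// Any other encoding works too, provided both sides agree on it.
const wire16 = Buffer.from(original, 'utf16le');
const matched16 = wire16.toString('utf16le');
console.log(matched16 === original); // true
```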

For more information, I recommend reading Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".