Reading UTF-8 Text File As UTF-16 In Java

I am reading a UTF-8 encoded text file in my Java program as UTF-16, just to see what happens. The output string contains only '?'. Could anyone please explain how the UTF-8 code points are converted to UTF-16, and why I get only '?' in my output?

This is the code:

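import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class MyUTF {

    public static void main(String[] args) throws IOException
    {
        // Open the UTF-8 file as a raw byte stream; it is closed automatically.
        try (InputStream is = new FileInputStream("file1.txt")) {
            // Estimate of the number of bytes readable without blocking.
            System.out.println(is.available());

            // Deliberately decode the bytes as UTF-16 instead of UTF-8.
            InputStreamReader isr = new InputStreamReader(is, "UTF-16");
            BufferedReader br = new BufferedReader(isr);
            System.out.println(br.readLine());
        }
    }
}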

If the file contains 'a' then I get '?' as output. If it contains 'abc' then I get '??'.

Please explain this conversion from UTF-8 to UTF-16.

Thanks in advance.

Answer

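What you see in your terminal depends on several factors:

  • Which byte order does the decoder assume? (Java's "UTF-16" charset honors a BOM if the input starts with one and otherwise assumes big endian, whatever your platform's endianness.)
  • Which encoding does System.out use, and can your terminal display lots of characters or only a few?

If you are seeing only question marks, most likely the encoding used for console output cannot represent the decoded characters, so each one gets replaced with '?'.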

I can tell you what I see on my Mac. My laptop is little endian, though that turns out not to matter for this decoding. I made the file file1.txt contain abc followed by a newline, in other words the four characters U+0061 U+0062 U+0063 U+000A. Since UTF-8 is the default encoding here, my file contains 4 bytes:

61 62 63 0A
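If you want to check those bytes yourself, here is a minimal sketch that encodes the same string and dumps the result as hex. The class name EncodeDemo and the helper hex are made up for illustration; the second line of output shows, for comparison, what the file would have to contain to really be big-endian UTF-16.

```java
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    // Hypothetical helper: formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The four characters a, b, c, newline encoded as UTF-8: one byte each.
        System.out.println(hex("abc\n".getBytes(StandardCharsets.UTF_8)));
        // The same characters encoded as big-endian UTF-16 (no BOM): two bytes each.
        System.out.println(hex("abc\n".getBytes(StandardCharsets.UTF_16BE)));
    }
}
```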

Please understand: a file contains only bytes. It does not contain characters. (Sure, there are tricks like sticking a BOM at the start of the file to make its intended encoding apparent, but really that is just a suggestion.)

Now when you read that file as UTF-16, those four bytes are decoded, two at a time, into two characters:

U+6162
U+630A
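The pairing of bytes can be reproduced without a file. This is only a sketch: the class DecodeDemo and the method codePoints are names I invented, and it decodes an in-memory byte array rather than file1.txt. With no BOM present, Java's UTF-16 decoder treats each pair of bytes as one big-endian 16-bit code unit.

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Hypothetical helper: decodes raw bytes as UTF-16 and lists the code points.
    static String codePoints(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_16);
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The UTF-8 bytes of "abc\n", misread as UTF-16.
        byte[] fileBytes = {0x61, 0x62, 0x63, 0x0A};
        System.out.println(codePoints(fileBytes)); // U+6162 U+630A
    }
}
```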

When I run your program, it prints this for me:

慢挊

Now suppose the file has no newline, so it contains only three bytes:

61 62 63

Now when I run your program I see:

慢�

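That is the character U+6162 as before, followed by the replacement character U+FFFD, because the single leftover byte 63 cannot be decoded as UTF-16. In UTF-16, characters are represented in either 2 or 4 bytes, never just 1. My terminal shows replacement characters; I think yours just shows question marks. That would also explain your output: a file containing only 'a' decodes to a single replacement character (one '?'), and 'abc' decodes to U+6162 plus a replacement character (two '?').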
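To see where the question marks can come from, here is a rough sketch that decodes the bytes the same way your program does and then pushes the result through an encoding that cannot represent it. I am assuming US-ASCII as a stand-in for whatever limited encoding your console uses; that is a guess about your setup, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Decodes bytes through an InputStreamReader, as the program in the question does.
    static String decodeUtf16(byte[] bytes) {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_16)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumption: the console encoding behaves like US-ASCII, which replaces
    // every character it cannot encode with '?'.
    static String asLimitedConsoleShows(String s) {
        byte[] out = s.getBytes(StandardCharsets.US_ASCII);
        return new String(out, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // File containing only 'a': one leftover byte, one replacement character.
        System.out.println(asLimitedConsoleShows(decodeUtf16(new byte[] {0x61})));
        // File containing 'abc': U+6162 plus one replacement character.
        System.out.println(asLimitedConsoleShows(decodeUtf16(new byte[] {0x61, 0x62, 0x63})));
    }
}
```

If your console really is the limiting factor, printing through a PrintStream created with an explicit UTF-8 encoding, on a terminal that understands UTF-8, should show the decoded characters instead.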

source: stackoverflow.com