
Does Anyone Know A Good Heuristic For Detecting UTF-8 Badly Decoded As Latin-1 Text?

I'm getting weather alerts from a weather service. Although the HTTP response claims to be UTF-8, it clearly contains text like this:

SuÃ°austan 13-20 m/s og snjÃ³koma meÃ° lÃ©legu skyggni og versnandi akstursskilyrÃ°um.

...that should look like this:

Suðaustan 13-20 m/s og snjókoma með lélegu skyggni og versnandi akstursskilyrðum.

...but it was decoded with the wrong encoding somewhere upstream and then re-encoded as UTF-8 before it ever reached me. Most of us have probably seen this kind of "mojibake" garbage before, and visually at least, it tends to share common characteristics, such as lots of Ã characters, ¢ signs and the like.
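
To make the failure mode concrete, here is a minimal sketch of how text like this is presumably produced, and how it can be reversed. It assumes the upstream mistake was a plain Latin-1 (ISO-8859-1) decode rather than Windows-1252, which I can't confirm. `garble` and `repair` are made-up names for illustration only.

```javascript
// Hypothetical helpers, for illustration only.

// Simulate the upstream bug: take the UTF-8 bytes and read each byte
// as one Latin-1 character.
function garble(text) {
  return Buffer.from(text, 'utf8').toString('latin1');
}

// Undo it: turn each character back into one byte, then decode as UTF-8.
function repair(text) {
  return Buffer.from(text, 'latin1').toString('utf8');
}

const garbled = garble('Suðaustan');
console.log(garbled);          // 'SuÃ°austan'
console.log(repair(garbled));  // 'Suðaustan'
```

Each non-ASCII character turns into two or more characters when garbled, which is why the repaired string is shorter than the garbled one.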

I'm using this code to fix it up right now:

  // Check for UTF-8 wrongly decoded as Latin-1
  if (/[\x80-\xC5]/.test(result)) {
    const bytes = Buffer.from(result, 'latin1');
    const altText = bytes.toString('utf8');

    if (altText.length < result.length)
      result = altText;
  }

...and that's doing the job for now, but it's not a very sophisticated test.

Anyone know of a better method?


Answer

Anyone know of a better method?

I don't know how you'd define "better", but I wrote this function a while ago to do exactly this transform on a string.

I don't know whether it's any better than the Buffer approach.

function utf8_decode(str) {
  // Assumes the input is UTF-8 whose bytes were decoded as Latin-1 (one character per byte).
  // Anything that doesn't look like a well-formed multibyte sequence is left in the string as-is.
  return str.replace(
    /[\u00c0-\u00df][\u0080-\u00bf]|([\u00e0-\u00ef][\u0080-\u00bf]{2})|([\u00f0-\u00f7][\u0080-\u00bf]{3})/g,
    // `two` is the whole match; `three` and `four` are the capture groups and are
    // undefined unless the 3-byte or 4-byte alternative matched.
    (two, three, four) => String.fromCodePoint(
      // 4-byte sequence: 3 + 6 + 6 + 6 payload bits
      four ? (four.charCodeAt(0) & 7) << 18 | (four.charCodeAt(1) & 63) << 12 | (four.charCodeAt(2) & 63) << 6 | (four.charCodeAt(3) & 63) :
      // 3-byte sequence: 4 + 6 + 6 payload bits
      three ? (three.charCodeAt(0) & 15) << 12 | (three.charCodeAt(1) & 63) << 6 | (three.charCodeAt(2) & 63) :
      // 2-byte sequence: 5 + 6 payload bits
      (two.charCodeAt(0) & 31) << 6 | (two.charCodeAt(1) & 63)
    )
  )
}

console.log(utf8_decode("SuÃ°austan 13-20 m/s og snjÃ³koma meÃ° lÃ©legu skyggni og versnandi akstursskilyrÃ°um."));

console.log(utf8_decode("ð\x9F\x98\x8B"));

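The regex is stricter than yours: it only matches sequences shaped like UTF-8 multibyte characters (a lead byte followed by the right number of continuation bytes), so anything else is left untouched.

That means there's no need to check afterwards whether the transform actually changed anything.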
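
If you'd rather stay with the Buffer approach, one possible refinement is to let a strict decoder decide whether the bytes are valid UTF-8 at all, instead of comparing lengths. The following is only a sketch under a few assumptions: it runs on a Node.js build with ICU (the default), where `TextDecoder` is a global and supports `fatal: true`; the wrong decode was true Latin-1, not Windows-1252; and the whole string is either garbled or not. `fixMojibake` is a made-up name.

```javascript
// Hypothetical helper: undo "UTF-8 read as Latin-1" only when the whole
// string round-trips as strictly valid UTF-8.
function fixMojibake(text) {
  // Pure ASCII needs no repair, and any character above U+00FF cannot
  // have come out of a Latin-1 decode, so leave both cases alone.
  if (!/[\u0080-\u00ff]/.test(text) || /[^\u0000-\u00ff]/.test(text)) {
    return text;
  }

  // One character becomes one byte again.
  const bytes = Buffer.from(text, 'latin1');

  try {
    // With fatal: true, decode() throws on any malformed UTF-8 sequence
    // instead of silently inserting U+FFFD replacement characters.
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch {
    // Not valid UTF-8, so this was most likely genuine Latin-1 range text.
    return text;
  }
}

console.log(fixMojibake('SuÃ°austan'));  // 'Suðaustan'
console.log(fixMojibake('Suðaustan'));   // 'Suðaustan' (already correct, left alone)
```

The trade-off against the regex version is that this is all-or-nothing: a string mixing garbled and correct non-ASCII text is returned unchanged, whereas the regex repairs whatever sequences it recognises. Neither can tell apart text that legitimately contains something like "Ã©".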

source: stackoverflow.com