Facebook/Messenger Archive Contains Emoji That I Am Unable To Parse

I can't figure out how to decode Facebook's way of encoding emoji in the Messenger archive.

Hi everyone, I'm trying to write a handy utility in Python to explore Messenger's archive file.

The messages file is "badly encoded" JSON, as described in this other post: Facebook JSON badly encoded

Using .encode('latin1').decode('utf8') I've been able to deal with most characters such as "é" or "à" and display them correctly. But I'm having a hard time with emojis, as they seem to be encoded in a different way.
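For reference, the repair step looks roughly like the following. This is a minimal sketch: the helper names fix_mojibake and fix_all and the sample fields sender_name and content are only illustrative, and the sample JSON is made up rather than copied from a real archive.

```python
import json


def fix_mojibake(text):
    """Undo the Latin-1/UTF-8 mix-up; return the text unchanged if it does not apply."""
    try:
        return text.encode("latin1").decode("utf8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the string holds characters above U+00FF, or its bytes
        # are not valid UTF-8, so it was probably not mis-encoded.
        return text


def fix_all(value):
    """Recursively repair every string in a parsed JSON structure."""
    if isinstance(value, str):
        return fix_mojibake(value)
    if isinstance(value, list):
        return [fix_all(item) for item in value]
    if isinstance(value, dict):
        return {fix_all(key): fix_all(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through untouched


# Illustrative sample: UTF-8 bytes written out as \u00XX escapes.
raw = '{"sender_name": "Andr\\u00c3\\u00a9", "content": "d\\u00c3\\u00a9j\\u00c3\\u00a0"}'
messages = fix_all(json.loads(raw))
print(messages)
```

Applying the repair after json.loads, rather than on the raw file text, keeps the JSON escapes intact until the parser has turned them into characters.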

Example of a problematic emoji: \u00f3\u00be\u008c\u00ba

The encoding/decoding does not raise any errors, but Tkinter refuses to display what the function outputs and fails with "_tkinter.TclError: character U+fe33a is above the range (U+0000-U+FFFF) allowed by Tcl". Tkinter is not the real issue, though, because trying to display the same emoji in the console yields "ó¾º", which clearly isn't what's supposed to be displayed (it's supposed to be a crying face).

I've tried using the emoji library, but it doesn't seem to help:

>>> print(emoji.emojize("\u00f3\u00be\u008c\u00ba"))
'ó¾º'

How can I retrieve the proper emoji and display it? If it's not possible, how can I detect problematic emojis to maybe sanitize and remove them from the JSON in the first place?

Thank you in advance

Answer

.encode('latin1').decode('utf8') is correct: it results in the codepoint U+fe33a ("󾌺"). This codepoint is in a Private Use Area (PUA), specifically Supplementary Private Use Area-A, so anyone can assign their own meaning to it. (Maybe Facebook wanted a crying face before there was one in Unicode, so they used the PUA?)
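You can check this yourself. A minimal sketch follows; is_private_use is a hypothetical helper of mine that tests against the three Private Use ranges defined by Unicode.

```python
def is_private_use(ch):
    """Return True if the single character ch lies in a Unicode Private Use Area."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF           # Private Use Area (BMP)
        or 0xF0000 <= cp <= 0xFFFFD      # Supplementary Private Use Area-A
        or 0x100000 <= cp <= 0x10FFFD    # Supplementary Private Use Area-B
    )


raw = "\u00f3\u00be\u008c\u00ba"  # the example from the question
decoded = raw.encode("latin1").decode("utf8")

print([f"U+{ord(ch):04X}" for ch in decoded])  # ['U+FE33A']
print(is_private_use(decoded))                 # True
```

The same predicate can be used to detect the problematic characters in any repaired string, which covers the "how can I detect them" part of the question.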

Googling for that char (https://www.google.com/search?q=󾌺) makes Google autocorrect it to U+1f62d ("😭"). Sadly, I have no idea how Google maps U+fe33a to U+1f62d.

Googling for U+fe33a site:unicode.org gives https://unicode.org/L2/L2010/10132-emojidata.pdf, which lists U+1F62D as the proposed official codepoint.
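Once you have such a mapping, replacing the old codepoints is straightforward. A rough sketch, with assumptions: PUA_TO_UNICODE and replace_pua are names I made up, and the table contains only the single pair mentioned above; you would have to fill in the rest from whatever mapping source you end up trusting.

```python
# Hypothetical mapping table; only this one pair comes from the document above.
PUA_TO_UNICODE = {
    0xFE33A: 0x1F62D,  # old PUA codepoint -> LOUDLY CRYING FACE
}

# The three Private Use ranges defined by Unicode.
PUA_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def replace_pua(text, placeholder="\ufffd"):
    """Map known PUA codepoints to official ones; swap unknown ones for placeholder."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in PUA_TO_UNICODE:
            out.append(chr(PUA_TO_UNICODE[cp]))
        elif any(low <= cp <= high for low, high in PUA_RANGES):
            out.append(placeholder)  # unmapped private-use character
        else:
            out.append(ch)
    return "".join(out)


print(replace_pua("so sad \U000FE33A"))
```

Pass placeholder="" to drop unmapped characters instead of marking them. Note that U+1F62D is itself above U+FFFF, so a Tcl build with the limit quoted in the question would presumably still refuse to display it; the replacement only fixes the meaning of the character, not that limit.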

As that document from Unicode lists U+fe33a as a codepoint used by Google, I searched for android old emoji codepoints pua. Among other things, there were two actually usable results:

  1. How to get Android emoji code point - the question links to :
  2. https://github.com/googlei18n/noto-emoji/issues/115 - this thread links to:

I included my search queries in the answer because none of the results I found are in any way authoritative, but it should be enough to get your tool working :-)

source: stackoverflow.com