
ISO-8859-1 Octal Escapes Back To Normal Characters


I'm currently converting our old project database into a new format in a new database. Some of the old data was probably escaped by a smartphone app, so an entry now looks like this:

Tak hur\341 v posteli po pr\341ci a jde se sp\355nkat

The real entry should look like this:

Tak hurá v posteli po práci a jde se spínkat

There are also entries like

Som nen\\355 ja len chodiaca kapuc\\341 pra\\u0161iva ignorujuca

which don't look like ISO-8859-1, especially the \\u0161 part.

Is there a PHP function I can use to convert this back to a readable version? Thanks!


Answer

Simple workaround:

The first string contains only octal-escaped ISO-8859-1 bytes, while the second one has doubled backslashes and mixes octal ISO-8859-1 escapes with \uXXXX escapes, which are UTF-16 code units written in hex (why? now that is the question). The code below takes the octal codes, converts them to hex, packs them to binary and converts the result to UTF-8. The \u codes are already in hex, so they only need to be packed and converted to UTF-8.
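As a rough illustration of those steps, the sketch below decodes a single octal escape and a single \u escape by hand. The variable names are illustrative only, and it assumes the `mbstring` extension is available.

```php
<?php
// Octal escape \341: convert the octal digits to hex, then pack to one raw byte.
$hex  = base_convert('341', 8, 16);                         // "e1"
$byte = pack('H*', $hex);                                   // ISO-8859-1 byte 0xE1
$a    = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');  // "á" as UTF-8

// \u0161: the digits are already hex, so they are packed into two bytes
// and read as one big-endian UTF-16 code unit.
$s = mb_convert_encoding(pack('H*', '0161'), 'UTF-8', 'UTF-16BE'); // "š" as UTF-8

echo bin2hex($a), ' ', bin2hex($s), "\n"; // prints "c3a1 c5a1"
```

The hex dump shows the two-byte UTF-8 sequences for á (U+00E1) and š (U+0161).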

For future reference on charsets: http://www.fileformat.info/info/charset/index.htm

<?php
        // Single quotes keep the escapes literal, as they are stored in the database;
        // in double quotes PHP itself would turn \341 into the byte 0xE1.
        $string = 'Tak hur\341 v posteli po pr\341ci a jde se sp\355nkat';
        $string2 = 'Som nen\\\\355 ja len chodiaca kapuc\\\\341 pra\\\\u0161iva ignorujuca';

        print decode_str($string2)."<br>";
        print decode_str($string);


        function decode_str($string){
            return utf16_to_utf8(iso_to_utf8($string));
        }

        // Replaces \NNN octal escapes (one or more leading backslashes) with the
        // ISO-8859-1 byte they name, then converts the whole string to UTF-8.
        function iso_to_utf8($string){
            $string = preg_replace_callback('#\\\\+([0-3][0-7]{2})#', function($m){
                // octal -> two-digit hex -> raw byte
                return pack("H*", sprintf("%02x", octdec($m[1])));
            }, $string);

            return mb_convert_encoding($string,"UTF-8","ISO-8859-1");
        }

        // Replaces \uXXXX escapes with the UTF-8 form of that UTF-16 code unit.
        function utf16_to_utf8($string){
            return preg_replace_callback('#\\\\+u([0-9a-fA-F]{4})#', function($m){
                return mb_convert_encoding(pack("H*", $m[1]),"UTF-8","UTF-16BE");
            }, $string);
        }

    ?>
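A possible variation, sketched here under stated assumptions rather than taken from the answer: both escape forms can be handled in one pass with `preg_replace_callback`. The function name `decode_escapes` is hypothetical. The pattern accepts one or more leading backslashes, each \u escape is treated as a single code unit (surrogate pairs are not combined), and the text outside the escapes is assumed to be plain ASCII or already UTF-8.

```php
<?php
// Hypothetical helper: decodes \NNN (octal, ISO-8859-1) and \uXXXX escapes to UTF-8.
function decode_escapes(string $s): string
{
    return preg_replace_callback(
        '/\\\\+(?:u([0-9a-fA-F]{4})|([0-3][0-7]{2}))/',
        function (array $m): string {
            if ($m[1] !== '') {
                // \uXXXX: two packed bytes read as one big-endian UTF-16 code unit
                return mb_convert_encoding(pack('H*', $m[1]), 'UTF-8', 'UTF-16BE');
            }
            // \NNN: octal value of a single ISO-8859-1 byte
            return mb_convert_encoding(chr(octdec($m[2])), 'UTF-8', 'ISO-8859-1');
        },
        $s
    ) ?? $s; // fall back to the input if the regex engine reports an error
}

// Single-quoted literals keep the backslashes as stored.
echo decode_escapes('Tak hur\341 v posteli po pr\341ci a jde se sp\355nkat'), "\n";
echo decode_escapes('pra\\\\u0161iva nen\\\\355'), "\n";
```

Unlike the two-step version, this converts only the escapes themselves, so it would not be suitable if the surrounding text also contains raw ISO-8859-1 bytes.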
source: stackoverflow.com