  1. #1
    Just Joined!
    Join Date
    Jul 2009
    Posts
    70

    Charsets and encoding details


    Hello gurus, I would like to dig deeper into charset and encoding issues. I tried Googling it, but had no luck. Please see below.

    My configuration
    Code:
    [pista@HP-PC MULTIBOOT]$ locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=
    I have a file, file1, containing text. I can see this text correctly only on MS Windows. If I just open the file with less, cat or vi, I get this:
    Code:
    [pista@HP-PC konvertovanie]$ cat file1 
    - Prich�dzaj�.
    - Kto prich�dza?
    N�� svet okupuj
    vyvinut� �udsk� druhy,
    
    [pista@HP-PC konvertovanie]$ less file1 
    - Prich<E1>dzaj<FA>.
    - Kto prich<E1>dza?
    N<E1><9A> svet okupuj<FA>
    vyvinut<E9> <BE>udsk<E9> druhy,
    
    [pista@HP-PC konvertovanie]$ vi file1 
    - Prichádzajú.
    - Kto prichádza?
    Ná<9a> svet okupujú
    vyvinuté ľudské druhy,
    Under Linux I have to use iconv to see it correctly:
    Code:
    [pista@HP-PC konvertovanie]$ iconv -f WINDOWS-1250 -t UTF-8 file1
    - Prichádzajú.
    - Kto prichádza?
    Náš svet okupujú
    vyvinuté ľudské druhy,
    I understand that this is because the file was encoded in one format (WINDOWS-1250) and decoded in another (UTF-8). But can you clarify the following?

    1.) When I check the decimal value of each character I get the following lines. What do the negative values mean, and what is the code 341 (shown instead of á)? AFAIK ASCII runs from 0 to 127.
    Code:
    [pista@HP-PC konvertovanie]$ cat file1 | od -An -t dC -c
       45   32   80  114  105   99  104  -31  100  122   97  106   -6   46   13   10
        -         P    r    i    c    h  341    d    z    a    j  372    .   \r   \n
       45   32   75  116  111   32  112  114  105   99  104  -31  100  122   97   63
        -         K    t    o         p    r    i    c    h  341    d    z    a    ?
       13   10   78  -31 -102   32  115  118  101  116   32  111  107  117  112  117
       \r   \n    N  341  232         s    v    e    t         o    k    u    p    u
      106   -6   13   10   48   48   58   48   48   58   48   53   44   56   50   48
        j  372   \r   \n    0    0    :    0    0    :    0    5    ,    8    2    0
       32   45   45   62   32   48   48   58   48   48   58   48   55   44   54   53
             -    -    >         0    0    :    0    0    :    0    7    ,    6    5
       52   13   10  118  121  118  105  110  117  116  -23   32  -66  117  100  115
        4   \r   \n    v    y    v    i    n    u    t  351       276    u    d    s
      107  -23   32  100  114  117  104  121   44   13   10
        k  351         d    r    u    h    y    ,   \r   \n
    2.) My assumption is this: UTF-8 and WINDOWS-1250 use different "numbers" (code representations) for the same characters, so a character encoded with encoding1 (WINDOWS-1250) gets the appropriate "code1" from the encoding1 table. If this encoded character (or rather its numeric representation, "code1") is then decoded using another encoding (UTF-8), the only thing that happens is that "code1" is looked up in the encoding2 (UTF-8) table and the corresponding character from that table is assigned. Am I right? I think an example will make it clear:

    Please look at the following sites; they show what happens if you encode with one encoding and decode with another. It seems that below the 127 (decimal) boundary it does not matter if you decode with the wrong encoding (this is why some characters in the example above were displayed correctly even though the wrong encoding was used).

    from UTF-8 to WINDOWS-1250
    Encoding utf-8 to windows-1250

    from WINDOWS-1250 to UTF-8
    Encoding windows-1250 to utf-8

    According to this site, The extreme UTF-8 table, the "á" character is encoded in UTF-8 as 225. According to Wikipedia (Windows-1250 - Wikipedia, the free encyclopedia), "á" also has the value 225 in Windows-1250. So why is "á" not displayed correctly even if I use the wrong encoding? Check here and type "á": Encoding / decoding tool. Analyze character encoding problems and errors. Another interesting observation: in the UTF-8 table the "š" character appears twice (once with code 154 and once with code 453). Why?

    3.) If I understand it right, there is no way to tell how a file was encoded (unless there is a header that specifies it, or you do some statistical language analysis, etc.). So why/how does the "file" command recognize UTF-8 encoding but not WINDOWS-1250?
    Code:
    [pista@HP-PC konvertovanie]$ file -bi file1 
    text/plain; charset=unknown-8bit
    [pista@HP-PC konvertovanie]$ iconv -f WINDOWS-1250 -t UTF-8 file1 > file1.utf8
    [pista@HP-PC konvertovanie]$ file -bi file1.utf8 
    text/plain; charset=utf-8
    Thank you very much

  2. #2
    Just Joined!
    Join Date
    Mar 2009
    Location
    Santa Cruz, California
    Posts
    76
    Some general information:

    ASCII is a 7-bit character set based on an old Teletype code, mostly good for English. It contains 128 characters numbered 0 to 127 decimal, or 0 to 7F hex. There are a number of 8-bit extensions to ASCII, commonly referred to as "code pages", such as the original IBM PC Western code page and the various 8-bit character sets used by MS-Windows and many Unix systems. Windows designations look like CP 1250, CP 1251, CP 1252, etc. There is also a parallel family of ISO character sets, such as ISO-8859-1, ISO-8859-2, etc., which are similar to the Windows ones but not identical.
    What these character sets have in common is that they are 8-bit extensions of ASCII which are identical to ASCII for the first 128 characters (so that they can all display English text), but the next 128 characters are different for the different code pages.
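    This also relates to the od output in the first post. As a minimal sketch (the list below is simply the first line of that dump, typed in by hand): od -t dC prints each byte as a signed value, so bytes above 127 show up as negative numbers, and od -c prints unprintable bytes in octal. Masking with 0xFF recovers the unsigned byte value.

```python
# First line of the od dump from the first post, up to the full stop
# (Windows-1250 bytes, printed by od as signed decimals).
signed = [45, 32, 80, 114, 105, 99, 104, -31, 100, 122, 97, 106, -6, 46]

# Map each signed value back to the unsigned byte 0..255.
unsigned = [b & 0xFF for b in signed]

# Show the bytes that fall outside ASCII in decimal, octal and hex.
high = [(b, oct(b), hex(b)) for b in unsigned if b > 127]
print(high)  # [(225, '0o341', '0xe1'), (250, '0o372', '0xfa')]

# Decoding with the right code page gives the intended text.
text = bytes(unsigned).decode("cp1250")
print(text)  # - Prichádzajú.
```

    So -31 and octal 341 are both the byte 225 (E1 hex), which is "á" in CP 1250.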

    For example, Windows CP 1250 puts the accented Latin letters needed for various Central European languages in the "high ASCII" range 128-255, CP 1251 places Cyrillic characters in this range, and CP 1252 uses it for the accented characters of Western European languages, as well as various punctuation characters. The ISO character sets are similar, but I won't go into further detail, because you can look up the character tables on the WWW.
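    A small sketch of the same point, using Python's built-in cp1250, cp1251 and cp1252 codecs. The four sample bytes are picked only for illustration: one ASCII byte and three bytes above 127.

```python
# 'h' (ASCII) followed by three bytes from the 128-255 range.
raw = bytes([0x68, 0xE1, 0x9A, 0xBE])

# Decode the very same bytes with three different Windows code pages.
decoded = {codec: raw.decode(codec) for codec in ("cp1250", "cp1251", "cp1252")}

for codec, text in decoded.items():
    print(codec, text)
```

    The "h" comes out the same every time, while the other three bytes turn into different characters depending on the code page.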

    Then there is Unicode - a way of numbering the characters needed for most of the world's languages. Each character gets a number, called a code point. Unicode started out as a 16-bit design, but code points now run from 0 to 10FFFF hex. The first 256 code points correspond to ISO 8859-1, for Western languages. Further characters, for other languages, and a lot of special characters and punctuation, are further up.
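    To make the numbering concrete, here is a short sketch that prints the Unicode code points of three sample characters and checks that the first 256 code points line up with the ISO 8859-1 byte values. Note that these numbers are code points, not the bytes stored in a UTF-8 file.

```python
# Code points are plain integers; ord() returns them.
for ch in "aáš":
    print(ch, ord(ch), hex(ord(ch)))

# Every ISO 8859-1 byte value n decodes to the code point with the same number.
latin1_matches = all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))
print(latin1_matches)  # True
```

    Here "á" is code point 225, the same number as its byte value in ISO 8859-1, while "š" sits further up at 161 hex.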

    UTF-8 is a way to represent Unicode code points using 1 to 4 bytes for each character. It is not a 1-to-1 byte translation of any of the Windows or ISO code pages, not even ISO-8859-1; it is byte-for-byte identical only to ASCII. Code points above 127 take 2 or more bytes in the UTF-8 encoding. UTF-8 uses the high-order bits of the first byte of a character to indicate how many bytes the character takes. Your example involves Windows CP 1250, used for Polish, Czech or Slovak (I'm not sure which), and its accented letters all have code points above 127. For example, "á" is code point 225, which is the single byte E1 in CP 1250 but the two bytes C3 A1 in UTF-8. So there is no byte-by-byte correspondence between the UTF-8 and the CP 1250.
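    As a rough sketch of the lead-byte rule, the helper below (utf8_sequence_length is a made-up name for this example) looks only at the high-order bits of the first byte. It does not validate the continuation bytes or reject the few lead values that UTF-8 forbids.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the length of a UTF-8 sequence, judged from its first byte only."""
    if lead < 0x80:            # 0xxxxxxx: one byte, identical to ASCII
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a UTF-8 lead byte (continuation byte or invalid)")

for ch in ("a", "á", "š", "€"):
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(), utf8_sequence_length(encoded[0]))

# The same conversion iconv -f WINDOWS-1250 -t UTF-8 performs, for one byte.
converted = b"\xe1".decode("cp1250").encode("utf-8")
print(converted.hex())  # c3a1
```

    A lone E1 byte, as found in the CP 1250 file, would announce a three-byte sequence to a UTF-8 decoder, which is why the text after it is not decoded as intended.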

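    The last issue is the file command. It can't distinguish between the different 8-bit code pages - it's not smart enough to parse the language content to guess the character set. UTF-8 is different, because its multi-byte sequences follow a strict bit pattern that text in an 8-bit code page almost never matches by accident, so it can be recognized from the bytes alone. What file can't distinguish is plain English text in ISO-8859-1 from the same text in UTF-8 - with only ASCII characters, those would be the same bytes.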
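    The idea can be sketched in a few lines. This is only an illustration in the spirit of file -bi, not its actual algorithm, and guess_charset is a hypothetical helper: it tries a strict UTF-8 decode and falls back to "unknown-8bit" when that fails.

```python
def guess_charset(data: bytes) -> str:
    """Very rough charset guess; an illustration, not what file really does."""
    if all(b < 0x80 for b in data):
        return "us-ascii"          # pure ASCII is the same in every encoding here
    try:
        data.decode("utf-8")       # strict decode: any malformed sequence raises
    except UnicodeDecodeError:
        return "unknown-8bit"      # some 8-bit code page, but which one is unknown
    return "utf-8"

sample = "Náš svet okupujú"
print(guess_charset(sample.encode("cp1250")))  # unknown-8bit
print(guess_charset(sample.encode("utf-8")))   # utf-8
print(guess_charset(b"plain English text"))    # us-ascii
```

    The CP 1250 bytes fail the UTF-8 check straight away, but nothing in the bytes themselves says whether they are CP 1250, CP 1252 or an ISO set.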

    The ISO code pages differ from the Windows code pages in that the first 32 code points (numerical values) of "high ASCII", 128 to 159, are not assigned printable characters in the ISO sets, whereas the Windows code pages use most of them for printable characters.
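    A quick sketch of that difference, using Python's cp1250 and iso8859_2 codecs (printable_in_high_block is a helper name invented for this example). It decodes byte 9A hex, the one shown as <9A> in the first post, and counts how many bytes in the 128-159 block decode to something other than a control character.

```python
import unicodedata

def printable_in_high_block(codec: str) -> int:
    """Count bytes 0x80..0x9F that decode to a non-control character."""
    count = 0
    for n in range(0x80, 0xA0):
        try:
            ch = bytes([n]).decode(codec)
        except UnicodeDecodeError:
            continue               # byte is unassigned in this code page
        if unicodedata.category(ch) != "Cc":
            count += 1
    return count

byte = b"\x9a"
print(repr(byte.decode("cp1250")))     # 'š'
print(repr(byte.decode("iso8859_2")))  # '\x9a', a control character

print(printable_in_high_block("cp1250"))
print(printable_in_high_block("iso8859_2"))  # 0
```

    In Python's ISO-8859-2 codec the whole block maps to control characters, while CP 1250 fills most of it with letters and punctuation such as "š".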
