- 11-14-2012 #1
- Join Date
- Jul 2009
Charsets and encoding details
Hello gurus, I would like to dig deeper into charset and encoding issues. I also tried Google, but had no luck. Please see below.
My configuration:
[pista@HP-PC MULTIBOOT]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
[pista@HP-PC konvertovanie]$ cat file1
- Prich�dzaj�.
- Kto prich�dza?
N�� svet okupuj vyvinut� �udsk� druhy,
[pista@HP-PC konvertovanie]$ less file1
- Prich<E1>dzaj<FA>.
- Kto prich<E1>dza?
N<E1><9A> svet okupuj<FA> vyvinut<E9> <BE>udsk<E9> druhy,
[pista@HP-PC konvertovanie]$ vi file1
- Prichádzajú.
- Kto prichádza?
Ná<9a> svet okupujú vyvinuté ľudské druhy,
[pista@HP-PC konvertovanie]$ iconv -f WINDOWS-1250 -t UTF-8 file1
- Prichádzajú.
- Kto prichádza?
Náš svet okupujú vyvinuté ľudské druhy,
1.) When I check the decimal ASCII value of each character, I get the following lines. What do the negative values mean, and what is the code 341 (shown instead of á)? AFAIK ASCII only covers 0-127.
[pista@HP-PC konvertovanie]$ cat file1 | od -An -t dC -c
   45   32   80  114  105   99  104  -31  100  122   97  106   -6   46   13   10
    -         P    r    i    c    h  341    d    z    a    j  372    .   \r   \n
   45   32   75  116  111   32  112  114  105   99  104  -31  100  122   97   63
    -         K    t    o         p    r    i    c    h  341    d    z    a    ?
   13   10   78  -31 -102   32  115  118  101  116   32  111  107  117  112  117
   \r   \n    N  341  232         s    v    e    t         o    k    u    p    u
  106   -6   13   10   48   48   58   48   48   58   48   53   44   56   50   48
    j  372   \r   \n    0    0    :    0    0    :    0    5    ,    8    2    0
   32   45   45   62   32   48   48   58   48   48   58   48   55   44   54   53
         -    -    >         0    0    :    0    0    :    0    7    ,    6    5
   52   13   10  118  121  118  105  110  117  116  -23   32  -66  117  100  115
    4   \r   \n    v    y    v    i    n    u    t  351       276    u    d    s
  107  -23   32  100  114  117  104  121   44   13   10
    k  351         d    r    u    h    y    ,   \r   \n
Please look at the following sites; they show what happens if you encode with one encoding and decode with another. It seems that as long as you stay below the 127 (decimal) boundary, it does not matter if you decode with the wrong encoding (this is why some characters in the example above were displayed correctly even though the wrong encoding was used).
from UTF-8 to WINDOWS-1250
Encoding utf-8 to windows-1250
from WINDOWS-1250 to UTF-8
Encoding windows-1250 to utf-8
According to this site (The extreme UTF-8 table), the "á" character is encoded in UTF-8 as 225. According to Wikipedia (Windows-1250 - Wikipedia, the free encyclopedia), "á" also has the value 225 in Windows-1250. So why is "á" not displayed correctly even when the wrong encoding is used? Check here and type "á": Encoding / decoding tool. Analyze character encoding problems and errors. Another interesting observation: in the UTF-8 table the "š" character appears twice (once with code 154 and once with code 453). Why?
3.) If I understand it right, there is no way to tell how a file was encoded (unless there is some header that specifies this, or you do some statistical language analysis, etc.). So why/how does the "file" command recognize UTF-8 encoding but not WINDOWS-1250?
[pista@HP-PC konvertovanie]$ file -bi file1
text/plain; charset=unknown-8bit
[pista@HP-PC konvertovanie]$ iconv -f WINDOWS-1250 -t UTF-8 file1 > file1.utf8
[pista@HP-PC konvertovanie]$ file -bi file1.utf8
text/plain; charset=utf-8
- 11-15-2012 #2
- Join Date
- Mar 2009
- Santa Cruz, California
Some general information:
ASCII is a 7-bit character set based on an old Teletype code, mostly good for English. It contains 128 characters numbered 0 to 127 decimal or 0 to 7F hex. There are a number of 8-bit extensions to ASCII, commonly referred to as "code pages", such as the original IBM PC Western code page, and the various 8-bit character sets which have been used by MS-Windows and many Unix systems. Windows designations look like CP 1250, CP 1251, CP 1252, etc. There is also a parallel family with ISO designations, such as ISO-8859-1, ISO-8859-2, etc., which are similar to the Windows code pages but not identical.
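This also explains the od output in the question. A rough sketch in Python (my own illustration, with a made-up helper name `as_signed`): `-t dC` prints each byte as a signed 8-bit number, so any byte above 127 comes out negative, and `-c` prints bytes it can't show as a character as three octal digits. So -31 and 341 are the same byte, 225 decimal or E1 hex.

```python
# High bytes taken from the od output in the question (file1, Windows-1250).
HIGH_BYTES = [0xE1, 0xFA, 0x9A, 0xE9, 0xBE]


def as_signed(byte):
    """Reinterpret an unsigned byte (0..255) as a signed 8-bit value."""
    return byte - 256 if byte > 127 else byte


for b in HIGH_BYTES:
    char = bytes([b]).decode("cp1250")  # what the byte means in CP 1250
    print(f"unsigned={b:3d} signed={as_signed(b):4d} octal={b:03o} "
          f"hex={b:02X} cp1250={char}")
```

The first line printed is for byte E1: unsigned 225, signed -31, octal 341, which is "á" in CP 1250.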
What these character sets have in common is that they are 8-bit extensions of ASCII which are identical to ASCII for the first 128 characters (so that they can all display English text), but the next 128 characters are different for the different code pages.
For example, Windows CP 1250 includes characters needed for various Central European languages using Latin letters with accents in the "high ASCII" range 128-255, CP 1251 places Cyrillic characters in this range, and CP 1252 uses this range for accented characters used in Western European languages, as well as various punctuation characters. The ISO character sets are similar, but I won't go into further detail, because you can look up character tables on the WWW.
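To make that concrete, here is a small sketch (assuming Python's built-in `cp1250`, `cp1251` and `cp1252` codecs; the function name `decode_with` is just for illustration). It decodes the same three high bytes from the question's file under each code page, and shows that plain ASCII text comes out the same everywhere.

```python
CODE_PAGES = ("cp1250", "cp1251", "cp1252")


def decode_with(raw, codecs=CODE_PAGES):
    """Decode the same raw bytes under several 8-bit code pages."""
    return {name: raw.decode(name) for name in codecs}


high = bytes([0xE1, 0x9A, 0xBE])   # "ášľ" when read as CP 1250
ascii_only = b"Kto"                # every byte is below 128

for name, text in decode_with(high).items():
    print(name, repr(text))        # different characters per code page

for name, text in decode_with(ascii_only).items():
    print(name, repr(text))        # always 'Kto'
```

Only the CP 1250 reading gives the intended Slovak letters; the ASCII part survives any of these code pages, which is why most of the text in the question looked fine even with the wrong encoding.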
Then there is Unicode - a way of numbering the characters needed for most of the world's languages. It started out as a 16-bit set, and people often still mean that when they say "Unicode", but code points now run from 0 to 10FFFF hex. The first 256 code points in Unicode correspond to ISO 8859-1, for Western languages. Further characters, for other languages, and a lot of special characters and punctuation, are further up.
UTF-8 is a way to represent Unicode code points using 1 to 4 bytes for each character. Only the ASCII range (0-127) is stored as a single byte; beyond that it is not a 1-to-1 byte translation of any of the Windows or ISO code pages, not even ISO-8859-1. Unicode characters > 127 take 2 or more bytes in the UTF-8 encoding. UTF-8 uses the high-order bits of the first byte in a character to indicate how many bytes a character will take. Your example involves Windows CP 1250, for Polish or Czech or Slovak (I'm not sure which), so the accented letters have Unicode values > 127 and take 2 bytes each in UTF-8. For example, "á" is code point 225 in Unicode and byte 225 (E1 hex) in CP 1250, but in UTF-8 it is the two bytes C3 A1, so there is no byte-by-byte correspondence between the UTF-8 and the CP 1250.
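A minimal sketch of the byte counts, using a word from the question's file (the helper `utf8_len` is a name I made up; the codecs are Python's standard ones):

```python
def utf8_len(ch):
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


word = "Náš"
cp1250_bytes = word.encode("cp1250")   # one byte per character
utf8_bytes = word.encode("utf-8")      # accented letters take two bytes

print(cp1250_bytes.hex(" "))           # 4e e1 9a
print(utf8_bytes.hex(" "))             # 4e c3 a1 c5 a1

for ch in ("N", "á", "š", "€", "😀"):
    lead = ch.encode("utf-8")[0]
    # The leading bits of the first byte announce the sequence length:
    # 0xxxxxxx = 1 byte, 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4.
    print(f"U+{ord(ch):04X} {utf8_len(ch)} byte(s), first byte {lead:08b}")
```

Reading the UTF-8 bytes C3 A1 as if they were CP 1250 gives two wrong characters instead of "á", which is the kind of garbage the converter sites in the question show.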
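The last issue, the file command: it can't distinguish between the different 8-bit code pages - it's not smart enough to parse the language content to guess the character set. UTF-8 is different, because its multi-byte sequences follow a strict bit pattern that text in an 8-bit code page almost never matches by accident, so a file that validates as UTF-8 can be reported as such. What can't be distinguished is plain ASCII text from ISO-8859-1 or UTF-8 - those would be byte-for-byte the same.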
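One way to get a feel for why UTF-8 is recognizable while an 8-bit code page is not: try a strict UTF-8 decode and see whether it fails. This is only a sketch of the idea, not how the file command is actually implemented, and `looks_like_utf8` is a hypothetical helper name.

```python
def looks_like_utf8(data):
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


original = "Náš svet".encode("cp1250")                   # like file1
converted = original.decode("cp1250").encode("utf-8")    # like file1.utf8

print(looks_like_utf8(original))    # False: E1 9A 20 is not a valid sequence
print(looks_like_utf8(converted))   # True
print(looks_like_utf8(b"Kto"))      # True: pure ASCII is also valid UTF-8
```

Note the check only says "valid UTF-8 or not". When it fails, nothing in the bytes tells you which 8-bit code page was used, so a guess would need language statistics.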
The ISO code pages differ from the Windows code pages in that the first 32 code points (numerical values) in "high ASCII", 128 to 159, are reserved for control codes in the ISO sets and shouldn't be used for text, whereas the Windows code pages assign printable characters to most of them.
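As a quick illustration (a sketch with a made-up helper name, `bytes_in_c1_range`), you can look for bytes in the 128-159 range. The "š" in the question's file is byte 9A hex in CP 1250, which falls in that range; this is probably why vi showed it as <9a> while the other accented letters looked fine. In ISO-8859-2 the same letter sits at B9 hex, outside the range.

```python
def bytes_in_c1_range(data):
    """Return the distinct byte values between 0x80 and 0x9F, sorted."""
    return sorted({b for b in data if 0x80 <= b <= 0x9F})


word = "Náš"
as_cp1250 = word.encode("cp1250")       # š becomes 0x9A
as_iso8859_2 = word.encode("iso8859_2") # š becomes 0xB9

print([hex(b) for b in bytes_in_c1_range(as_cp1250)])     # ['0x9a']
print([hex(b) for b in bytes_in_c1_range(as_iso8859_2)])  # []
```

Finding bytes in this range is a hint that a file is in a Windows code page rather than an ISO one, but it still does not say which Windows code page.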