- 04-30-2010 #1 Just Joined!
- Join Date: Apr 2010
- Posts: 1
encoding 850 and substring extraction
I noticed a strange behaviour on Red Hat Enterprise Linux 4.7.
Can someone help me understand it?
If I have a string containing a character outside standard ASCII (let's say decimal 212, hex d4), I expect substring extraction with ${string:position:length} to work, provided that the current encoding setting includes that character.
So UTF-8 will be misled, because it expects a valid multibyte sequence after x'd4', but both 850 and ISO-8859-1 should work, because they include x'd4', even if with a different appearance (an accented E and a circumflex-accented O, respectively).
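As a quick sanity check on the byte value, here is a minimal sketch that is independent of the script further down. The variable name s is only for illustration, and the octal escape \324 is used because it is the portable way to write byte 0xd4 in printf.

```shell
# Decimal 212 written in hexadecimal.
printf '%x\n' 212

# Build the 7-byte test string; \324 is the octal escape for byte 0xd4.
s=$(printf 'ABC\324DEF')

# Dump the bytes in hex, without the offset column.
printf '%s' "$s" | od -An -tx1
```

The first command should print d4, and the dump should show the bytes 41 42 43 d4 44 45 46, whatever the locale is.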
Yet there are strange inconsistencies, shown in the script below.
At the end of the script I have included, as a comment, the output that I get.
Max
-------------------------------------
# script to show encoding/substring extraction problems
echo lang is initially $LANG
# create a file containing character 212
echo dummy | awk '{print "ABC" "\xd4" "DEF";}' > pippo
x="$(cat pippo)"
echo ====
echo lang is $LANG
echo x is:
echo "$x" | od -tx1z
echo substring of x is:
echo "${x:0:7}" | od -tx1z
LANG=en_US.850
echo ====
echo lang is $LANG
echo x is:
echo "$x" | od -tx1z
echo substring of x is:
echo "${x:0:7}" | od -tx1z
LANG=en_US.iso-8859-1
echo ====
echo lang is $LANG
echo x is:
echo "$x" | od -tx1z
echo substring of x is:
echo "${x:0:7}" | od -tx1z
LANG=en_US.850
echo ====
echo lang is $LANG
echo x is:
echo "$x" | od -tx1z
echo substring of x is:
echo "${x:0:7}" | od -tx1z
LANG=en_US.utf-8
echo ====
echo lang is $LANG
echo x is:
echo "$x" | od -tx1z
echo substring of x is:
echo "${x:0:7}" | od -tx1z
# this is the output that I get
# lang is initially en_US.utf-8
# ====
# lang is en_US.utf-8
# x is:
# 0000000 41 42 43 d4 44 45 46 0a >ABC.DEF.<
# 0000010
# substring of x is:
# 0000000 41 42 43 0a >ABC.<
# 0000004
# ====
# lang is en_US.850
# x is:
# 0000000 41 42 43 d4 44 45 46 0a >ABC.DEF.<
# 0000010
# substring of x is:
# 0000000 41 42 43 0a >ABC.<
# 0000004
# ====
# lang is en_US.iso-8859-1
# x is:
# 0000000 41 42 43 d4 44 45 46 0a >ABCoDEF.<
# 0000010
# substring of x is:
# 0000000 41 42 43 d4 44 45 46 0a >ABCoDEF.<
# 0000010
# ====
# lang is en_US.850
# x is:
# 0000000 41 42 43 d4 44 45 46 0a >ABC.DEF.<
# 0000010
# substring of x is:
# 0000000 41 42 43 d4 44 45 46 0a >ABC.DEF.<
# 0000010
# ====
# lang is en_US.utf-8
# x is:
# 0000000 41 42 43 d4 44 45 46 0a >ABC.DEF.<
# 0000010
# substring of x is:
# 0000000 41 42 43 0a >ABC.<
# 0000004
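
One assumption the script makes is that every LANG value names a locale that is actually installed; if the C library does not know a name, the setting may not take effect. The following is a minimal sketch for checking this, not a diagnosis. charmap_for is a hypothetical helper name, and the charmap strings printed differ between systems.

```shell
# charmap_for NAME: print the character map the C library reports when
# LC_ALL is set to NAME (hypothetical helper, for illustration only).
charmap_for() {
    LC_ALL=$1 locale charmap 2>/dev/null
}

# Show which charmap each locale name used in the script resolves to.
for name in en_US.utf-8 en_US.850 en_US.iso-8859-1; do
    printf '%s -> %s\n' "$name" "$(charmap_for "$name")"
done
```

On a glibc system, a name that is not installed typically resolves to the charmap of the C locale rather than to the one requested, so a line that does not show the expected encoding is worth a closer look.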

