- 11-20-2008 #1Just Joined!
- Join Date
- Nov 2008
- Posts
- 5
Weirdness with BASH wildcards
So this is really bugging me. Why is [a-z] not case sensitive, but [A-Z] is? For example:
# ls -l
total 0
-rw-r--r-- 1 root root 0 Nov 20 12:22 xa
-rw-r--r-- 1 root root 0 Nov 20 12:22 xA
# ls -l x[a-z]
-rw-r--r-- 1 root root 0 Nov 20 12:22 xa
-rw-r--r-- 1 root root 0 Nov 20 12:22 xA
# ls -l x[A-Z]
-rw-r--r-- 1 root root 0 Nov 20 12:22 xA
Any ideas?
grendelos
- 11-21-2008 #2Just Joined!
- Join Date
- Nov 2008
- Posts
- 26
Interesting. And of course, `man bash` makes no mention of this.
I have always disliked bash passionately since it seems to be a cheap knockoff of a solution that's already free (ksh). Why do we need a cheap knockoff for a free entity? Who cares that we don't own the source, if we're not planning on changing it?!?!!
If you install pdksh, and switch to that as your shell, you will find that it operates more like you'd expect. (I tried it. Your first example yields two lines. Your 2nd/3rd examples yield one line each.)
- 11-21-2008 #3
Your system works differently from mine, so I'm playing blind here. So in that same directory in which you've been working, do this:
Code:
touch xa; touch xb; touch xc
touch xA; touch xB; touch xC
echo $LANG
ls x*
What output do you get?
--
Bill
Old age and treachery will overcome youth and skill.
- 11-21-2008 #4Just Joined!
- Join Date
- Nov 2008
- Posts
- 5
Here is what I get.
# touch xa; touch xb; touch xc
# touch xA; touch xB; touch xC
# echo $LANG
en_US.UTF-8
# ls x*
xa xA xb xB xc xC
grendelos
- 11-21-2008 #5
The issue of interest here is not case sensitivity, but the order in which the letters are collated. In your case, the order is:
Code:
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
So the range [a-z] runs from a through z, which is everything except the final Z:
Code:
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYz
and the range [A-Z] runs from A through Z:
Code:
AbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
This excludes the lower case a, which explains why it was excluded in your original [A-Z] experiment.
Now try this:
Code:
export LANG=en_US
ls x*
You should get
Code:
xA xB xC xa xb xc
because the collating sequence is
Code:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
and the range [A-Z] will get you
Code:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
and the range [a-z] will get you
Code:
abcdefghijklmnopqrstuvwxyz
Then if you try this:
Code:
ls x[a-z]
ls x[A-Z]
you should get what looks like case sensitivity in both cases.
Hope this helps.
--
Bill
Old age and treachery will overcome youth and skill.
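The range behaviour described above is easy to reproduce in a scratch directory. Here is a minimal sketch under a few assumptions: a POSIX-style shell, a case-sensitive filesystem, and the common (but non-POSIX) `mktemp -d`. The function names `show_order` and `glob_ranges` are made up for illustration. Only the C locale results are predictable everywhere; what en_US or en_US.UTF-8 give depends on the locales installed and on the shell version.

```shell
#!/bin/sh
# show_order LOCALE: print how the given locale collates a few letters.
show_order() {
    printf '%s\n' a A b B z Z | LC_ALL="$1" sort | tr '\n' ' '
    echo
}

# glob_ranges LOCALE: print what x[a-z] and x[A-Z] match when the
# directory holds only xa and xA. Runs in a subshell so the locale
# change and the cd do not leak into the calling shell.
glob_ranges() (
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    touch xa xA
    LC_ALL=$1
    export LC_ALL
    echo "x[a-z]:" x[a-z]
    echo "x[A-Z]:" x[A-Z]
    cd / && rm -rf "$dir"
)

show_order C      # upper case first: A B Z a b z
glob_ranges C     # x[a-z] matches xa only, x[A-Z] matches xA only
```

Calling `show_order "$LANG"` and `glob_ranges "$LANG"` afterwards shows what the current login locale does with the same input.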
- 11-21-2008 #6Just Joined!
- Join Date
- Nov 2008
- Posts
- 5
Thanks. That is a very good explanation. I have also stumbled across this suggestion which works as well:
export LC_ALL=C
But I am not certain what this variable represents exactly. Are LC_ALL=C and LANG=en_US related somehow or another?
grendelos
- 11-21-2008 #7Linux Newbie
- Join Date
- Jul 2008
- Posts
- 181
The problem here is the unexpected "collation order". The bash manpage does actually mention the "LC_COLLATE" variable, but if you do not know what it does beforehand, you will probably not understand the explanation. See "man 7 glob" for some more information on collation order.
You can set the collation order to "C" in order to get the old-style ASCII order. If you want to refer to upper case letters, lower case letters or all letters, you should probably use named character classes anyway, such as "[:lower:]" (written "[[:lower:]]" when used in a glob).
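A short sketch of the named-class approach, assuming a shell whose globbing supports POSIX character classes (bash, ksh and dash do) and a case-sensitive filesystem. The function name `list_by_case` and the scratch directory are illustrative only.

```shell
#!/bin/sh
# list_by_case DIR: show which two-character names of the form x? in DIR
# end in an upper case letter and which end in a lower case letter.
# [[:upper:]] and [[:lower:]] do not depend on where the locale places
# 'a' relative to 'A', so no collation order is involved.
list_by_case() (
    cd "$1" || exit 1
    echo "upper:" x[[:upper:]]
    echo "lower:" x[[:lower:]]
)

demo=$(mktemp -d) || exit 1
touch "$demo/xa" "$demo/xb" "$demo/xA" "$demo/xB"
list_by_case "$demo"
rm -rf "$demo"
```

The outer brackets form the bracket expression and the inner `[:upper:]` names the class, which is why the doubled brackets are needed.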
- 11-21-2008 #8Linux Newbie
- Join Date
- Jul 2008
- Posts
- 181
Clearly, you do not know what you are talking about.
First of all, bash is not a "knockoff" of ksh. Both the name of the "Bourne-Again SHell" and the description section of the man page should have told you that. Secondly, although the man page does not explain the effect in sufficient detail, it is clearly not a bug, but an internationalization feature.
- 11-21-2008 #9
Quoth the highly esteemed delovelady:
    "I have always disliked bash passionately since it seems to be a cheap knockoff"
... and the equally highly esteemed burschik:
    "Clearly, you do not know what you are talking about."
Um, folks, let's push beyond the pleasantries and get to the heart of the matter, mmmkay?
First, I must apologize for not running the following tests with pdksh. It's not installed on my system, and I was too lazy to install it just for the purpose of these tests.
Instead, I use the standard AT&T version of ksh. Like pdksh, it's free (as in "beer" and "speech"). Like pdksh, the source is freely available and compilable.
Moving right along:
There are several potential environment variables involved here. Each of them can have the same possible values, with the same meanings of those values. Here they are:
Code:
LANG
LC_ALL
LC_COLLATE
LC_CTYPE
LC_MESSAGES
LC_MONETARY
LC_NUMERIC
LC_TIME
These environment variables address internationalization issues and character set issues.
The simplest value of each of these variables is C, or POSIX; both values mean the same thing. It basically says that no internationalization customization is to be done, and the collating sequence is that of the standard ASCII character set. To see the order of characters in the standard ASCII character set, do this at the command line:
Code:
man ascii
If your system does not have this man page, you can scroogle it as follows:
Code:
Linux man ascii
Note that all the upper case letters appear before any of the lower case letters.
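On how these variables relate to one another: POSIX gives a lookup order for each category. For collation, LC_ALL wins if it is set and non-empty, otherwise LC_COLLATE, otherwise LANG. The sketch below only mimics that lookup in plain shell; `effective_collate` is a made-up helper name, and falling back to C when nothing is set is an assumption (POSIX leaves that default to the implementation).

```shell
#!/bin/sh
# effective_collate: print the locale name that governs collation,
# following the POSIX lookup order LC_ALL, then LC_COLLATE, then LANG.
# Unset or empty variables are skipped. The final fallback to "C" is an
# assumption; POSIX calls that default implementation-defined.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        echo "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        echo "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        echo "$LANG"
    else
        echo C
    fi
}

effective_collate
```

This is why `export LC_ALL=C` changes the glob results even when LANG is still en_US.UTF-8: LC_ALL is consulted first and overrides everything else.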
Other interesting values are en_US (which means English, as spoken in the US), and en_US.UTF-8. The UTF-8 part names the character encoding; with en_US.UTF-8 on this system, the letters are ordered thus:
Code:
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
(Look familiar?)
What can we use instead of en_US? That depends on your system. I'm using Slackware 12.1. The following files are in directory /usr/share/i18n/locales:
Code:
aa_DJ es_MX nl_BE
aa_ER es_NI nl_BE@euro
aa_ER@saaho es_PA nl_NL
aa_ET es_PE nl_NL@euro
af_ZA es_PR nn_NO
am_ET es_PY nr_ZA
an_ES es_SV nso_ZA
ar_AE es_US oc_FR
ar_BH es_UY om_ET
ar_DZ es_VE om_KE
ar_EG et_EE or_IN
ar_IN eu_ES pa_IN
ar_IQ eu_ES@euro pap_AN
ar_JO fa_IR pa_PK
ar_KW fi_FI pl_PL
ar_LB fi_FI@euro POSIX
ar_LY fil_PH pt_BR
ar_MA fo_FO pt_PT
ar_OM fr_BE pt_PT@euro
ar_QA fr_BE@euro ro_RO
ar_SA fr_CA ru_RU
ar_SD fr_CH ru_UA
ar_SY fr_FR rw_RW
ar_TN fr_FR@euro sa_IN
ar_YE fr_LU sc_IT
as_IN fr_LU@euro se_NO
ast_ES fur_IT sid_ET
az_AZ fy_DE si_LK
be_BY fy_NL sk_SK
be_BY@latin ga_IE sl_SI
ber_DZ ga_IE@euro so_DJ
ber_MA gd_GB so_ET
bg_BG gez_ER so_KE
bn_BD gez_ER@abegede so_SO
bn_IN gez_ET sq_AL
br_FR gez_ET@abegede sr_ME
br_FR@euro gl_ES sr_RS
bs_BA gl_ES@euro sr_RS@latin
byn_ER gu_IN ss_ZA
ca_AD gv_GB st_ZA
ca_ES ha_NG sv_FI
ca_ES@euro he_IL sv_FI@euro
ca_FR hi_IN sv_SE
ca_IT hr_HR ta_IN
crh_UA hsb_DE te_IN
csb_PL hu_HU tg_TJ
cs_CZ hy_AM th_TH
cy_GB i18n ti_ER
da_DK id_ID ti_ET
de_AT ig_NG tig_ER
de_AT@euro ik_CA tk_TM
de_BE is_IS tl_PH
de_BE@euro iso14651_t1 tn_ZA
de_CH iso14651_t1_common translit_circle
de_DE iso14651_t1_pinyin translit_cjk_compat
de_DE@euro it_CH translit_cjk_variants
de_LU it_IT translit_combining
de_LU@euro it_IT@euro translit_compat
dz_BT iu_CA translit_font
el_CY iw_IL translit_fraction
el_GR ja_JP translit_hangul
el_GR@euro ka_GE translit_narrow
en_AU kk_KZ translit_neutral
en_BW kl_GL translit_small
en_CA km_KH translit_wide
en_DK kn_IN tr_CY
en_GB ko_KR tr_TR
en_HK ku_TR ts_ZA
en_IE kw_GB tt_RU
en_IE@euro ky_KG tt_RU@iqtelif
en_IN lg_UG ug_CN
en_NG li_BE uk_UA
en_NZ li_NL ur_PK
en_PH lo_LA uz_UZ
en_SG lt_LT uz_UZ@cyrillic
en_US lv_LV ve_ZA
en_ZA mai_IN vi_VN
en_ZW mg_MG wa_BE
es_AR mi_NZ wa_BE@euro
es_BO mk_MK wal_ET
es_CL ml_IN wo_SN
es_CO mn_MN xh_ZA
es_CR mr_IN yi_US
es_DO ms_MY yo_NG
es_EC mt_MT zh_CN
es_ES nb_NO zh_HK
es_ES@euro nds_DE zh_SG
es_GT nds_NL zh_TW
es_HN ne_NP zu_ZA
What can we use instead of UTF-8? Well, we can leave it (and the preceding period) out, or again use whatever's installed on the system. For Slackware 12.1, I see the following files in /usr/share/i18n/charmaps:
Code:
ANSI_X3.110-1983.gz IBM869.gz
ANSI_X3.4-1968.gz IBM870.gz
ARMSCII-8.gz IBM871.gz
ASMO_449.gz IBM874.gz
BIG5.gz IBM875.gz
BIG5-HKSCS.gz IBM880.gz
BRF.gz IBM891.gz
BS_4730.gz IBM903.gz
BS_VIEWDATA.gz IBM904.gz
CP10007.gz IBM905.gz
CP1125.gz IBM918.gz
CP1250.gz IBM922.gz
CP1251.gz IEC_P27-1.gz
CP1252.gz INIS-8.gz
CP1253.gz INIS-CYRILLIC.gz
CP1254.gz INIS.gz
CP1255.gz INVARIANT.gz
CP1256.gz ISIRI-3342.gz
CP1257.gz ISO_10367-BOX.gz
CP1258.gz ISO_10646.gz
CP737.gz ISO_11548-1.gz
CP775.gz ISO_2033-1983.gz
CP949.gz ISO_5427-EXT.gz
CSA_Z243.4-1985-1.gz ISO_5427.gz
CSA_Z243.4-1985-2.gz ISO_5428.gz
CSA_Z243.4-1985-GR.gz ISO_646.BASIC.gz
CSN_369103.gz ISO_646.IRV.gz
CWI.gz ISO_6937-2-25.gz
DEC-MCS.gz ISO_6937-2-ADD.gz
DIN_66003.gz ISO_6937.gz
DS_2089.gz ISO-8859-10.gz
EBCDIC-AT-DE-A.gz ISO-8859-11.gz
EBCDIC-AT-DE.gz ISO-8859-13.gz
EBCDIC-CA-FR.gz ISO-8859-14.gz
EBCDIC-DK-NO-A.gz ISO-8859-15.gz
EBCDIC-DK-NO.gz ISO-8859-16.gz
EBCDIC-ES-A.gz ISO_8859-1,GL.gz
EBCDIC-ES.gz ISO-8859-1.gz
EBCDIC-ES-S.gz ISO-8859-2.gz
EBCDIC-FI-SE-A.gz ISO-8859-3.gz
EBCDIC-FI-SE.gz ISO-8859-4.gz
EBCDIC-FR.gz ISO-8859-5.gz
EBCDIC-IS-FRISS.gz ISO-8859-6.gz
EBCDIC-IT.gz ISO-8859-7.gz
EBCDIC-PT.gz ISO-8859-8.gz
EBCDIC-UK.gz ISO-8859-9E.gz
EBCDIC-US.gz ISO-8859-9.gz
ECMA-CYRILLIC.gz ISO_8859-SUPP.gz
ES2.gz ISO-IR-197.gz
ES.gz ISO-IR-209.gz
EUC-JISX0213.gz ISO-IR-90.gz
EUC-JP.gz IT.gz
EUC-JP-MS.gz JIS_C6220-1969-JP.gz
EUC-KR.gz JIS_C6220-1969-RO.gz
EUC-TW.gz JIS_C6229-1984-A.gz
GB18030.gz JIS_C6229-1984-B-ADD.gz
GB_1988-80.gz JIS_C6229-1984-B.gz
GB2312.gz JIS_C6229-1984-HAND-ADD.gz
GBK.gz JIS_C6229-1984-HAND.gz
GEORGIAN-ACADEMY.gz JIS_C6229-1984-KANA.gz
GEORGIAN-PS.gz JIS_X0201.gz
GOST_19768-74.gz JOHAB.gz
GREEK7.gz JUS_I.B1.002.gz
GREEK7-OLD.gz JUS_I.B1.003-MAC.gz
GREEK-CCITT.gz JUS_I.B1.003-SERB.gz
HP-ROMAN8.gz KOI-8.gz
IBM037.gz KOI8-R.gz
IBM038.gz KOI8-RU.gz
IBM1004.gz KOI8-T.gz
IBM1026.gz KOI8-U.gz
IBM1047.gz KSC5636.gz
IBM1124.gz LATIN-GREEK-1.gz
IBM1129.gz LATIN-GREEK.gz
IBM1132.gz MAC-CENTRALEUROPE.gz
IBM1133.gz MAC-CYRILLIC.gz
IBM1160.gz MACINTOSH.gz
IBM1161.gz MAC-IS.gz
IBM1162.gz MAC-SAMI.gz
IBM1163.gz MAC-UK.gz
IBM1164.gz MIK.gz
IBM256.gz MSZ_7795.3.gz
IBM273.gz NATS-DANO-ADD.gz
IBM274.gz NATS-DANO.gz
IBM275.gz NATS-SEFI-ADD.gz
IBM277.gz NATS-SEFI.gz
IBM278.gz NC_NC00-10.gz
IBM280.gz NEXTSTEP.gz
IBM281.gz NF_Z_62-010_1973.gz
IBM284.gz NF_Z_62-010.gz
IBM285.gz NS_4551-1.gz
IBM290.gz NS_4551-2.gz
IBM297.gz PT154.gz
IBM420.gz PT2.gz
IBM423.gz PT.gz
IBM424.gz RK1048.gz
IBM437.gz SAMI.gz
IBM500.gz SAMI-WS2.gz
IBM850.gz SEN_850200_B.gz
IBM851.gz SEN_850200_C.gz
IBM852.gz SHIFT_JIS.gz
IBM855.gz SHIFT_JISX0213.gz
IBM856.gz T.101-G2.gz
IBM857.gz T.61-7BIT.gz
IBM860.gz T.61-8BIT.gz
IBM861.gz TCVN5712-1.gz
IBM862.gz TIS-620.gz
IBM863.gz TSCII.gz
IBM864.gz UTF-8.gz
IBM865.gz VIDEOTEX-SUPPL.gz
IBM866.gz VISCII.gz
IBM866NAV.gz WINDOWS-31J.gz
IBM868.gz
What do the various environment variables do with these values? This post is already getting too long, so go here for the answer.
Now let's explore the question of whether these environment variables are used by ls, or by the shell. The short answer is: both!
If you want to see ls sort the filenames, do this at the command line:
Code:
ls
Note that with this command, no shell globbing is done, because you haven't used anything like * or ? or x[A-Z].
If you want the shell to sort the filenames, do this at the command line:
Code:
echo *
Moving right along, I wrote a shell script which demonstrates that the behavior is the same, whether the filenames are sorted by ls, bash, or ksh. Here's the script:
Code:
#!/bin/bash

#-----------------------------------------------------------------------------

try_both()
{
  cat > collation_script

  echo === we are about to run this script:
  cat collation_script

  echo === here is the output for bash:
  bash < collation_script 2>&1

  echo === here is the output for ksh:
  ksh < collation_script 2>&1
}

#-----------------------------------------------------------------------------

rm -rf collation_experiment
umask 077
mkdir collation_experiment

touch collation_experiment/a
touch collation_experiment/b
touch collation_experiment/c
touch collation_experiment/A
touch collation_experiment/B
touch collation_experiment/C

unset LANG
unset LC_ALL
unset LC_COLLATE
unset LC_CTYPE
unset LC_MESSAGES
unset LC_MONETARY
unset LC_NUMERIC
unset LC_TIME
unset LC_NLSPATH

echo +++ Demonstrate that we really are executing each shell.

try_both <<EOD
help
EOD

echo +++ Use no environment variables.

try_both <<EOD
cd collation_experiment
echo /// ls handles the sorting:
ls
echo /// The shell handles the sorting:
echo *
EOD

echo +++ Use LANG=C\; LANG=POSIX would do the same thing.
echo +++ We\'ll see the same output as above.

export LANG=C

try_both <<EOD
cd collation_experiment
echo /// ls handles the sorting:
ls
echo /// The shell handles the sorting:
echo *
EOD

echo +++ Use LANG=en_US.UTF-8.

export LANG=en_US.UTF-8

try_both <<EOD
cd collation_experiment
echo /// ls handles the sorting:
ls
echo /// The shell handles the sorting:
echo *
EOD
I got the following output from this script:
Code:
+++ Demonstrate that we really are executing each shell.
=== we are about to run this script:
help
=== here is the output for bash:
GNU bash, version 3.1.17(2)-release (i486-slackware-linux-gnu)
These shell commands are defined internally. Type `help' to see this list.
Type `help name' to find out more about the function `name'.
Use `info bash' to find out more about the shell in general.
Use `man -k' or `info' to find out more about commands not in this list.

A star (*) next to a name means that the command is disabled.

 JOB_SPEC [&]                       (( expression ))
 . filename [arguments]             :
 [ arg... ]                         [[ expression ]]
 alias [-p] [name[=value] ... ]     bg [job_spec ...]
 bind [-lpvsPVS] [-m keymap] [-f fi break [n]
 builtin [shell-builtin [arg ...]]  caller [EXPR]
 case WORD in [PATTERN [| PATTERN]. cd [-L|-P] [dir]
 command [-pVv] command [arg ...]   compgen [-abcdefgjksuv] [-o option
 complete [-abcdefgjksuv] [-pr] [-o continue [n]
 declare [-afFirtx] [-p] [name[=val dirs [-clpv] [+N] [-N]
 disown [-h] [-ar] [jobspec ...]    echo [-neE] [arg ...]
 enable [-pnds] [-a] [-f filename]  eval [arg ...]
 exec [-cl] [-a name] file [redirec exit [n]
 export [-nf] [name[=value] ...] or false
 fc [-e ename] [-nlr] [first] [last fg [job_spec]
 for NAME [in WORDS ... ;] do COMMA for (( exp1; exp2; exp3 )); do COM
 function NAME { COMMANDS ; } or NA getopts optstring name [arg]
 hash [-lr] [-p pathname] [-dt] [na help [-s] [pattern ...]
 history [-c] [-d offset] [n] or hi if COMMANDS; then COMMANDS; [ elif
 jobs [-lnprs] [jobspec ...] or job kill [-s sigspec | -n signum | -si
 let arg [arg ...]                  local name[=value] ...
 logout                             popd [+N | -N] [-n]
 printf [-v var] format [arguments] pushd [dir | +N | -N] [-n]
 pwd [-LP]                          read [-ers] [-u fd] [-t timeout] [
 readonly [-af] [name[=value] ...]  return [n]
 select NAME [in WORDS ... ;] do CO set [--abefhkmnptuvxBCHP] [-o opti
 shift [n]                          shopt [-pqsu] [-o long-option] opt
 source filename [arguments]        suspend [-f]
 test [expr]                        time [-p] PIPELINE
 times                              trap [-lp] [arg signal_spec ...]
 true                               type [-afptP] name [name ...]
 typeset [-afFirtx] [-p] name[=valu ulimit [-SHacdfilmnpqstuvx] [limit
 umask [-p] [-S] [mode]             unalias [-a] name [name ...]
 unset [-f] [-v] [name ...]         until COMMANDS; do COMMANDS; done
 variables - Some variable names an wait [n]
 while COMMANDS; do COMMANDS; done  { COMMANDS ; }
=== here is the output for ksh:
ksh: line 1: help: not found
+++ Use no environment variables.
=== we are about to run this script:
cd collation_experiment
echo /// ls handles the sorting:
ls
echo /// The shell handles the sorting:
echo *
=== here is the output for bash:
/// ls handles the sorting:
A B C a b c
/// The shell handles the sorting:
A B C a b c
=== here is the output for ksh:
/// ls handles the sorting:
A B C a b c
/// The shell handles the sorting:
A B C a b c
+++ Use LANG=C; LANG=POSIX would do the same thing.
+++ We'll see the same output as above.
=== we are about to run this script:
cd collation_experiment
echo /// ls handles the sorting:
ls
echo /// The shell handles the sorting:
echo *
=== here is the output for bash:
/// ls handles the sorting:
A B C a b c
/// The shell handles the sorting:
A B C a b c
=== here is the output for ksh:
/// ls handles the sorting:
A B C a b c
/// The shell handles the sorting:
A B C a b c
+++ Use LANG=en_US.UTF-8.
=== we are about to run this script:
cd collation_experiment
echo /// ls handles the sorting:
ls
echo /// The shell handles the sorting:
echo *
=== here is the output for bash:
/// ls handles the sorting:
a A b B c C
/// The shell handles the sorting:
a A b B c C
=== here is the output for ksh:
/// ls handles the sorting:
a A b B c C
/// The shell handles the sorting:
a A b B c C
Three loose ends.
Loose end one.
You'll recall that for Slackware 12.1 I mentioned these directories:
Code:
/usr/share/i18n/locales
/usr/share/i18n/charmaps
You may well ask: what's the 18 doing in there? It's an abbreviation for 18 characters which have been left out. Here's the count:
Code:
internationalization
 000000000111111111
 123456789012345678
internationalisation
Loose end two.
The danger with all of this is that if filenames are globbed differently from system to system, you can accidentally delete files you didn't intend to. My wife's system uses en_US.UTF-8; mine uses en_US. I wanted to delete all files in a particular directory which began with upper case letters, so I did this:
Code:
rm [A-Z]*
That would have been fine on my system, but on hers, it also deleted all files which began with lower case letters, except the letter a.
Sigh.
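One way to guard against that kind of accident is to pin the locale for the glob and to look at the matches before deleting anything. The following is only a sketch: `rm_upper` is a hypothetical helper, the `-n` dry-run flag is its own convention, and it assumes a POSIX-style shell and a case-sensitive filesystem.

```shell
#!/bin/sh
# rm_upper DIR [-n]: remove files in DIR whose names start with an ASCII
# upper case letter. The glob is expanded in a subshell under the C
# locale, so [A-Z] means exactly the 26 upper case letters regardless of
# the caller's LANG. With -n, the matches are listed but not removed.
rm_upper() (
    cd "$1" || exit 1
    LC_ALL=C
    export LC_ALL
    for f in [A-Z]*; do
        [ -e "$f" ] || continue      # the pattern matched nothing
        if [ "${2:-}" = "-n" ]; then
            echo "would remove: $f"
        else
            rm -- "$f"
        fi
    done
)
```

Running `rm_upper some_directory -n` first shows exactly what a real run would delete. Because the locale change happens inside the subshell, the caller's environment is left alone.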
Third loose end.
delovelady, I would request that you run this same script using pdksh instead of the standard ksh, and report back whether you observe the same behavior, or different behavior.
If the behavior is different, I would call that a bug in pdksh. But I'm guessing that the behavior is the same.
--
Bill
Old age and treachery will overcome youth and skill.
- 11-21-2008 #10Linux Newbie
- Join Date
- Jul 2008
- Posts
- 181
Thanks for the lucid exposition.
This is clearly bad, but what are the alternatives? The old ascii behaviour is only suitable for English, and possibly other languages that use exactly the same alphabet (offhand, I can't think of any). If someone speaking German or French, for example, used a glob like "[a-z]" he or she would naturally expect accented or umlaut vowels to be included in that range. On the other hand, someone speaking English would probably be surprised to see these characters included in the range.
Similarly, people not interested in the (ancient) history of computing might be surprised by a collation order that sorts "a" before "Z", since that is not what dictionaries and similar works have been doing for centuries.
However, the situation would be (slightly) improved if legacy character sets went the way of the dodo. And in the context of Linux, everything except UTF-8 should be considered a legacy character set (except maybe for East Asian languages, I am a bit hazy on that point).


