fileutils/textutils LC_COLLATE support

bug-fileutils

From:	Corin Hartland-Swann
Subject:	fileutils/textutils LC_COLLATE support
Date:	Thu, 11 Oct 2001 18:57:28 +0100 (BST)

Hi there,

I am sending this to you directly in addition to the bug reporting
addresses since you seem to be the maintainers of the fileutils and
textutils packages. I sent an e-mail about a smaller facet of the same
problem to address@hidden on 7-Dec-2000, but never received
a response.

I am using Linux Mandrake 8.1, with glibc 2.2.4, fileutils 4.1 and
textutils 2.0.14 with the ISO 8859-1 character set (en_GB).

I noticed that ls(1) and sort(1) were not ordering things the way I
expected:

  # touch X ab a-c
  # ls
  ab
  a-c
  X

  # cat > z
  X
  ab
  a-c
  ^D
  # sort < z
  ab
  a-c
  X

To summarise, both ls(1) and sort(1) are ignoring dashes and treating
upper and lower-case characters as equivalent.
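
For comparison, the same order can be reproduced under the C locale by
asking sort(1) to skip punctuation and fold case explicitly. This is only a
rough sketch of the visible effect, not a description of how glibc's
LC_COLLATE tables work; the function name dictsort is made up for
illustration.

```shell
# Hypothetical helper: emulate the observed ordering with explicit options.
# -d compares only blanks and alphanumerics (so the dash in "a-c" is skipped),
# -f folds lower case to upper case before comparing.
dictsort() {
    LC_ALL=C sort -d -f "$@"
}

printf 'X\nab\na-c\n' | dictsort
# prints:
#   ab
#   a-c
#   X
```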

I have a number of programs that are designed based on the assumption that
sort(1) will sort things by byte ordering, and I'm sure a lot of other
people have similarly-dependent programs. I also expect ls(1) to list
files beginning with upper-case first letters before those beginning with
lower-case letters, and I'm sure a lot of other people are used to that too
(AFAIK it's always been done that way on UNIX, for as long as there _were_
lower-case letters :).

I have discussed this at length with the Mandrake developers, and they
have told me that the change in behaviour is due to advances made in glibc
with respect to LC_COLLATE handling under ISO 8859-1, and that this form
of collation is intended to be more logical for people to read.

I agree that it is more logical, but I do not think it should be the
default behaviour for sort(1) and ls(1).

Although this currently only affects Mandrake v8.1 (AFAIK), as more
cautious distributions adopt more recent versions of glibc more and more
people are going to experience this. It can be kludged by exporting the
environment variable LC_COLLATE=POSIX, but that prevents collation from
working at all.
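
The workaround can also be limited to a single command instead of being
exported for the whole session. A minimal sketch, assuming a POSIX shell;
bytesort is a hypothetical helper name, and LC_ALL is set rather than
LC_COLLATE because LC_ALL takes precedence when both are present.

```shell
# Hypothetical helper: sort by raw byte values for this one invocation,
# leaving the caller's locale settings untouched.
bytesort() {
    LC_ALL=C sort "$@"
}

printf 'X\nab\na-c\n' | bytesort
# prints:
#   X
#   a-c
#   ab
```

Scripts that depend on byte ordering could call such a wrapper without
changing how anything else in the session collates.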

What I think would be the best solution is if ls(1) and sort(1) (and
possibly other programs in textutils) were designed to sort by
byte-ordering by default, and were given an option to use the locale-based
collation. The existing options for sort(1) include:

       Ordering options:

       -b, --ignore-leading-blanks ignore leading blanks

       -d, --dictionary-order
              consider only blanks and alphanumeric characters

       -f, --ignore-case
              fold lower case to upper case characters

       -g, --general-numeric-sort
              compare according to general numerical value

       -i, --ignore-nonprinting
              consider only printable characters

       -M, --month-sort
              compare (unknown) < `JAN' < ... < `DEC'

       -n, --numeric-sort
              compare according to string numerical value

       -r, --reverse
              reverse the result of comparisons

All of these suggest that the intent behind the sort program is to do byte
ordering unless otherwise directed. The --ignore-case option, for instance,
is now meaningless under ISO 8859-1 because LC_COLLATE already makes
upper-case and lower-case letters equivalent.

The man page for sort(1) states:

       ***  WARNING  ***  The locale specified by the environment
       affects sort order.  Set LC_ALL=C to get  the  traditional
       sort order that uses native byte values.

But this had not previously been apparent because LC_COLLATE did not work
properly. I realise that I can fix it by exporting LC_COLLATE=POSIX, but
I'm sure I'm not the only one who has assumed that byte ordering would
remain the default action.

Would you consider adding additional options, for instance:

       -l, --use-locale
              use the ordering specified by the current locale in
              LC_COLLATE instead of byte ordering

And returning the default behaviour to byte ordering?

Similarly the ls(1) man page states that the default action is to sort
file names alphabetically, and makes no mention of locales.

I believe that making byte ordering the default is the right thing to do
because it preserves the existing and expected behaviour, while still
allowing users to request locale-based collation if they want it. I think
that locale-based collation is something that should be specified
explicitly.

Many Thanks,

Corin

/------------------------+-------------------------------------\
| Corin Hartland-Swann   |    Tel: +44 (0) 20 7491 2000        |
| Commerce Internet Ltd  |    Fax: +44 (0) 20 7491 2010        |
| 22 Cavendish Buildings | Mobile: +44 (0) 79 5854 0027        | 
| Gilbert Street         |                                     |
| Mayfair                |    Web: http://www.commerce.uk.net/ |
| London W1K 5HJ         | E-Mail: address@hidden        |
\------------------------+-------------------------------------/
