

Re: char-class rules & please show examples of int. locales that use dif

From: L A Walsh
Subject: Re: char-class rules & please show examples of int. locales that use diff. char-class rules
Date: Thu, 15 Jun 2017 12:04:19 -0700
User-agent: Thunderbird

Chet Ramey wrote:
On 6/14/17 5:52 PM, L A Walsh wrote:

Chet Ramey wrote:
But people don't work in Unicode. They work in their own locale.
   Ok, I have my locale set to latin9 (iso8859-15).
Attached is a pic ...

I am less concerned with how glyphs display in a console than how they
are interpreted as alphanumerics given the C library interfaces that
determine those character classes.


   In Unicode, how those characters are classified is _documented_.
(A Categorization of Unicode Characters <http://unicode.org/notes/tn36/>,
The Unicode Character Property Model <http://unicode.org/reports/tr23/> ).
Two problems with locale-based rules are:

  1) they differ based on local convention, potentially,
even down to what "side of the street" you live on, and

  2) they don't account or allow for "data" (textual) outside
of a given locale.  For companies connected by an internet with
international customers, having a non-uniform standard is a
serious problem at best, and unworkable in practice.
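To illustrate what "documented" classification means in practice, here is a minimal sketch in Python (chosen only for illustration; it is not part of the original message). It reads the General_Category property from the Unicode Character Database via the standard `unicodedata` module, so the answer does not depend on LANG or LC_CTYPE. The helper name `classify` is hypothetical.

```python
import unicodedata

def classify(ch):
    """Return (general category, is_letter, is_decimal_digit) for one character."""
    cat = unicodedata.category(ch)          # e.g. "Ll", "Lu", "Nd", "Sc"
    return cat, cat.startswith("L"), cat == "Nd"

# A few characters from different scripts; the result is the same in any locale.
for ch in ["a", "\u00e9", "\u03b1", "\u0416", "\u0663", "\u20ac"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), classify(ch))
```

The Greek, Cyrillic and Latin letters all fall in an "L" category, the Arabic-Indic digit in "Nd", and the euro sign in "Sc", regardless of which locale the process runs in.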

   When Unicode first started becoming available in locales, many
people found themselves uncomfortably shoehorned into a locale that
wasn't their own.  I'd been using the 'C' locale for most things, but
with a CTYPE of UTF-8.  POSIX declared that C.UTF-8 wasn't a
valid locale (even though that's what many people wanted, and had
been using).  Next came en_US.iso8859-1 with some crap 8-bit
encoding that disallowed nearly all imported word and concept use.
While typical of POSIX, it was deemed unworkable and replaced with
en_US.UTF-8 in the next release.  en_US.UTF-8 is now the default
on all major Linux distros.

Users expect alphanumerics in their own locale to be allowable as
alphanumerics, and characters that are not to not be allowable as
alphanumerics, and they expect that to be consistent across programs
running on the same system using the same locale. That's the basis to use
for determining whether or not a character is a valid part of a shell

   They are.  For an in-depth discussion of the subtleties of property
names, see perlunicode (https://perldoc.perl.org/perlunicode.html) and
perlrecharclass (https://perldoc.perl.org/perlrecharclass.html).

The latter shows differences between classes that only
match ASCII or POSIX range chars vs. properties that match all
letters or numeric digits.
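The same kind of distinction can be sketched outside Perl. The example below uses Python's `re` module as a stand-in (an assumption made for illustration, not what perlrecharclass itself documents): with the `re.ASCII` flag, `\w` matches only `[a-zA-Z0-9_]`, while without it `\w` follows Unicode for `str` patterns. The helper name `is_word` is hypothetical.

```python
import re

ASCII_WORD = re.compile(r"\w+", re.ASCII)   # \w restricted to [a-zA-Z0-9_]
UNICODE_WORD = re.compile(r"\w+")           # \w follows Unicode for str patterns

def is_word(text, ascii_only):
    """True if every character of text is a word character under the chosen rule."""
    pattern = ASCII_WORD if ascii_only else UNICODE_WORD
    return pattern.fullmatch(text) is not None

for text in ["abc_1", "caf\u00e9", "\u03b1\u03b2\u03b3"]:
    print(repr(text), is_word(text, ascii_only=True), is_word(text, ascii_only=False))
```

Only the pure-ASCII string passes both checks; the accented and Greek strings pass only the Unicode-aware one.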

A character that is classified as an alphanumeric in a particular locale,
but not in another, can lead to portability problems. That's what we're
debating here, not how something gets displayed in a text editor.

   That's already a problem: if I try to use a letter from
the Greek alphabet in a var name, it doesn't work.  The
current code doesn't recognize letters outside some limited
POSIX-defined range.  That's very constraining.
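A rough sketch of the difference being argued about, written in Python for illustration only. Neither function is bash's actual code: `ascii_name_ok` is a hypothetical stand-in for an ASCII-only identifier rule, and `unicode_name_ok` is a hypothetical rule built on Unicode general categories (letters or `_` first, then letters, decimal digits, or `_`).

```python
import re
import unicodedata

# Hypothetical ASCII-only rule for a variable name.
ASCII_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def ascii_name_ok(name):
    return ASCII_NAME.fullmatch(name) is not None

def _is_letter(ch):
    # Any Unicode letter category (Lu, Ll, Lt, Lm, Lo) or underscore.
    return ch == "_" or unicodedata.category(ch).startswith("L")

def _is_digit(ch):
    # Unicode decimal digits only.
    return unicodedata.category(ch) == "Nd"

def unicode_name_ok(name):
    """Hypothetical Unicode-aware rule; the empty string is rejected."""
    if not name:
        return False
    return _is_letter(name[0]) and all(_is_letter(c) or _is_digit(c) for c in name[1:])

for name in ["delta", "\u0394x", "1x", "a-b"]:
    print(repr(name), ascii_name_ok(name), unicode_name_ok(name))
```

Under these assumptions a name starting with GREEK CAPITAL LETTER DELTA is rejected by the ASCII rule and accepted by the Unicode rule, and the verdict does not change with the locale.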

How is having UTF-8 for files and text not showing

Look at the issue with different locales classifying
characters as alphanumerics differently, and how that would impact
variable names incorporating locale-specific characters in `portable'

   Can you give an example?  AFAIK, most locales that allow
international letters are already using the Unicode definitions.  I
don't know of any locale that supports internationalized characters
that doesn't use the Unicode rules.

   Do you have an example of different internationalized locales
that use different character-class rules?  I don't know of any.

