bug-bash

Unicode range and enumeration support.


From: L A Walsh
Subject: Unicode range and enumeration support.
Date: Wed, 18 Dec 2019 11:15:46 -0800
User-agent: Thunderbird

On 2019/12/16 08:39, Greg Wooledge wrote:
On Sat, Dec 14, 2019 at 02:48:16AM -0800, L A Walsh wrote:
On 2019/12/13 10:42, Greg Wooledge wrote:
There's a larger issue to be addressed first.  The man page says,
    [...]
    sary.  When characters are supplied, the  expression  expands  to  each
    character  lexicographically  between x and y, inclusive, using the de‐
    fault C locale.

----
   If it says letters, that lends stronger support to including
Unicode ranges of letters and numbers, since the shell handles Unicode and
brace expansion with Unicode filenames works just fine.  That ranges don't
seems like a bit of a wart.

No, it won't include Unicode, because it very clearly says "C locale"
right up there.
----
At one point in time, Bash only supported the C locale for display and input.
That isn't the case in the current Bash.  Just because something wasn't so in
the past doesn't mean things can't or won't change in the future.  If that were
true, we wouldn't have computers.
The problem is, it is *not possible* to extract the set of characters
out of an arbitrary locale.  The locale interfaces simply are not built
to allow it.

You can do it in the C locale, simply because the C locale is a known,
fixed quantity that you can hard-code.  You can't do it in any other locale.
----
   You can do it in Perl, JavaScript, Python, Ruby, C, and C++, among others,
where range matching can identify characters of a specific type out of
arbitrary locales.  For example (from
https://www.regular-expressions.info/unicode.html):


    \p{L} or \p{Letter}: any kind of letter from any language.
    \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an
        uppercase variant.
    \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a
        lowercase variant.
    ...
    \p{Math_Symbol}: any mathematical symbol.
    \p{N} or \p{Number}: any kind of numeric character in any script.
    \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any
        script except ideographic scripts.


   Those can be cross-sectioned with script-name properties from any
script in Unicode (Common, Arabic, Braille, Cherokee, Devanagari...Thai,
Tibetan, Yi).  The list of supported scripts is very extensive.  Tables are
published in machine-readable form and are used to build support for
range matching and enumeration over a huge number of characters.
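
   As a rough sketch of what table-driven enumeration can look like, here
is a small Python example using the standard unicodedata module, which
wraps the published Unicode tables.  The helper name chars_in_range and
its category_prefix parameter are made up for illustration; this shows
the idea only and is not a proposal for how Bash would implement it.

```python
import unicodedata


def chars_in_range(first, last, category_prefix=""):
    """Return the characters from first to last (inclusive, by code point)
    whose Unicode general category starts with category_prefix.

    An empty prefix keeps every code point in the range.
    """
    return [
        chr(cp)
        for cp in range(ord(first), ord(last) + 1)
        if unicodedata.category(chr(cp)).startswith(category_prefix)
    ]


if __name__ == "__main__":
    # Lowercase Greek letters, alpha through omega ("Ll" = Lowercase_Letter).
    print(" ".join(chars_in_range("α", "ω", "Ll")))
    # Only the uppercase letters found between "A" and "z".
    print(" ".join(chars_in_range("A", "z", "Lu")))
```

   Filtering by general category is what lets a range like A..z skip the
punctuation that sits between the two ASCII letter blocks.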

   I.e., you can do it in pretty much any locale supported by Unicode, not
just the C locale.  I can't begin to list all the references for this,
but just googling on:

"programming language support for ranges of numbers or alphabets in
unicode"

will show a huge number of references.

Such features could be put in one or more loadable modules, or made
"includable" at build time to manage memory if desired/needed.

   OTOH, I already said that if one didn't want to do typed ranges, one could
follow the easier path (I think) and allow any arbitrary Unicode range to be
enumerated while ensuring quoting of ASCII-range metacharacters.
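
   A minimal sketch of that easier path, again in Python: walk the code
points between the two endpoints and shell-quote only the ASCII ones that
need it.  The function name expand_range is hypothetical, and using
shlex.quote to decide what counts as a metacharacter is an assumption
made for the example; Bash's own quoting rules may differ.

```python
import shlex


def expand_range(first, last):
    """Enumerate every code point from first to last (inclusive) and
    return one shell word per character.

    ASCII characters go through shlex.quote, so shell metacharacters come
    back single-quoted; characters above ASCII are passed through as-is.
    """
    words = []
    for cp in range(ord(first), ord(last) + 1):
        ch = chr(cp)
        words.append(shlex.quote(ch) if cp < 128 else ch)
    return words


if __name__ == "__main__":
    # An ASCII range that crosses punctuation between "Z" and "a".
    print(" ".join(expand_range("Z", "a")))
    # A non-ASCII range needs no quoting at all.
    print(" ".join(expand_range("α", "ε")))
```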










