bug-bash

Re: Unicode range and enumeration support.


From: Greg Wooledge
Subject: Re: Unicode range and enumeration support.
Date: Wed, 18 Dec 2019 14:46:51 -0500
User-agent: Mutt/1.10.1 (2018-07-13)

On Wed, Dec 18, 2019 at 11:15:46AM -0800, L A Walsh wrote:
> On 2019/12/16 08:39, Greg Wooledge wrote:
> > The problem is, it is *not possible* to extract the set of characters
> > out of an arbitrary locale.  The locale interfaces simply are not built
> > to allow it.
> > 
> > You can do it in the C locale, simply because the C locale is a known,
> > fixed quantity that you can hard-code.  You can't do it in any other locale.

>    You can do it in Perl, JavaScript, Python, Ruby, C, C++ among others,
> [...]
>     \p{L} or \p{Letter}: any kind of letter from any language.
>     \p{Ll} or \p{Lowercase_Letter}: a lowercase letter
> that has an uppercase variant.

You misunderstood me, or perhaps I wasn't clear enough.

I agree that if you are GIVEN a character as input, you can determine
whether that character is a letter, or a lowercase letter (etc.) in
the current locale.

What you CANNOT do[1] is GENERATE all of the lowercase letters (etc.) in
the current locale.

To put it another way: you can write code that determines whether
an input character $c matches a glob or regex like [Z-a].  (Maybe.)

But, you CANNOT write code to generate all of the characters from Z to a.

Since this thread is about brace expansion, which must generate
characters, the feature you're looking for is simply impossible, to
the best of my knowledge.  (I'd be delighted for you to prove me
wrong.  Show me how to generate all of the :alpha: characters in the
en_US.utf8 locale in perl, or python, or any other language.)

[1] The only way I know to get that information would be to take as input
*every conceivable character*, and, one by one, check whether each
of those characters matches the :alpha: class.  Such a brute force
solution is not in the spirit of the mission.  As such, I'll save you
the time and do that part myself.

wooledg:~$ for ((i=1; i<=200; i++)); do printf -v tmp %04x "$i"; printf -v c "\\u$tmp"; if [[ $c = [[:alpha:]] ]]; then printf %s "$c"; fi; done; echo
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈ

Obviously I did not use *every conceivable character* as input -- just
a couple hundred, a completely arbitrary cut-off point, because this is
just a proof of concept.  Trawling the entire Unicode code point space
is left as an adventure for braver souls than mine.  As is comparing
the different locales on a system, or the same locale between different
operating systems.
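
For anyone who wants to take up that adventure, the brute-force loop above can be wrapped in a small function. This is only a sketch: the name enumerate_class and its three arguments are made up for illustration, and it swaps \u with 4 hex digits for \U with 8 hex digits so that code points above U+FFFF would not be truncated. It still only tests the code points you hand it, so it does not escape the limitation described in [1].

```shell
#!/usr/bin/env bash
# enumerate_class CLASS FIRST LAST   (hypothetical helper, not part of bash)
# Print every character whose code point lies in FIRST..LAST (decimal)
# and which matches the POSIX class [[:CLASS:]] in the current locale.
enumerate_class() {
    local class=$1 first=$2 last=$3
    local i hex c out=
    local pattern="[[:$class:]]"

    for ((i = first; i <= last; i++)); do
        # Build the character from its code point; \U takes 8 hex digits.
        printf -v hex '%08x' "$i"
        printf -v c "\\U$hex"
        # Unquoted right-hand side, so $pattern is treated as a glob.
        if [[ $c = $pattern ]]; then
            out+=$c
        fi
    done
    printf '%s\n' "$out"
}
```

Called as "enumerate_class alpha 1 200" in a UTF-8 locale, this should cover the same ground as the one-liner above, though what it prints depends on the locale and the operating system.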

Sorting these characters is also possible, once they have been generated.
This is (I think!) what allows things like [Z-a] to work at all: you
can check whether $c is >= 'Z' and <= 'a', without knowing what all of
the characters in between are.  But you can't ask "what comes after Z".

wooledg:~$ for ((i=1; i<=200; i++)); do printf -v tmp %04x "$i"; printf -v c "\\u$tmp"; if [[ $c = [[:alpha:]] ]]; then printf %s\\n "$c"; fi; done | sort | tr -d \\n; echo
aAªÁÀÂÅÄÃÆbBcCÇdDeEÈfFgGhHiIjJkKlLmMnNoOºpPqQrRsStTuUvVwWxXyYzZµ

Again, this is only PART of the set, and is not intended to be a
complete enumeration of the :alpha: characters in my system's locale.
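
The "check whether $c is >= 'Z' and <= 'a'" idea mentioned earlier can be sketched in bash as well. Inside [[ ]], the < and > operators compare strings using the current locale, so a range test needs only the two endpoints and never the characters in between. The function name in_range is made up for illustration, and this is not a claim about how bash's own pattern matcher handles [Z-a] internally.

```shell
#!/usr/bin/env bash
# in_range CHAR LOW HIGH   (hypothetical helper, not part of bash)
# Succeed if CHAR collates at or after LOW and at or before HIGH in the
# current locale. Only the endpoints are needed; nothing is enumerated.
in_range() {
    local c=$1 lo=$2 hi=$3
    # "not less than LOW" and "not greater than HIGH"
    ! [[ $c < "$lo" ]] && ! [[ $c > "$hi" ]]
}

# Example use:
#   if in_range "$c" Z a; then echo "inside"; else echo "outside"; fi
```

Note that this answers "is $c between Z and a?" but still gives no way to ask "what comes after Z?", which is what brace expansion would need.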


