bug-bash

Re: Unicode range and enumeration support.


From: Eli Schwartz
Subject: Re: Unicode range and enumeration support.
Date: Wed, 18 Dec 2019 15:08:20 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.3.0

On 12/18/19 2:46 PM, Greg Wooledge wrote:
> Sorting these characters is also possible, once they have been generated.
> This is (I think!) what allows things like [Z-a] to work at all: you
> can check whether $c is >= 'Z' and <= 'a', without knowing what all of
> the characters in between are.  But you can't ask "what comes after Z".
> 
> wooledg:~$ for ((i=1; i<=200; i++)); do printf -v tmp %04x "$i"; printf -v c "\\u$tmp"; if [[ $c = [[:alpha:]] ]]; then printf %s\\n "$c"; fi; done | sort | tr -d \\n; echo
> aAªÁÀÂÅÄÃÆbBcCÇdDeEÈfFgGhHiIjJkKlLmMnNoOºpPqQrRsStTuUvVwWxXyYzZµ
> 
> Again, this is only PART of the set, and is not intended to be a
> complete enumeration of the :alpha: characters in my system's locale.

There's no need to sort ASCII characters, though, since the collation
order of [A-z] in the C locale is defined by their numeric codepoint
order. That guarantee does not carry over to other locales.

So all bash needs to do to print {Z..a} is to take Z == ASCII decimal 90
and a == ASCII decimal 97, then enumerate the numbers 90-97 and
translate them back into ASCII characters. No locale awareness is
needed, no heuristics, and no invocation of the locale subsystem; you
don't even need to hardcode the ASCII range in the source code.
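
To illustrate the idea, here is a rough sketch in shell of enumerating by
codepoint alone. ascii_range is a hypothetical helper made up for this
example, not anything bash provides, and it is not a claim about how
bash implements brace expansion internally. It assumes both endpoints
are single ASCII characters.

```shell
# Hypothetical helper: print every character between two ASCII endpoints,
# walking the numeric codepoints and never consulting collation order.
# The body runs in a subshell so its variables do not leak to the caller.
ascii_range() (
    # A leading single quote in a numeric printf argument yields the
    # numeric value of the character that follows it.
    start=$(printf '%d' "'$1")
    end=$(printf '%d' "'$2")
    i=$start
    while [ "$i" -le "$end" ]; do
        # Turn the number back into a character via an octal escape.
        printf "\\$(printf '%03o' "$i")"
        i=$((i + 1))
    done
    printf '\n'
)

ascii_range Z a    # prints Z[\]^_`a
```

The loop only ever compares and increments integers, which is why the
result does not depend on LC_COLLATE.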

And that's why bash can support enumerating a range of ASCII characters
in LC_COLLATE=C order, when it cannot (easily) do so using other locales.

-- 
Eli Schwartz
Arch Linux Bug Wrangler and Trusted User
