Re: Unicode range and enumeration support.

From: Greg Wooledge
Subject: Re: Unicode range and enumeration support.
Date: Mon, 23 Dec 2019 08:20:49 -0500
User-agent: Mutt/1.10.1 (2018-07-13)

On Fri, Dec 20, 2019 at 04:35:05PM -0800, L A Walsh wrote:
> On 2019/12/18 11:46, Greg Wooledge wrote:
> > To put it another way: you can write code that determines whether
> > an input character $c matches a glob or regex like [Z-a].  (Maybe.)
> > 
> > But, you CANNOT write code to generate all of the characters from Z to a
> This generates characters from decimal 8300 - 8400 (because that range
> includes raised and lowered digits which have the number and value
> properties equivalent to 0-9).
> ----
> No? 8300, 8400 arbitrary code points that contain raised and lowered numbers
> that have the number property (as does 0..9):
> perl -we' use strict; use v5.16;
> my $c;
> for ($c=8300;$c<8400;++$c) {

As I said in the previous message, a brute force solution that enumerates
the ENTIRE Unicode code point space is not a valid answer.  I even gave
a bash program that does something very similar to your perl program,
just using a different segment of the code point space.

Given that both of us are capable of generating such a brute force
solution, how did you INTEND to use that solution to solve the actual
problem, which is "in an arbitrary locale, list all of the characters
from $start to $end in collating order"?

You can't simply translate $start and $end to single Unicode code point
values, enumerate the Unicode characters between those two points,
and translate those characters back to the user's locale.  That doesn't
give you the correct answer.  There will be extra characters in the
Unicode code point range that don't fit the solution, AND there will
be characters outside the Unicode code point range that SHOULD be in
the solution, but are missed.
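
A small shell sketch of that mismatch, under stated assumptions: the
helper names codepoint_range and collates_between are made up for
illustration, the endpoints are single-byte ASCII characters, and
"sort -C" stands in as the collation oracle because it honors
LC_COLLATE.  In the C locale the two notions of "range" agree; in other
locales they may well not.

```shell
# Sketch only.  codepoint_range and collates_between are hypothetical names.

# Print every character whose code point lies between those of $1 and $2,
# one per line.  Assumes single-byte (ASCII) endpoints.
codepoint_range() {
    i=$(printf '%d' "'$1")
    hi=$(printf '%d' "'$2")
    while [ "$i" -le "$hi" ]; do
        # %b expands \0ooo into the byte with that octal value.
        printf '%b\n' "\\0$(printf '%o' "$i")"
        i=$((i + 1))
    done
}

# Succeed if $1 sorts at or after $2 and at or before $3 in the current
# locale.  "sort -C" silently checks that its input is already in order.
collates_between() {
    printf '%s\n' "$2" "$1" "$3" | sort -C
}

# For each member of the code point range Z..a, ask the locale whether it
# also lies between Z and a in collating order.
codepoint_range Z a | while IFS= read -r ch; do
    if collates_between "$ch" Z a; then
        printf '%s  inside the collation range\n' "$ch"
    else
        printf '%s  OUTSIDE the collation range\n' "$ch"
    fi
done
```

Running it once with LC_ALL=C exported and once under your usual locale
shows whether the two ranges line up on your system.  Note that it can
only expose the "extra characters" half of the problem; characters that
collate between Z and a but have code points elsewhere never enter the
loop at all.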

The only way to do it is to iterate over the ENTIRE code point space,
all 1,114,112 code points from U+0000 through U+10FFFF.
Translate each of those million-plus characters back into the user's
locale, check whether that character sorts after $start, check whether
that character sorts before $end, and include/exclude it from the
solution set.  Then, when you have the solution set, sort it one final
time to get it in order.
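
The steps above could be sketched in shell roughly as follows.  This is
an illustration of the cost, not a proposal.  The function names
(utf8_char, all_codepoints, collation_range) are hypothetical, and the
sketch assumes a UTF-8 locale so that each code point can be encoded
directly instead of being converted into the user's charset.  The code
point bounds are arguments so that it can be tried on a small slice.

```shell
# Sketch only.  All function names here are hypothetical.

# Print code point $1 (decimal) as UTF-8 bytes.
utf8_char() {
    if [ "$1" -lt 128 ]; then
        set -- "$1"
    elif [ "$1" -lt 2048 ]; then
        set -- $((192 | ($1 >> 6))) $((128 | ($1 & 63)))
    elif [ "$1" -lt 65536 ]; then
        set -- $((224 | ($1 >> 12))) $((128 | (($1 >> 6) & 63))) \
               $((128 | ($1 & 63)))
    else
        set -- $((240 | ($1 >> 18))) $((128 | (($1 >> 12) & 63))) \
               $((128 | (($1 >> 6) & 63))) $((128 | ($1 & 63)))
    fi
    for b do
        # %b expands \0ooo into the byte with that octal value.
        printf '%b' "\\0$(printf '%o' "$b")"
    done
}

# Print one candidate character per line for code points $1..$2.
all_codepoints() {
    cp=$1
    while [ "$cp" -le "$2" ]; do
        # Skip NUL and newline, which break line-based tools, and the
        # surrogate range U+D800..U+DFFF, which holds no characters.
        if [ "$cp" -ne 0 ] && [ "$cp" -ne 10 ] &&
           { [ "$cp" -lt 55296 ] || [ "$cp" -gt 57343 ]; }; then
            utf8_char "$cp"
            echo
        fi
        cp=$((cp + 1))
    done
}

# Read candidates on stdin; keep those that sort between $1 and $2 in the
# current locale, then sort the survivors one final time.
collation_range() {
    while IFS= read -r ch; do
        if printf '%s\n' "$1" "$ch" "$2" | sort -C; then
            printf '%s\n' "$ch"
        fi
    done | sort
}

# Small slice only: printable ASCII.
all_codepoints 32 126 | collation_range Z a
```

The full brute force would be "all_codepoints 0 1114111 | collation_range
Z a", which starts a sort process for every one of more than a million
candidates.  A C implementation calling strcoll() directly would be much
faster, but it would still have to visit every code point.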

Is that what you are proposing bash should do, in order to get a working
brace expansion outside of the C locale?  I don't believe this is an
acceptable solution.
