Re: UTF-8 printf string formating problem
From: Pádraig Brady
Subject: Re: UTF-8 printf string formating problem
Date: Mon, 07 Apr 2014 14:14:28 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2
On 04/06/2014 12:56 PM, Dan Douglas wrote:
> On Sunday, April 06, 2014 01:24:58 PM Jan Novak wrote:
>> To solve this problem I suppose to add "wide" switch to printf
>> or to add "%S" format (similarly to wprintf(3) )
>
> ksh93 already has this feature using the "L" modifier:
>
> ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
> ★★★
> bash -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
> ★
>
> Also, zsh does this by default with no special option. I tend to lean towards
> going by character anyway because that's what most shell features such as
> "read -N" do, and most work directly involving the shell is with text not
> binary data.
So we can count bytes, characters, or cells (the display columns that graphemes occupy).
Thinking about it a bit more, I think shell-level printf
should deal in text in the current encoding and count cells.
In the edge case where you want to deal in bytes, you can use:
LC_ALL=C printf ...
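As a rough illustration of the difference between counting bytes and counting characters, here is a minimal sketch. The helper names count_bytes and count_chars are made up for this example, and count_chars assumes a UTF-8 locale named C.UTF-8 is installed, which is not the case on every system.

```shell
# Hypothetical helpers for illustration only; they are not part of any shell.

count_bytes() {
    # In the C locale every byte counts as one unit, so wc -c reports
    # the encoded length regardless of the text's encoding.
    printf '%s' "$1" | LC_ALL=C wc -c | tr -d ' '
}

count_chars() {
    # wc -m counts characters as decoded in the locale it runs under.
    # Assumption: a locale called C.UTF-8 exists; if it does not,
    # wc falls back to the C locale and reports bytes instead.
    printf '%s' "$1" | LC_ALL=C.UTF-8 wc -m | tr -d ' '
}

# 'A' followed by three U+2605 stars, written as UTF-8 octal escapes
# (E2 98 85) so the script does not depend on the file's own encoding.
s=$(printf 'A\342\230\205\342\230\205\342\230\205')

count_bytes "$s"    # prints 10: one ASCII byte plus three 3-byte sequences
count_chars "$s"
```

Where the assumed locale is available, count_chars should report 4 for the same string. Neither helper says anything about cells; that needs width information along the lines of wcwidth(3), which plain POSIX shell tools do not expose.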
I see that ksh behaves as I would expect and counts cells,
though it requires the explicit L modifier:
$ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
á★★
$ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
A★
$ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
A
zsh seems to just count characters:
$ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
á★
$ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
á★
$ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
A★★
GNU awk seems to just count characters:
$ awk 'BEGIN{printf "%.3s\n", "A★★★"}'
A★★
I see that dash reports an invalid directive for each of %ls, %Ls and %S.
Pity there is no consensus here.
Personally I would go for:
printf '%3s' 'blah' # count cells
printf '%3Ls' 'blah' # count chars
LANG=C printf '%3Ls' 'blah' # count bytes
LANG=C printf '%3s' 'blah' # count bytes
Pádraig.