bug-gnu-libiconv

Re: [bug-gnu-libiconv] iconv fails on large Greek files


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] iconv fails on large Greek files
Date: Sun, 02 Oct 2022 00:04:34 +0200

Hello Wesley,

> The failure also occurs when the file does have known decomposed characters.
> 
> WGroleau@MBP ~ % iconv --version
> iconv (GNU libiconv 1.16)
> WGroleau@MBP ~ % uname -a
> Darwin MBP.local 21.6.0 Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 
> 2022; root:xnu-8020.140.49~2/RELEASE_X86_64 x86_64
> WGroleau@MBP el % wc el.txt                                                 
>      179     975    8621 el.txt
> WGroleau@MBP el % iconv -f UTF8-MAC -t UTF-8 el.txt > /tmp/tmp              
> 
> iconv: el.txt:90:16: cannot convert
> WGroleau@MBP el % wc /tmp/tmp
>       89     457    4093 /tmp/tmp
> WGroleau@MBP el % iconv -f UTF-8 -t UTF8-MAC el.txt > /tmp/tmp
> WGroleau@MBP el % wc /tmp/tmp
>      179    1029    9537 /tmp/tmp
> WGroleau@MBP el % iconv -f UTF8-MAC -t UTF-8 el.txt > /tmp/tmp
> 
> iconv: el.txt:90:16: cannot convert
> WGroleau@MBP el % wc el.txt
>      179     975    8621 el.txt
> WGroleau@MBP el % tail -$((179-90+2)) el.txt > el+.txt
> WGroleau@MBP el % wc el+.txt
>       90     522    4558 el+.txt
> WGroleau@MBP el % iconv -f UTF8-MAC -t UTF-8 el+.txt > /tmp/tmp
> 
> iconv: el+.txt:84:36: cannot convert
> WGroleau@MBP el % wc /tmp/tmp
>       83     469    4093 /tmp/tmp
> WGroleau@MBP el % iconv -f UTF-8 -t UTF8-MAC el.txt > /tmp/tmp
> WGroleau@MBP el % iconv -f UTF8-MAC -t UTF-8 /tmp/tmp > temp.txt
> 
> iconv: /tmp/tmp:161:7: cannot convert
> WGroleau@MBP el % wc temp.txt
>      160     835    7390 temp.txt
> WGroleau@MBP el % wc /tmp/tmp
>      179    1029    9537 /tmp/tmp

Apparently the failures occur only when you use the 'UTF8-MAC'
encoding. In that case you need to report them to Apple, because
GNU libiconv does not have this encoding; it was added by Apple in
the macOS version of GNU libiconv.

> The failure usually occurs after processing APPROX. 4000 bytes,
> but occasionally approx. 8000.

When I decided to not integrate Apple's code upstream, it was because
  * UTF8-MAC is a workaround for Apple's misdesign decisions: Although
    the W3C says that decomposed Unicode should not be user-visible,
    Apple made it user-visible in HFS+. They ought to have hidden it
    in their file system routines instead.
  * The code that Apple added to GNU libiconv looked buggy to me. I am
    not surprised at all that you have succeeded in finding a reproducer
    for these bugs. Probably you are the first one because most people
    use iconv in this way only to convert file names, and file names are
    smaller than 4000 bytes.

As a workaround, you can use 'uconv -x NFC', where uconv is a program
that is part of ICU.
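
If uconv is not at hand, the same kind of recomposition can be
sketched in Python with the standard unicodedata module. This is only
a sketch, not a replacement for UTF8-MAC: it assumes the input is
valid UTF-8 and that plain NFC is what you want (the decomposition
used by HFS+ is a variant of NFD, so the two need not agree on every
character). The names to_nfc and convert_file are made up for
illustration.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Recompose decomposed sequences (Unicode Normalization Form C)."""
    return unicodedata.normalize("NFC", text)


def convert_file(src_path: str, dst_path: str) -> None:
    """Write an NFC-normalized copy of a UTF-8 text file (hypothetical helper)."""
    # Read the whole file at once, so that no base character is
    # separated from its combining marks by a buffer boundary.
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding="utf-8", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(to_nfc(text))
```

For the file above this would be called as
convert_file("el.txt", "/tmp/tmp"). Reading the file in one piece is
acceptable for inputs of a few kilobytes like this one; it is not
meant for very large files.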

Best regards,

Bruno





