bug#54893: guix-daemon, locale, LANG, and unicode in git tag names
From: Maxime Devos
Subject: bug#54893: guix-daemon, locale, LANG, and unicode in git tag names
Date: Wed, 13 Apr 2022 10:22:30 +0200
User-agent: Evolution 3.38.3-1
Attila Lendvai schreef op wo 13-04-2022 om 07:51 [+0000]:
> i'm not sure why the wrong locale breaks file-system walking and deleting,
> though.
>
> i assume if every function in guile uses/assumes the same locale (character
> encoding), then both directions through the guile FFI should be idempotent,
> no?
> and i think both ASCII and UTF-8 are idempotent wrt C bytes <-> scheme string
> conversions.
The problem is that the default character encoding is ANSI_X3.4-1968
(US-ASCII), and any byte above 127 makes the data non-ASCII.
Also, the string procedures internally always use UTF-8 (or possibly
ISO-8859-1 as an optimisation?). Strings are not raw bytes; instead they
can be considered vectors of characters (string-ref returns
characters, not bytes, and doesn't use byte positions).
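To make the character/byte distinction concrete, here is a small sketch using only `string-length`, `string-ref` and `string->utf8` from `(rnrs bytevectors)`. The names `e-acute` and `utf8-byte-count` are made up for illustration.

```scheme
(use-modules (rnrs bytevectors))

;; "é" built from its code point (U+00E9), so the example does not
;; depend on how this text itself happens to be encoded.
(define e-acute (string (integer->char #xE9)))

;; Hypothetical helper: number of bytes the string occupies as UTF-8.
(define (utf8-byte-count str)
  (bytevector-length (string->utf8 str)))

(display (string-length e-acute))          ; 1: one character
(newline)
(display (utf8-byte-count e-acute))        ; 2: two bytes in UTF-8
(newline)
(display (char? (string-ref e-acute 0)))   ; #t: string-ref yields a character
(newline)
```

So a string index is a character position, and the byte representation only appears once an encoding is chosen.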
> IOW, it's only the displaying of the chars that should be broken,
> not file operations.
LANG=bogus guile
(guile-user)> (setlocale LC_ALL)
(guile-user)> (use-modules (ice-9 i18n))
(guile-user)> (locale-encoding)
$2 = "ANSI_X3.4-1968"
Apparently the fallback encoding is ‘ANSI_X3.4-1968’. Let's take a
look at this encoding. According to IANA
(https://www.iana.org/assignments/character-sets/character-sets.xhtml),
this character encoding can also be named ‘US-ASCII’ and is specified
in RFC2046. Some excerpts:
"US-ASCII" does not indicate an arbitrary 7-bit
character set[sic], but specifies that all octets in the body must
be interpreted as characters according to the US-ASCII character
set.
So it looks like, say, é cannot be encoded as US-ASCII: it does not
belong to the character set of the encoding. More generally, anything
beyond Unicode code point 127 cannot be encoded in ANSI_X3.4-1968.
Let's test this (in a new REPL with a UTF-8 locale):
((@ (ice-9 iconv) string->bytevector) "é" "ANSI_X3.4-1968")
ice-9/boot-9.scm:1669:16: In procedure raise-exception:
Throw to key `encoding-error' with args `("put-char" "conversion to port
encoding failed" 84 #<output: string 7fd5bbc23ee0> #\é)'.
((@ (ice-9 iconv) string->bytevector) "é" "ANSI_X3.4-1968" 'substitute)
$2 = #vu8(63)
((@ (rnrs bytevectors) utf8->string) #vu8(63))
$3 = "?"
and the other direction:
((@ (ice-9 iconv) bytevector->string) #vu8(128) "ANSI_X3.4-1968" 'substitute)
$5 = "�" ;; why #\� and not #\?? I don't know, I guess Guile is inconsistent
(FWIW, I would throw a decoding error here instead of silently corrupting the
file names.)
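To tie this back to the question of whether the conversions are lossless in both directions, here is a rough sketch built on the same `(ice-9 iconv)` procedures as the transcripts above. The helpers `ascii-only?` and `ascii-round-trip` and the variable `%ascii` are made up for illustration; this is not how Guile or Guix handle file names internally, only a way to see where information is lost.

```scheme
(use-modules (ice-9 iconv))

;; The fallback encoding reported by (locale-encoding) above.
(define %ascii "ANSI_X3.4-1968")

;; Hypothetical helper: true when every character is below code point 128.
(define (ascii-only? str)
  (string-every (lambda (c) (< (char->integer c) 128)) str))

;; Hypothetical helper: encode to ASCII and decode again, substituting
;; whatever cannot be represented (as in the transcripts above).
(define (ascii-round-trip str)
  (bytevector->string
   (string->bytevector str %ascii 'substitute)
   %ascii 'substitute))

;; "café", with é built from its code point.
(define cafe (string #\c #\a #\f (integer->char #xE9)))

(display (string=? (ascii-round-trip "cafe") "cafe"))  ; #t: pure ASCII survives
(newline)
(display (string=? (ascii-round-trip cafe) cafe))      ; #f: é was substituted
(newline)
```

Under these assumptions the round trip is only lossless for strings where `ascii-only?` holds; anything else comes back as a different name, which is enough to break a later file lookup or deletion.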
Greetings,
Maxime.