bug-guix

bug#48114: Disarchive occasionally fails tests


From: Timothy Sample
Subject: bug#48114: Disarchive occasionally fails tests
Date: Mon, 03 May 2021 00:02:09 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)

Timothy Sample <samplet@ngyro.com> writes:

> I’m still looking into this, but I wanted to quickly post this
> reproducer for the Guile bug:
>
>     (use-modules (ice-9 regex))
>     (define str
>       "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
>     (match:substring (string-match "[0-8]+" str))
>
> This triggers the out-of-range error when run with “LC_ALL=C”.

It turns out that all that’s needed is the last code point, which is
“Number Eleven Full Stop”, or ‘⒒’.  When Guile converts this to an ASCII
C string using ‘u32_conv_from_encoding’, it becomes “11.”.  The regex
(“[0-8]+”) matches the “11” part with start index 0 and end index 2.
The ‘fixup_multibyte_match’ function does nothing (it only matters when
the locale encoding is multibyte) [1].  Guile then builds the match
vector with the original string but keeps the ASCII offsets.  In other
words, it thinks the match substring spans offsets 0 to 2 in a string
that is only one code point long:

    ,use (ice-9 regex)
    (string-match "11" "\u2492")
    => #("\u2492" (0 . 2))
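
One way to notice this situation from Scheme is to compare the offsets
in the match vector against the length of the string stored next to
them.  The following is only a sketch: ‘match-in-range?’ is a
hypothetical helper, not part of (ice-9 regex), and it catches only the
out-of-range case.  A mismatch that happens to stay in range would still
go unnoticed.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper, not part of Guile.  Return #t if the offsets of
;; the whole match M fit inside the string recorded in the match vector.
(define (match-in-range? m)
  (<= 0
      (match:start m)
      (match:end m)
      (string-length (match:string m))))
```

Calling ‘match:substring’ only when this check passes would avoid the
out-of-range error, but it would not make the offsets meaningful.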

I’m not sure there’s any way to solve this nicely in Guile.  It would be
clearer if the match vector included the string as libc matched it, but
it’s still surprising that the match happens with a different string.

In Disarchive, I can rewrite the generator without regex.  I’ll do that
and see what I can do about the “Gave up!” issue.
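
For a pattern as simple as the one in the reproducer, a regex-free
version can work on Scheme characters directly, so no locale conversion
takes place.  This is a minimal sketch and not Disarchive’s actual
generator code; ‘first-run’ and ‘digits-0-8’ are names made up for
illustration.  It relies on ‘string-index’ and ‘string-skip’ from
Guile’s core string procedures.

```scheme
;; Characters accepted by the regex "[0-8]" from the reproducer.
(define digits-0-8 (string->char-set "012345678"))

;; Hypothetical helper: return the first maximal run of characters from
;; the char-set CS in STR, or #f if STR contains none of them.
(define (first-run str cs)
  (let ((start (string-index str cs)))
    (and start
         ;; string-skip returns #f when the run extends to the end.
         (let ((end (or (string-skip str cs start)
                        (string-length str))))
           (substring str start end)))))
```

With this, (first-run "\u2492" digits-0-8) evaluates to #f, because ‘⒒’
is compared as a single character and is not in the set.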

[1] It works on the converted-to-ASCII C string, which means that the
byte offsets and code point offsets are the same.  Hence, it has nothing
to do.


-- Tim




