bug-guix
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’


From: Timothy Sample
Subject: bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
Date: Mon, 03 Jun 2019 10:24:40 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux)

Hi Ludo,

Ludovic Courtès <address@hidden> writes:

> Hi Timothy,
>
> Timothy Sample <address@hidden> skribis:
>
>> Here’s a patch for Guile that uses explicit lists of characters in the
>> ‘(web uri)’ module instead of character ranges.  It includes two tests
>> that are pretty verbose, but seem to do the trick.
>>
>> I have a bit more background on the problem, mostly coming from a Glibc
>> bug report: <https://sourceware.org/bugzilla/show_bug.cgi?id=23393>.
>>
>> It turns out that it is well-known upstream, and avoiding character
>> ranges is the recommended approach for know.  Some other GNU tools have
>> adopted what is being called the “Rational Range Interpretation”
>> <https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html>.
>> AIUI, this means they use the underlying encoding numbers for ranges (I
>> checked the source, but I’m only mostly sure I read it right).  It looks
>> like the Glibc folks are unsure how to proceed on this (but are maybe
>> slightly leaning towards the “rational” approach).
>
> Great that you gleaned good references on this topic!
>
>> It’s all a pretty big mess, really.  I was hoping there would be some
>> obvious thing that would fix the problem more generally.  Short of
>> pulling in the Gnulib regex code or writing something in Scheme, it
>> looks like Guile is stuck where it is now.
>
> Yeah.  The alternative would be to not use regexps in this context, I
> guess.

I meant fixing regexes in other contexts, since I’m sure the URI module
is not the only Guile code ever that assumed “[a-z]” would only match
ASCII lowercase letters.

>> I’m unsure if the changes are considered “trivial” from a copyright
>> perspective.  It’s pretty close, but I think programmers tend to
>> underestimate here.  I’ve started the FSF copyright assignment process
>> either way, since is likely not my last Guile patch.  :)
>
> If the process is already underway, I think it’s fine to commit this
> patch (I would rather wait if it were longer and/or if we didn’t know
> each other already).

Sounds good!

>> From 7b02be4c050c7b17a0e2685e8e453295f798c360 Mon Sep 17 00:00:00 2001
>> From: Timothy Sample <address@hidden>
>> Date: Sun, 2 Jun 2019 14:41:20 -0400
>> Subject: [PATCH] Make URI handling locale independent.
>>
>> Fixes <https://bugs.gnu.org/35785>.
>>
>> * module/web/uri.scm (digits, hex-digits, letters): New variables.
>> (ipv4-regexp, ipv6-regexp, domain-label-regexp, top-label-regexp,
>> userinfo-pat, host-pat, ipv6-host-pat, port-pat, scheme-pat): Explicitly
>> list each character instead of using character ranges.
>> * test-suite/tests/web-uri.test: Add corresponding tests.
>
> [...]
>
>> +  (pass-if "http://www.example.com (sv_SE)"
>> +    (dynamic-wind
>> +      (lambda () #t)
>> +      (lambda ()
>> +        (with-locale "sv_SE.utf8"
>> +          (reload-module (resolve-module '(web uri)))
>> +          (uri=? (string->uri "http://www.example.com";)
>> +                 #:scheme 'http #:host "www.example.com" #:path "")))
>
> Aren’t ‘reload-module’ calls a leftover that can now be removed (also in
> the other test)?

I needed to reload the modules like that to make the tests fail without
the patch and pass with it.  My understanding is that the bug happens
at regex compile time, which happens when the module is loaded.  If I
don’t reload the module, the old URI code passes the tests, since the
regexes were compiled with a locale that does not trigger the bug.  It’s
a little wacky, sure, but it was the best idea I could come up with.

> For the sv_SE test, what about taking a host name with a ‘w’, since
> that’s the use case that allowed us to uncover this bug?

I thought I was being clever by using a “www” hostname, but apparently
it’s so normalized as to be invisible!  Feel free to change it to
something more obvious like “w.com” or whatever.


-- Tim





reply via email to

[Prev in Thread] Current Thread [Next in Thread]