bug-guile

bug#11197: problems with string ports and unicode


From: Mark H Weaver
Subject: bug#11197: problems with string ports and unicode
Date: Wed, 11 Apr 2012 13:53:21 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.92 (gnu/linux)

Hi Ludovic,

address@hidden (Ludovic Courtès) writes:
> Mark H Weaver <address@hidden> skribis:
>> address@hidden (Ludovic Courtès) writes:
>>> It may be that your string ports are created with a non-Unicode-capable
>>> encoding.  Try something like:
>>>
>>>   (define p
>>>     (with-fluids ((%default-port-encoding "UTF-8"))
>>>       (open-input-string "čtyří")))
>>
>> IMO, this should not be needed.  Port encodings should only be relevant
>> when reading from ports involving byte strings, such as file ports or
>> socket ports.  The encoding used by Scheme strings is a purely internal
>> matter; from the user's perspective, Scheme strings are simply a
>> sequence of Unicode code points.
>
> Note that “UTF-8” above has nothing to do with Guile’s internal string
> representation; it’s just one of the many encodings that can represent
> “čtyří”.

Okay, now I understand.  The problem is that internally, string ports
are implemented by converting the string into a stream of bytes in the
string port's encoding, and then the string port reads those bytes.

Nonetheless, it is very unfortunate that this internal implementation
detail "leaks" out into user code.  SRFI-6 says nothing about port
encodings, and portable code written for SRFI-6 will fail on Guile
unless the string is constrained to whatever the default port encoding
happens to be.

Conceptually, a string port is a textual port, not a binary port.  You
should be able to hand it an arbitrary string and read those characters
from it, as described in SRFI-6, without setting Guile-specific fluid
variables.  Similarly, you should be able to write arbitrary characters
to a string-output-port.
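
For illustration, here is a minimal sketch of the kind of portable
SRFI-6 code in question.  The helper name string->char-list is
hypothetical; only open-input-string, read-char and eof-object? are
used, and nothing in it mentions a port encoding.  Whether it returns
the original characters for non-ASCII input on the Guile version
discussed here depends on the default port encoding in effect.

```scheme
;; Hypothetical helper: read every character back out of a string
;; through a SRFI-6 string input port.
(define (string->char-list s)
  (let ((port (open-input-string s)))
    (let loop ((acc '()))
      (let ((c (read-char port)))
        (if (eof-object? c)
            (reverse acc)               ; end of input: restore order
            (loop (cons c acc)))))))    ; otherwise keep collecting

;; Portable usage; no Guile-specific fluids involved.
(string->char-list "abc")
```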

IMO, string ports should use UTF-8 as their initial port encoding, since
we know that UTF-8 can represent any Guile string.  This will allow
portable use of string ports.

I realize that this would change the existing behavior of programs that
use binary I/O on string ports, but as things stand right now, portable
SRFI-6 code is broken on Guile.

What do you think?

>> What _is_ needed is a file coding declaration near the top of the source
>> file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
>> the manual).
>
> Yes.  And you actually need both–i.e., the ‘coding’ cookie won’t
> magically make string ports use that encoding.
>
>> I tried that and it still fails for me.
>
> What fails exactly?

It fails ungracefully (goes into an infinite loop while trying to print
the backtrace) without the %default-port-encoding setting.  It works when
I add both the %default-port-encoding setting and the coding declaration.
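
A sketch of the working combination, assuming the source file itself
is saved as UTF-8.  The wrapper name open-utf8-input-string is
hypothetical; with-fluids and %default-port-encoding are the same
Guile facilities used in the example quoted earlier in this message.

```scheme
;;; -*- coding: utf-8 -*-
;; The coding declaration above tells Guile how to decode this
;; source file, so the string literal below is read correctly.

;; Hypothetical wrapper: create the string port while the default
;; port encoding is bound to UTF-8, so any string can be represented.
(define (open-utf8-input-string s)
  (with-fluids ((%default-port-encoding "UTF-8"))
    (open-input-string s)))

(define p (open-utf8-input-string "čtyří"))
(read-char p)
```

Binding the fluid inside a wrapper keeps the Guile-specific part in one
place, so the rest of the code can stay plain SRFI-6.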

     Thanks,
       Mark




