guile-user

Re: iconv or something like that


From: Mark H Weaver
Subject: Re: iconv or something like that
Date: Thu, 23 Oct 2014 14:00:31 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.94 (gnu/linux)

Konrad Makowski <address@hidden> writes:
> Is there any solution to convert charset from one encoding to another?

Yes, but character encodings are only relevant when converting between a
sequence of _bytes_ (a bytevector), and a sequence of _characters_ [*]
(a string).  These conversions happen implicitly while performing I/O,
converting Scheme strings to/from C, etc.

[*] More precisely, Scheme strings are sequences of Unicode code points.

It doesn't make sense to talk about the encoding of a Scheme string, or
to convert a Scheme string from one encoding to another, because they
are not byte sequences.
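
To make the distinction concrete, here is a minimal sketch using the
(ice-9 iconv) module that ships with recent Guile versions.  The sample
word and the names 'word' and 'recode' are only illustrations, not
anything from guile-dbi:

```scheme
(use-modules (ice-9 iconv))

;; "żółw" built from code points, so this file's own encoding is irrelevant.
(define word
  (list->string (map integer->char '(#x17C #xF3 #x142 #x77))))

;; Encoding: string -> bytevector.  Same string, different byte sequences.
(define latin2-bytes (string->bytevector word "ISO-8859-2")) ; 4 bytes
(define utf8-bytes   (string->bytevector word "UTF-8"))      ; 7 bytes

;; Decoding: bytevector -> string.  Both decode back to the same string,
;; provided each is decoded with the encoding it was produced with.
(define from-latin2 (bytevector->string latin2-bytes "ISO-8859-2"))
(define from-utf8   (bytevector->string utf8-bytes "UTF-8"))

;; "Converting between encodings" is therefore decode, then re-encode,
;; and it operates on bytevectors, never on strings.
(define (recode bv from to)
  (string->bytevector (bytevector->string bv from) to))
```

For example, (recode latin2-bytes "ISO-8859-2" "UTF-8") should give the
same bytevector as utf8-bytes.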

It sounds like you already have a Scheme string that was incorrectly
decoded from bytes, and are asking how to fix it up.  Unfortunately,
that cannot be done reliably: many valid ISO-8859-2 byte sequences are
not valid UTF-8, so decoding them as UTF-8 leads to decoding errors
before you ever get a string to fix.
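
As a rough illustration of that failure, again assuming the
(ice-9 iconv) module is available, the sketch below decodes the same
ISO-8859-2 bytes with the wrong and with the right encoding.  The helper
'try-decode' is made up for this example, and it catches every error
rather than assuming a particular error key:

```scheme
(use-modules (ice-9 iconv))

;; ISO-8859-2 bytes for the Polish word "żółw".
(define latin2-bytes #vu8(#xBF #xF3 #xB3 #x77))

;; Returns the decoded string, or #f if decoding signals an error.
(define (try-decode bv encoding)
  (catch #t
    (lambda () (bytevector->string bv encoding))
    (lambda (key . args) #f)))

;; Wrong encoding: #xBF cannot start a UTF-8 sequence, so this fails.
(define as-utf8 (try-decode latin2-bytes "UTF-8"))

;; Right encoding: yields a four-character string.
(define as-latin2 (try-decode latin2-bytes "ISO-8859-2"))
```

The fix has to happen at the point where the bytes are decoded, not
afterwards on the string.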

> I have database in iso-8859-2 but my script runs in utf-8. I use dbi module.

Having looked at the guile-dbi source code, I see that it always uses
the current locale encoding when talking to databases.  Specifically, it
always uses 'scm_from_locale_string' and 'scm_to_locale_string'.  For
your purposes, you'd like it to use 'scm_from_stringn' and
'scm_to_stringn' instead, with "ISO-8859-2" as the 'encoding' argument.

My knowledge of modern databases is limited, so I'm not sure how this
problem is normally dealt with.  It seems to me that, ideally, strings
in databases should be sequences of Unicode code points, rather than
sequences of bytes.  If that were the case, then this problem wouldn't
arise.

It would be good if someone with more knowledge of databases would chime
in here.

In the meantime, I can see a few possible solutions/workarounds:

* Enhance guile-dbi by adding an 'encoding' field to its database
  handles and a new API procedure to set it, and use that field in all
  the appropriate places.  This only makes sense if database strings are
  conceptually byte sequences; otherwise it should probably be fixed in
  some other way.

* Hack your local copy of guile-dbi to use 'scm_from_stringn' and
  'scm_to_stringn' with a hard-coded "ISO-8859-2" in the appropriate
  places.

* Use 'setlocale' to set an ISO-8859-2 locale temporarily while
  performing database queries.
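
The 'setlocale' workaround could look roughly like the sketch below.
The helper name 'call-with-locale' is made up, the locale name in the
usage comment is only a guess at what your system calls it (it must be
installed; check "locale -a"), and I have not tested this against
guile-dbi:

```scheme
;; Run THUNK with LOCALE installed, restoring the previous locale
;; afterwards, even if THUNK exits non-locally.
(define (call-with-locale locale thunk)
  (let ((saved (setlocale LC_ALL)))   ; one argument: query current locale
    (dynamic-wind
      (lambda () (setlocale LC_ALL locale))
      thunk
      (lambda () (setlocale LC_ALL saved)))))

;; Hypothetical usage, where 'run-queries' stands for your own code
;; that talks to the database:
;;
;;   (call-with-locale "pl_PL.ISO-8859-2" run-queries)
```

Note that the locale is process-wide state, so anything else that does
I/O or string conversion inside the thunk is affected as well.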

Which database are you using?

     Mark


