Re: CVS and unicode

From: Christian Hujer
Subject: Re: CVS and unicode
Date: Wed, 7 Sep 2005 22:58:37 +0200
User-agent: KMail/1.7.1

On Tuesday, 6 September 2005 01:17, Yves Dorfsman wrote:
> Hi,
> Has anybody run into problem with GNU CVS and unicode ?
> I have made a few tests (with UTF8) and so far it worked, but some of my
> users are saying they did run into problem with some files. I can see how
> some legal UTF8 characters could be confused as control code/binary.
> Does anybody have extensive experience with this ?

Encoding problems arise on the operating system / editor side.
CVS itself does not care about the encoding of a text file.

Somebody wrote:
>> In CVS a Unicode file has to be a Binary file (-kb) - which prevents
>> merging, diffs, etc etc.  If you do not define it as -kb then eventually
>> the file will be corrupted. 
This is completely wrong and lacks any technical substance.
First of all, Unicode is not a file encoding at all; the encoding is UTF-8 or
UTF-16.
Now to the core. UTF-8 files needn't be binary files; in fact, if you want
the normal CVS behaviour you're used to from ASCII text files, they
mustn't be binary files. The byte sequence of strings like "$Revision$" is
identical in UTF-8 encoded Unicode and in plain 7-bit US-ASCII.
In fact, every 7-bit US-ASCII file is a valid UTF-8 encoded Unicode file,
and conversely, if you only use the first 128 Unicode code points, your UTF-8
encoded Unicode text is valid ASCII. Even more, the texts are 100% identical
down to the last bit.
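
A quick way to check this yourself, as a small Python sketch (Python is my
choice for illustration here, it has nothing to do with CVS itself):

```python
# The CVS keyword as it would appear in a source file.
keyword = "$Revision$"

# Encode the same text as 7-bit US-ASCII and as UTF-8.
ascii_bytes = keyword.encode("ascii")
utf8_bytes = keyword.encode("utf-8")

# The two byte sequences are identical.
print(ascii_bytes == utf8_bytes)  # True

# The same holds for all 128 ASCII code points at once.
all_ascii = bytes(range(128))
print(all_ascii.decode("ascii") == all_ascii.decode("utf-8"))  # True
```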
Differences occur with extended encodings like ISO-8859-x (e.g. ISO-8859-1 or
ISO-8859-15) or the Windows code pages (e.g. Windows CP-1252). In these
encodings, the 128 ASCII code points are extended by up to 128 additional code
points with the high bit set. In UTF-8, a byte with the high bit set is always
part of a multibyte sequence.
For instance, the lower-case u with umlaut (as occurring in German, Turkish
and some other languages) has Unicode code point 252 (U+00FC). The ISO-8859-1
code point is 252 as well. But the byte sequences in UTF-8 and ISO-8859-1 are
different.
In ISO-8859-1, the byte sequence is 0xFC, while in UTF-8 the byte sequence for
the same character is 0xC3 0xBC.
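
The umlaut example can be reproduced with a few lines of Python. This is only
a sketch to make the byte values visible; the helper name as_hex is made up
for this mail:

```python
def as_hex(data: bytes) -> str:
    """Format a byte sequence like '0xC3 0xBC' (hypothetical helper)."""
    return " ".join(f"0x{b:02X}" for b in data)

u_umlaut = "\u00fc"  # lower-case u with umlaut

# Same code point in Unicode and ISO-8859-1 ...
print(ord(u_umlaut))  # 252

# ... but different byte sequences on disk.
latin1_bytes = u_umlaut.encode("iso-8859-1")
utf8_bytes = u_umlaut.encode("utf-8")
print(as_hex(latin1_bytes))  # 0xFC
print(as_hex(utf8_bytes))    # 0xC3 0xBC
```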

The issue is not CVS; the issue is telling your editor the correct file
encoding. It is the text editor that interprets the byte sequences.
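
What a "corrupted" file usually looks like in practice: the bytes are
untouched, but they are read with the wrong encoding. A minimal Python
sketch of both directions (my illustration, not something CVS does):

```python
# A file that contains the UTF-8 bytes for the u umlaut.
utf8_bytes = b"\xc3\xbc"

# Read with the right encoding: one character.
as_utf8 = utf8_bytes.decode("utf-8")

# Read as ISO-8859-1: two unrelated characters (the typical garbage).
as_latin1 = utf8_bytes.decode("iso-8859-1")
print(len(as_utf8), len(as_latin1))  # 1 2

# The other way round: an ISO-8859-1 byte is not valid UTF-8 at all.
latin1_bytes = b"\xfc"
try:
    latin1_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```

In both cases the bytes in the repository are exactly what was committed;
only the interpretation differs.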

On UNIX, most editors determine the default encoding from the locale
environment settings, which can be printed with the locale command. Refer to
your UNIX system manual for more information. Most UNIX-like systems let you
change this setting with, for example, export LC_ALL=de_DE.UTF-8 (warning:
LC_ALL overrides all other locale settings), which sets the locale to German
for Germany using UTF-8 encoding. Be warned: only newly started processes
(especially terminals!) will pick this up, so if you want to use it
permanently, put it somewhere in your .profile or .bashrc.
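
To illustrate why LC_ALL "overrides all", here is a rough Python sketch of
the usual POSIX lookup order for the character encoding (LC_ALL, then
LC_CTYPE, then LANG). The function names are made up for this mail, and real
programs let the C library do this; the fallback to "C" is an assumption,
since the default is implementation-defined:

```python
import os

def effective_ctype_locale(env):
    """Return the locale name that decides the character encoding.

    Sketch of the POSIX precedence: LC_ALL beats LC_CTYPE beats LANG.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset and empty both mean "not specified"
            return value
    return "C"  # assumed fallback; implementation-defined in practice

def codeset_of(locale_name):
    """Extract the codeset from language[_territory][.codeset][@modifier]."""
    without_modifier = locale_name.split("@", 1)[0]
    if "." in without_modifier:
        return without_modifier.split(".", 1)[1]
    return None  # no explicit codeset in the locale name

example_env = {"LANG": "en_US.ISO-8859-1", "LC_ALL": "de_DE.UTF-8"}
chosen = effective_ctype_locale(example_env)
print(chosen, codeset_of(chosen))  # de_DE.UTF-8 UTF-8

# For your own session (output depends on your environment):
print(effective_ctype_locale(os.environ))
```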

The Unicode thingy in CVSNT is just a hack to work around operating system 
issues regarding MS Windows.

I have been using UTF-8 in tons of files (all my Java sources are UTF-8
encoded, as are most of my C++ sources and, of course, all my XML files) for
years now, without any problems.

UTF-16, however, can be problematic. Normal keyword substitution is likely to
fail, at least with some older versions of CVS. I don't know whether newer CVS
versions use wchar instead of char for keyword substitution. UTF-16 isn't in
widespread use, so I haven't looked into that yet.
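
The reason is easy to see at the byte level. The following Python sketch only
shows what the bytes look like; it makes no claim about how CVS actually
scans for keywords, only that a purely byte-oriented search for the ASCII
keyword would not find it:

```python
keyword = "$Revision$"

# UTF-16 (little endian, without BOM) uses two bytes per character here.
utf16_bytes = keyword.encode("utf-16-le")
print(len(keyword), len(utf16_bytes))  # 10 20

# Every ASCII character is followed by a NUL byte.
print(utf16_bytes[:4])  # b'$\x00R\x00'

# So the plain ASCII byte sequence does not occur in the UTF-16 data.
found = keyword.encode("ascii") in utf16_bytes
print(found)  # False
```

The embedded NUL bytes are also why such files tend to be treated as binary
by tools that expect char-based text.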

Christian Hujer
E-Mail: address@hidden
