info-cvs

From: Arno Schuring
Subject: Re: CVS and unicode
Date: Thu, 8 Sep 2005 10:59:58 +0200

On Tuesday, 6 September 2005 at 01:17, Yves Dorfsman wrote:
> Hi,
>
> Has anybody run into problems with GNU CVS and Unicode?
>
> I have made a few tests (with UTF-8) and so far it worked, but some of my
> users are saying they ran into problems with some files. I can see how
> some legal UTF-8 characters could be mistaken for control codes or binary
> data.
>
> Does anybody have extensive experience with this?

Yes.

> Encoding problems are operating system / editor side.
> CVS does not care about anything regarding the encoding.

Except that the diffs between files are computed byte-wise, not character-wise. This could lead to problems when multi-byte characters occur, though such problems are rare. My guess is that merging will succeed most of the time, but the resulting diff might contain invalid byte sequences (although, considering CVS's line-by-line diff, that may never occur in practice).
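The byte-wise versus character-wise distinction is easy to make concrete: in UTF-8 an accented character occupies two bytes, so a byte-level comparison reports a two-bytes-for-one change where a character-level one sees a single character replaced. A small illustration (Python's difflib standing in for CVS's diff engine, which it of course is not):

```python
import difflib

old = "naïve approach".encode("utf-8")   # 'ï' is two bytes: 0xC3 0xAF
new = "naive approach".encode("utf-8")

# Byte-level diff: the one-character change shows up as a
# two-bytes-replaced-by-one operation.
byte_ops = [op for op in difflib.SequenceMatcher(a=old, b=new).get_opcodes()
            if op[0] != "equal"]

# Character-level diff on the decoded text: one character replaced.
char_ops = [op for op in
            difflib.SequenceMatcher(a=old.decode("utf-8"),
                                    b=new.decode("utf-8")).get_opcodes()
            if op[0] != "equal"]

assert byte_ops == [("replace", 2, 4, 2, 3)]   # two bytes -> one byte
assert char_ops == [("replace", 2, 3, 2, 3)]   # one char -> one char
```

Note that a diff expressed in such byte ranges can begin or end in the middle of a multi-byte sequence, which is exactly how an invalid byte sequence could appear in a diff.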


Somebody wrote:
> > In CVS a Unicode file has to be a Binary file (-kb) - which prevents
> > merging, diffs, etc etc.  If you do not define it as -kb then eventually
> > the file will be corrupted.
> This is completely wrong and lacks any technical substance.

Both are pretty firm statements, and both are partially right and partially wrong. Nobody has yet produced a multi-byte sequence that leads to merging or diffing problems, but since CVS was not designed for multi-byte characters, one may very well exist (+).

> The issue is not CVS, the issue is telling your editor about the correct
> file encoding. It's the text editor and how it interprets byte sequences.

I agree partially. Sure, the editor must support multi-byte characters. But I think it is a mistake for CVS to claim Unicode support without supporting the environment's native encoding. "We only support Unicode if you manually save all your text files in UTF-8" does not constitute full Unicode support, I think.
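That workaround amounts to transcoding every text file by hand before it touches CVS. A rough sketch of what "save it as UTF-8 yourself" means, assuming (purely for illustration) that the environment's native encoding is Latin-1:

```python
# CVS performs no conversion, so the user must transcode text from the
# environment's native encoding (Latin-1 assumed here) to UTF-8 before
# committing.
native_bytes = "naïve café".encode("latin-1")        # as the editor saved it
utf8_bytes = native_bytes.decode("latin-1").encode("utf-8")

assert utf8_bytes.decode("utf-8") == "naïve café"    # text is preserved
assert native_bytes != utf8_bytes                    # but the bytes change
```

The byte sequences really do differ, which is why an editor (or checkout on another machine) that still assumes the native encoding will display mojibake.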

> The Unicode thingy in CVSNT is just a hack to work around operating system
> issues regarding MS Windows.

I have no idea what the Unicode support in CVSNT does. I always thought it was there to prevent invalid CR/LF conversions inside multi-byte characters. For example, code point 522 (U+020A) is encoded in UTF-16BE as the bytes 0x02 0x0A, and that trailing 0x0A looks like a line feed to a byte-oriented tool. But I never used CVSNT and don't know whether it supports UTF-16 (both BE and LE) or only UTF-8. The conversion problem cannot even occur in UTF-8, because all bytes of a multi-byte sequence are >= 0x80.
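A few lines of Python (used here purely to demonstrate the encoding property, not anything CVSNT actually does) make the CR/LF hazard visible:

```python
# U+020A encodes in UTF-16BE with a trailing 0x0A byte -- the same byte
# as LF -- so a byte-wise CR/LF conversion would corrupt the character.
char = "\u020a"   # LATIN CAPITAL LETTER I WITH INVERTED BREVE

utf16 = char.encode("utf-16-be")
assert utf16 == b"\x02\x0a"
assert b"\n" in utf16                      # looks like a line ending

# In UTF-8 every byte of a multi-byte sequence is >= 0x80, so LF (0x0A)
# can never appear inside a character.
utf8 = char.encode("utf-8")
assert utf8 == b"\xc8\x8a"
assert all(byte >= 0x80 for byte in utf8)
```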

> I'm using UTF-8 in tons of files (all my Java sources are UTF-8 encoded as
> well as most of my C++ sources and, of course, all my XML files) for years
> now, without any problems.

I do the same, and have not encountered problems so far. But that does not mean there are no issues; what you and I are doing is a workaround that happens to make Unicode usable with CVS.

> UTF-16 in fact can be problematic. Normal keyword substitution is likely to
> fail, at least with some older versions of CVS. I don't know whether newer
> CVS uses wchar instead of char for keyword substitution. UTF-16 isn't in
> widespread use, so I didn't care about that yet.

Maybe it would be worth investigating having the client handle all character set conversions, and always storing repository files as UTF-8. As far as you and I can attest, there are (so far) no issues with handling UTF-8 files, so UTF-16 might then cease to be a problem. But then again, if the character-set support is already in the client, how much effort would it take to move it to the server too? But I'm not developing CVS, so I will kindly stop talking now ;)


Arno




