Re: CVS and unicode

From: Christian Hujer
Subject: Re: CVS and unicode
Date: Sat, 10 Sep 2005 12:38:19 +0200
User-agent: KMail/1.7.1


On Thursday, 8 September 2005 10:59, Arno Schuring wrote:
> > Encoding problems are operating system / editor side.
> > CVS does not care about anything regarding the encoding.
> Except that the diffs between files are computed byte-wise, not
> character-wise. This could lead to problems when multi-byte characters
> occur, though these problems are rare at best. My guess is that merging
> will succeed most of the time, but the actual diff might contain invalid
> byte sequences (although - considering cvs' line-by-line diff, that may
> never occur).
Because CVS compares line by line, not byte by byte, it will never fragment a 
multi-byte character.

> > Somebody wrote:
> >>> In CVS a Unicode file has to be a Binary file (-kb) - which prevents
> >>> merging, diffs, etc etc.  If you do not define it as -kb then
> >>> eventually the file will be corrupted.
> >
> > This is completely wrong and lacks any technical substance.
> Both pretty firm statements. And both partially right, partially wrong.
> There has yet to be found the first multi-byte sequence that leads to
> merging/diffing problems, but since CVS was not designed for multi-byte
> characters, there may very well be one (+).
No, there won't be one.
Multi-byte characters never span line endings; they only occur within a 
line, and CVS always diffs line by line. CVS is therefore guaranteed never to 
fragment a multi-byte character: the bytes of a multi-byte sequence always 
stay together.
Remember, in UTF-8 every byte of a multi-byte sequence has the high bit set, 
while single-byte (ASCII) characters have the high bit cleared. Thus CR or NL 
(formerly LF) can never occur within a multi-byte sequence and break a UTF-8 
character apart.
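That guarantee can be checked directly; a minimal Python sketch (the sample text is arbitrary):

```python
# Verify that no byte of a UTF-8 multi-byte sequence can collide
# with CR (0x0D) or LF (0x0A): every byte of such a sequence has
# the high bit set (>= 0x80).
sample = "Grüße, 日本語, Ωμέγα\nsecond line"
data = sample.encode("utf-8")

for ch in sample:
    encoded = ch.encode("utf-8")
    if len(encoded) > 1:
        # All bytes of a multi-byte character are >= 0x80.
        assert all(b >= 0x80 for b in encoded)

# Therefore splitting the byte stream on LF can never cut a
# character in half: decoding each line separately still works.
lines = data.split(b"\n")
assert [l.decode("utf-8") for l in lines] == sample.split("\n")
print("no multi-byte character can contain CR or LF")
```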

> > The issue is not CVS, the issue is telling your editor about the correct
> > file
> > encoding. It's the text editor and how it interprets byte sequences.
> I agree partially. Sure, the editor must support multi-byte characters. But
> I think it is an error for CVS to start supporting unicode without
> supporting the environment's native encoding. "we only support unicode if
> you manually save all your text files in UTF-8" does not constitute full
> unicode support, I think.
For me it does.
To me, Unicode support means finding a way to support using all Unicode 
characters. That's done.
Supporting thousands of old legacy encodings like Windows CP 1252, ISO-8859-x, 
euc-kr, koi8-r, gbk etc. etc. is not the task of a tool like CVS.
In fact, I am at the point of thinking that CVS even MUST NOT do so. 
Currently, CVS behaves extremely tolerantly towards binary files that were 
accidentally added as text files. As long as they do not contain keywords 
(like $Id...$), they are extremely likely to still be handled correctly. 
The -kb option for disabling keyword substitution is only really needed in 
those rare cases where a byte sequence that looks like a keyword occurs in a 
file.

It is even possible to calculate the chance mathematically, assuming that 
binary files consist of mostly random bytes (which, from this point of view, 
is more or less true). The chance of the three bytes $Id occurring in a 
binary file is roughly (fl - 2) / (2^8)^3, where fl is the file length. 
Generalised over all keywords: Sum(over all keywords) of 
(fl - l) / (2^8)^(l + 1), with fl being the file length and l being the 
keyword name length (the extra byte is the leading $). You'll see that the 
chance is extremely small. The -kb option is just for these rare cases.
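That estimate can be sketched numerically (Python; the keyword list here is illustrative, and matching on the bare $Id prefix already overestimates real substitution, which needs $Id$ or $Id:):

```python
# The estimate above: in a stream of random bytes, a pattern of
# p bytes matches at a given offset with probability 1/256**p, and
# there are (fl - p + 1) possible offsets in a file of fl bytes.
def expected_matches(file_length, keywords=("Id", "Revision", "Author", "Log", "Date")):
    total = 0.0
    for kw in keywords:
        p = len(kw) + 1                      # leading "$" plus the name
        total += max(file_length - p + 1, 0) / 256.0 ** p
    return total

# Even for a 1 MiB binary file the expected count is small:
print(expected_matches(1024 * 1024))         # roughly 0.06
```

The sum is dominated by the shortest keyword ($Id); longer keywords contribute practically nothing.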

> > The Unicode thingy in CVSNT is just a hack to work around operating
> > system issues regarding MS Windows.
> I have no idea what unicode support in CVSNT does. I always thought it was
> to prevent invalid CR/LF conversions in multi-byte characters. For example,
> code point 522 (U+020A) is encoded in UTF-16BE as 0x02 0x0A. But, I never
> used CVSNT and don't know whether they even support UTF-16 (both BE and LE)
> or only UTF-8. The conversion problem does not even occur in UTF-8, because
> all multi-byte components are >= 0x80.
From my point of view, all this stuff is not necessary. Simply use UTF-8 
everywhere and everything will be fine; no need for hacks like CVSNT's. This 
is, of course, not meant to belittle the good work the CVSNT developers did 
with CVSNT. They are just trying to find a solution for those users who for 
some reason don't use the simple UTF-8-only solution, which, in most cases, 
comes down to a lack of knowledge about encodings more than anything else.
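To illustrate the CR/LF point: a Python sketch showing that an LF byte (0x0A) can appear inside a UTF-16 code unit, while a UTF-8 multi-byte sequence never contains one (U+020A chosen as an example):

```python
ch = chr(0x020A)                      # code point 522

# In UTF-16 (big-endian) the low byte is 0x0A -- the LF byte --
# so a naive CR/LF line-ending conversion could corrupt it:
assert ch.encode("utf-16-be") == b"\x02\x0a"

# In UTF-8 both bytes are >= 0x80, so neither can be mistaken
# for CR (0x0D) or LF (0x0A):
assert ch.encode("utf-8") == b"\xc8\x8a"
assert all(b >= 0x80 for b in ch.encode("utf-8"))
print("UTF-16 embeds an LF byte here; UTF-8 does not")
```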

> > I'm using UTF-8 in tons of files (all my Java sources are UTF-8 encoded
> > as well as most of my C++ sources and, of course, all my XML files) for
> > years now, without any problems.
> I do the same, and have not encountered problems so far. But that does not
> mean there are no issues. What you and I are using is a workaround to use
> unicode with CVS.
I wouldn't call it a workaround, because UTF-8 is designed to behave as well 
as possible even under byte-wise processing.

> > UTF-16 in fact can be problematic. Normal keyword substitution is likely
> > to fail at least with some older versions of CVS. I don't know whether
> > newer CVS uses wchar instead of char for keyword substitution. UTF-16
> > isn't in widespread use, so I didn't care about that yet.
> Maybe it would be worth investigating handling all character set
> conversions by the client, and using UTF-8 for all repository files always.
> For as far as you and I can attest, there are (so far) no issues with
> handling UTF-8 files. That way, utf-16 might not be a problem. But then
> again, if the character support is already in the client, how much effort
> would it take to move it to the server too? But I'm not developing cvs, so I
> will kindly stop talking now ;)
I'm not a CVS developer either, but as already explained, I am against moving 
the encoding issue from the editor into CVS, because that is extremely likely 
to introduce more problems than it solves. If the CVS client did the 
conversion, binary files accidentally added without -kb would instantly be 
broken, and if the encoding were guessed wrongly, characters would instantly 
be lost or encoded the wrong way.
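A sketch of that failure mode (Python; the byte values stand in for an arbitrary binary file, here a PNG-like header):

```python
# A few bytes from a hypothetical binary file added without -kb:
binary = b"\x89PNG\r\n\x1a\n\x00\xe9\xff"

# If the client guessed "Latin-1" and transcoded to UTF-8, the
# high bytes would be re-encoded as two-byte sequences:
mangled = binary.decode("latin-1").encode("utf-8")
assert mangled != binary          # the file is silently corrupted
assert len(mangled) > len(binary)

# And if it guessed UTF-8, the transcoding would fail outright,
# since arbitrary binary data is rarely valid UTF-8:
try:
    binary.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8 -- decoding raises instead")
```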

With the current solution, where CVS only handles bytes, even for text files, 
and treats them as ASCII for keyword substitution and diffing (i.e. for 
finding the ends of lines), I can carelessly check out ISO-8859-x files on a 
UTF-8 system or the other way round, simply tell my editor about the 
encoding, and everything will be fine.
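The byte-as-ASCII treatment is also why keyword substitution works for UTF-8 but would fail for UTF-16; a simplified sketch (keyword detection reduced to a plain byte-wise substring search):

```python
text = "/* $Id$ */ int grüße;"

# UTF-8 keeps every ASCII character as the same single byte, so a
# byte-wise scan finds the keyword exactly as in plain ASCII:
assert b"$Id$" in text.encode("utf-8")

# In UTF-16 every ASCII character carries an extra NUL byte, so
# the same byte-wise scan finds nothing:
assert b"$Id$" not in text.encode("utf-16-le")
print("byte-wise keyword scan: works for UTF-8, fails for UTF-16")
```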

Also think about the issues of having to convert all existing repositories, 
or of introducing yet another switch in the CVSROOT/ files; and what if the 
setting should be per module?

IMO, all this would introduce many more issues than it would solve; CVS is 
good as it is (regarding the encoding issues).

Cu :)
Christian Hujer
E-Mail: address@hidden
