
Re: CVS and unicode

From: Christian Hujer
Subject: Re: CVS and unicode
Date: Sat, 10 Sep 2005 12:51:55 +0200
User-agent: KMail/1.7.1


On Thursday, 8 September 2005 00:39, Arthur Barrett wrote:
> Christian,
> >>> In CVS a Unicode file has to be a Binary file (-kb) - which prevents
> >>> merging, diffs, etc etc.  If you do not define it as -kb then
> >>> eventually the file will be corrupted.
> >
> >This is completely wrong and lacks any technical substance.
> Firstly don't mistake me for any Unicode/UTF-8/UTF-16 guru - I was
> simply trying to answer the question in a helpful way.
> This time I'm just trying to clear up a couple of things about what the
> CVSNT for Linux/Unix/Windows (free / GPL) implementation of Unicode
> support can and can't do based on Christian's comments.  I hope the
> information is helpful to those following the discussion.

Forgetting -kb in UTF-8 files will not result in problems.

There are exactly two issues regarding the -kb thing: line-based diff and 
keyword substitution.

Keyword substitution works 100% fine with UTF-8. Keywords are encoded in UTF-8 
just like in ASCII; the byte sequence for $Id$ is identical in both. The rest 
of an expanded keyword is either auto-generated or taken from the OS. The 
auto-generated part is ASCII, so it's UTF-8 compatible. The part taken from 
the OS, such as paths etc., should either be ASCII, or the server's OS had 
better use UTF-8 as its encoding.
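A minimal sketch of why this holds: ASCII characters occupy the same single bytes in UTF-8, and every byte of a multi-byte UTF-8 sequence is >= 0x80, so non-ASCII text can never collide with the ASCII keyword delimiters.

```python
# ASCII is a strict subset of UTF-8: the keyword bytes are identical.
keyword = "$Id$"
assert keyword.encode("ascii") == keyword.encode("utf-8")

# Non-ASCII characters encode entirely to bytes >= 0x80, so the
# surrounding text cannot accidentally produce a '$' (0x24) byte.
text = "Grüße $Id$ Grüße".encode("utf-8")
assert keyword.encode("utf-8") in text
assert all(b >= 0x80 for b in "ü".encode("utf-8"))
```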

For UTF-16, there's an extremely small chance that Unicode characters encoded 
in UTF-16 happen to produce a byte sequence that reads as a meaningful CVS 
keyword in ASCII.
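Such a coincidence can be constructed; the two code points below are chosen purely for illustration, not taken from any real-world file:

```python
# In UTF-16-BE, U+2449 encodes to bytes 0x24 0x49 ('$','I') and U+6424
# to 0x64 0x24 ('d','$'), so these two characters together yield the
# exact ASCII byte sequence "$Id$".
coincidence = "\u2449\u6424"
assert coincidence.encode("utf-16-be") == b"$Id$"
```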

> > Now on the core. UTF-8 files needn't be binary files, in fact, if you
> > want normal CVS behaviour in the way you're used to it for ASCII text
> > files, they mustn't be binary files.
> Yes.  And that was the point of my original reply.  But you've certainly
> worded it better.
> > Differences occur with extended encodings like ISO-8859-x (e.g.
> > ISO-8859-1 or ISO-8859-15 etc.) or Windows CP-* (e.g. Windows
> > CP-1252). In these encodings,
> With CVSNT the file will be checked in/out in UCS-2 (or UTF-16) encoding
> and internally stored as UTF-8 by the server.  You can also use an
> extended encoding  -- any encoding supported by the client-side iconv
> library can be used.  This allows you to specify that a file uses
> ISO-8859-1 and have it converted (by iconv) to the locale used by the
> current client.  This way a single user can checkout 10 files that each
> use different extended encodings and not have to change their
> environment variable for each file (and work out what to change it to).
I see many issues regarding configuration and script usage (these are pretty 
solvable), and I see broken behaviour regarding binary files. What if a user 
adds a binary file and forgets -kb? Currently this is almost never a problem: 
just admin -kb and done; the chances that the binary file has already been 
corrupted by CVS are extremely low (see my response to Arno Shuring's post 
for a mathematical discussion of the chance).
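As a rough sketch of the keyword-expansion side of that risk (the helper below is hypothetical, not part of CVS): a binary file checked in without -kb can only be damaged by keyword expansion if its raw bytes already happen to contain a `$Keyword$` or `$Keyword: ...$` sequence, which random binary data almost never does.

```python
import re

# Hypothetical check: does this byte blob contain anything a CVS
# server would recognise as an expandable keyword?
CVS_KEYWORDS = (b"Id", b"Revision", b"Author", b"Date", b"Log", b"Header")

def at_risk(data: bytes) -> bool:
    # Matches $Id$ as well as an already-expanded $Id: ... $ form.
    pattern = b"\\$(" + b"|".join(CVS_KEYWORDS) + b")(:[^$\\n]*)?\\$"
    return re.search(pattern, data) is not None

assert at_risk(b"\x00\x01$Id$\x02")
assert not at_risk(b"\x00\x01random bytes\xff")
```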

> > The Unicode thingy in CVSNT is just a hack to work around operating
> > system issues regarding MS Windows.
> No (but it helps this too) - see your own next comment.
> > UTF-16 in fact can be problematic. Normal keyword substitution is
> > likely to fail at least with some older versions of CVS.
> Not just keyword substitution, but merges and diffs, line endings etc
> too.
I don't see problems regarding line endings. In UTF-16 a newline is e.g. 0x00 
0x0A or 0x0A 0x00 (depending on whether it's BE or LE) instead of just 0x0A. 
So CVS will split a character during a diff on UTF-16-LE, but that's not a 
problem, because it is the same for every line; when the lines are put back 
together, the 0x00 and 0x0A bytes are attached to each other again.
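The argument above can be sketched in a few lines: a naive byte-wise split at 0x0A (as a line-based tool would do) cuts each UTF-16-LE newline character in half, but rejoining the pieces with the same byte restores the original byte stream exactly.

```python
text = "first line\nsecond line\n"
raw = text.encode("utf-16-le")

# Naive byte-wise line split, as a line-based diff/merge would perform.
lines = raw.split(b"\x0a")
rejoined = b"\x0a".join(lines)

# Nothing is lost: the stray 0x00 halves line up again on rejoin.
assert rejoined == raw
assert rejoined.decode("utf-16-le") == text
```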

> All versions of CVS other than CVSNT need to treat UTF-16 files as
> binary.
I agree, as long as the characters within the UTF-16 files include code points 
from those planes that are likely to result in keywords when you look at the 
resulting byte sequence.

> > uses wchar instead of char for keyword substitution. UTF-16 isn't in
> > widespread use, so I didn't care about that yet.
> UTF-16 is the native internal representation of text in the NT based
> versions of Windows (NT/2000/XP/2003) and in the Java and .NET bytecode
> environments, as well as in Mac OS X's Cocoa and Core Foundation
> frameworks.
My wording regarding "use" was unclear: by "use" I meant use as an encoding 
for text files. The internal representation (e.g. char in Java) does not fall 
into that category.

Cu :)
Christian Hujer
E-Mail: address@hidden
