info-cvs
Re: cvs cannot process Chinese characters?


From: Mark D. Baushke
Subject: Re: cvs cannot process Chinese characters?
Date: Tue, 19 Oct 2004 18:26:14 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

address@hidden writes:

> I have some text files which contains Chinese characters.

Hmmm... this can get tricky depending on the exact encoding you are
using. The ccvs/doc/RCSFILES file has this to say about non-ISO-8859
character encodings:

| Here is a clarification regarding characters versus bytes in certain
| character sets like JIS and Big5:
| 
|     The RCS file format, as described in the rcsfile(5) man page, is
|     actually byte-oriented, not character-oriented, despite hints to
|     the contrary in the man page.  This distinction is important for
|     multibyte characters.  For example, if a multibyte character
|     contains a `@' byte, the `@' must be doubled within strings in RCS
|     files, since RCS uses `@' bytes as escapes.
| 
|     This point is not an issue for encodings like ISO 8859, which do
|     not have multibyte characters.  Nor is it an issue for encodings
|     like UTF-8 and EUC-JIS, which never uses ASCII bytes within a
|     multibyte character.  It is an issue only for multibyte encodings
|     like JIS and BIG5, which _do_ usurp ASCII bytes.
| 
|     If `@' doubling occurs within a multibyte char, the resulting RCS
|     file is not a properly encoded text file.  Instead, it is a byte
|     stream that does not use a consistent character encoding that can
|     be understood by the usual text tools, since doubling `@' messes
|     up the encoding.  This point affects only programs that examine
|     the RCS files -- it doesn't affect the external RCS interface, as
|     the RCS commands always give you the properly encoded text files
|     and logs (assuming that you always check in properly encoded
|     text).
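
To make the `@' doubling point concrete, here is a minimal Python sketch.
It is not code from RCS or CVS: rcs_escape, rcs_unescape and
first_char_with_at_byte are hypothetical helper names chosen for this
illustration, and the search range (the basic CJK block) is an assumption.
It looks for a character whose Big5 encoding contains the `@' byte (0x40)
and shows that doubling changes the raw byte stream, while the UTF-8
encoding of the same character is left untouched.

```python
def rcs_escape(data: bytes) -> bytes:
    """Double every '@' byte, as RCS does inside its strings (illustrative)."""
    return data.replace(b"@", b"@@")


def rcs_unescape(data: bytes) -> bytes:
    """Undo the doubling for a string produced by rcs_escape."""
    return data.replace(b"@@", b"@")


def first_char_with_at_byte(encoding: str) -> str:
    """Hypothetical helper: find a CJK character whose encoding contains 0x40."""
    for code_point in range(0x4E00, 0xA000):
        ch = chr(code_point)
        try:
            raw = ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # this character does not exist in the target encoding
        if b"@" in raw:
            return ch
    raise LookupError("no character with an '@' byte in " + encoding)


sample = first_char_with_at_byte("big5")
big5 = sample.encode("big5")
utf8 = sample.encode("utf-8")
escaped = rcs_escape(big5)

print("Big5 bytes:     ", big5.hex())
print("after doubling: ", escaped.hex())
print("UTF-8 bytes:    ", utf8.hex(), "(no 0x40 byte, so nothing to double)")
```

The doubled stream is what ends up inside the ,v file, which is why tools
that read the RCS file directly see a broken encoding, while the RCS
commands themselves undo the doubling and return the original bytes.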

However, I suspect that in your case you may also have been affected by
the line-ending conversion between Windows and the UNIX format that RCS
files typically assume.

> I imported them to cvs as text files.

On a UNIX box, that should work okay. On a Windows box, I suspect that
the line-ending conversion could cause problems, for example with a
multi-byte character that has a hex 0x0d byte in it.
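
As a rough illustration of that failure mode, the following Python sketch
models a text-mode round trip on a Windows client. It is a simplified
model, not what CVS actually executes: checkin and checkout are
hypothetical names, and the sample assumes the file is UTF-16 (the thread
does not say which encoding is in use). The two sample characters were
picked because their UTF-16 code units contain the bytes 0x0A and 0x0D.

```python
def checkin(data: bytes) -> bytes:
    """Model of a text-mode commit from Windows: CRLF becomes LF (assumption)."""
    return data.replace(b"\r\n", b"\n")


def checkout(data: bytes) -> bytes:
    """Model of a text-mode checkout on Windows: LF becomes CRLF (assumption)."""
    return data.replace(b"\n", b"\r\n")


# U+4E0A and U+4E0D are common Chinese characters; in UTF-16 their code
# units contain the bytes 0x0A and 0x0D respectively.
text = "\u4e0a\u4e0d\r\n"

original_utf16 = text.encode("utf-16-le")
roundtrip_utf16 = checkout(checkin(original_utf16))

original_utf8 = text.encode("utf-8")
roundtrip_utf8 = checkout(checkin(original_utf8))

print("UTF-16 before:", original_utf16.hex())
print("UTF-16 after: ", roundtrip_utf16.hex())
print("UTF-16 intact:", roundtrip_utf16 == original_utf16)
print("UTF-8 intact: ", roundtrip_utf8 == original_utf8)
```

In this model the UTF-16 file comes back with extra 0x0D bytes inserted in
front of every 0x0A byte, including the one inside a character, so the
rest of the file no longer decodes as the original text. The UTF-8 version
survives because its multi-byte sequences never contain ASCII bytes.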

> Now I check them out, and every file has been changed greatly.
> They are anything but what I want.

Okay.

> Shall I treat them as binary files to get the correct result?

Probably... It may depend on the version of CVS you are actually using.

For the CVS from cvshome.org, you will probably need to use -kb
(binary) mode for those files.

If you are using CVSNT, then you might be able to use -ku to specify
that the file be treated as Unicode so that the file will be checked
in/out in UCS-2 (or UTF-16) encoding and internally stored as UTF-8 by
the server.

        Good luck,
        -- Mark
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (FreeBSD)

iD8DBQFBdb623x41pRYZE/gRAlk8AJwOPHFYK0Y5FyNRikajcp39zHOf6wCgkdJ1
0Aynkhyczp6bvJpHkUsYD+E=
=Y9h4
-----END PGP SIGNATURE-----



