Here is my proposal of how *I* think a CM system should handle the "encoding
issue" and some related issues. You may have a different opinion, and if
you do it'd be nice to hear it, but no trolling, please.
(See the "Notes" section below for comments regarding each point.)
A) There should be support for both mandatory and optional metadata
attributes associated with each file in the repository.
B) "Content-Type" should be a mandatory metadata string attribute.
C) "Auto-Filter" should be a mandatory metadata boolean attribute.
D) There should be a filter/plugin architecture to enable transcoding of
files on input and output based on their content-types, user settings and
user-provided parameters.
E) Utilities such as "diff", "merge" and "annotate" (aka "blame") should be
provided by plugins mapped to content-types.
F) Commit comments and other string attributes should use UTF-8.
G) Filenames and paths should use UTF-8 in the repository, and be transcoded
to the proper encoding when a client accesses the local file system.
Notes:
A) Some mandatory metadata is already associated with each file. One such
attribute is the name of the file.
B) The MIME Content-Type is defined mainly in RFC 2045 and RFC 2046.
All text/* types may include the "charset" parameter (MIME defines "charset"
as "character encoding" and not just as a simple character set), and if
absent it is assumed to be "us-ascii" (i.e. "ANSI X3.4-1986 as 8 bits/char
with the most significant bit set to 0 (zero)"), as per RFC 2046.
This is a very common and established standard used in many different
systems including, but not limited to, file managers, HTTP and email.
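As a rough illustration of the "charset" default described above, here is a
minimal Python sketch that splits a Content-Type string into media type and
charset using the standard library's email package. The function name
parse_content_type is made up for this example and is not part of any CM
system.

```python
from email.message import Message


def parse_content_type(value):
    """Return (media_type, charset) for a MIME Content-Type string.

    Hypothetical helper: applies the "us-ascii" default to text/* types
    that carry no explicit charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = value
    media_type = msg.get_content_type()   # lower-cased "type/subtype"
    charset = msg.get_content_charset()   # lower-cased, or None if absent
    if charset is None and msg.get_content_maintype() == "text":
        charset = "us-ascii"
    return media_type, charset


print(parse_content_type("text/plain; charset=UTF-8"))
print(parse_content_type("text/x-csrc"))
print(parse_content_type("application/octet-stream"))
```

Non-text types get no charset at all here, which matches the idea that the
parameter is only meaningful for text/* content.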
C) If Auto-Filter is set to "true" then content transcoding will occur
between the repository and the local system. If it is set to "false" then
no transcoding is done.
Each project may have its own default Auto-Filter values for different file
types.
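One possible shape for the per-file metadata and the per-project Auto-Filter
defaults is sketched below in Python. Everything in it (the FileMetadata
class, the PROJECT_DEFAULTS list, the glob-style patterns) is an assumption
made for illustration, not something an existing CM system provides.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase

# Hypothetical project defaults: first matching pattern wins.
PROJECT_DEFAULTS = [
    ("text/*", True),   # transcode text by default
    ("*", False),       # leave everything else untouched
]


def default_auto_filter(content_type, defaults=PROJECT_DEFAULTS):
    """Look up the project's default Auto-Filter value for a content type."""
    for pattern, value in defaults:
        if fnmatchcase(content_type, pattern):
            return value
    return False


@dataclass
class FileMetadata:
    name: str                 # mandatory
    content_type: str         # mandatory, e.g. "text/plain; charset=utf-8"
    auto_filter: bool         # mandatory
    optional: dict = field(default_factory=dict)  # any optional attributes


ctype = "text/plain; charset=iso-8859-1"
meta = FileMetadata("README", ctype, default_auto_filter(ctype))
print(meta)
```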
D) Since editors and other programmers' tools tend to use whatever the local
system encoding happens to be, and a project might include people with
different systems, there needs to be some transcoding of most text files.
The contents of files whose "Auto-Filter" attribute is set to "true" will be
stored UTF-8 encoded with U+2028 newlines in the repository and transcoded
from/to the local encoding and local newlines on input/output. The contents
of files whose "Auto-Filter" attribute is set to "false" will not be
transcoded on input/output.
Often the proper local encoding and line breaks can be detected
automatically, but the user should be able to override the auto-detection
in his settings and/or by a parameter to the CM client.
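The transcoding in note D could look roughly like the following Python
sketch. The function names and parameters are invented for this example, and
the local encoding and newline are simply passed in rather than
auto-detected.

```python
REPO_ENCODING = "utf-8"
REPO_NEWLINE = "\u2028"  # the repository newline proposed above


def to_repository(data, auto_filter, local_encoding, local_newline):
    """Transcode local file bytes to the repository form (hypothetical)."""
    if not auto_filter:
        return data  # Auto-Filter "false": store the bytes untouched
    text = data.decode(local_encoding)
    return text.replace(local_newline, REPO_NEWLINE).encode(REPO_ENCODING)


def from_repository(data, auto_filter, local_encoding, local_newline):
    """Transcode repository bytes back to the local form (hypothetical)."""
    if not auto_filter:
        return data
    text = data.decode(REPO_ENCODING)
    return text.replace(REPO_NEWLINE, local_newline).encode(local_encoding)


local = "caf\u00e9\r\nbar\r\n".encode("latin-1")
stored = to_repository(local, True, "latin-1", "\r\n")
print(stored)
print(from_repository(stored, True, "latin-1", "\r\n") == local)
```

A real filter would also have to decide what to do with characters that the
local encoding cannot represent; this sketch just lets the encode call raise
an error.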
E) E.g. if two files with the content-type "application/vnd.sun.xml.writer"
are diffed, the system should use a diff plugin that knows how to interpret
OpenOffice.org Writer documents. If no such plugin is found, it falls back
to the standard diff, which regards the files as byte blobs.
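The plugin lookup with a byte-blob fallback could be sketched like this in
Python. The registry DIFF_PLUGINS and the functions blob_diff, text_diff and
diff are hypothetical names for this example; only a plain-text plugin is
registered, so the Writer type falls through to the fallback.

```python
import difflib


def blob_diff(old, new):
    """Fallback: treat both files as opaque byte blobs."""
    return "identical" if old == new else "blobs differ"


def text_diff(old, new):
    """Example plugin: line-based unified diff of UTF-8 text."""
    a = old.decode("utf-8").splitlines()
    b = new.decode("utf-8").splitlines()
    return "\n".join(difflib.unified_diff(a, b, "old", "new", lineterm=""))


# Hypothetical registry mapping media types to diff plugins.
DIFF_PLUGINS = {"text/plain": text_diff}


def diff(content_type, old, new):
    """Pick the plugin for the media type, ignoring parameters like charset."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    plugin = DIFF_PLUGINS.get(media_type, blob_diff)
    return plugin(old, new)


print(diff("text/plain; charset=utf-8", b"a\n", b"b\n"))
print(diff("application/vnd.sun.xml.writer", b"\x00\x01", b"\x00\x02"))
```

"merge" and "annotate" could use the same kind of registry, one per
operation.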
F) UTF-8 should be used for communication between the client and the server.
Internally the server might store the strings in any encoding it wants in
the repository, but I'd recommend keeping them UTF-8 encoded for simplicity
and consistency.
G) Each character in a file name/path that cannot be transcoded to the
target file system encoding should be replaced with the character sequence
"{uN}" where N is the hexadecimal Unicode code point (e.g. a file named
"hello<>world" would be named "hello{u3C}{u3E}world" on Windows). This
results in the limitation that filenames must not contain a character
sequence matched by the regexp pattern "\{u[0-9A-Fa-f]+\}".
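A minimal Python sketch of the "{uN}" escaping follows. The function names
and the "forbidden" parameter (characters the target file system rejects even
though its encoding can represent them, such as "<" and ">" on Windows) are
assumptions made for this example.

```python
import re

ESCAPE_RE = re.compile(r"\{u([0-9A-Fa-f]+)\}")


def escape_filename(name, encoding, forbidden=""):
    """Replace characters the target file system cannot hold with {uN}."""
    out = []
    for ch in name:
        try:
            ch.encode(encoding)
            ok = ch not in forbidden
        except UnicodeEncodeError:
            ok = False
        out.append(ch if ok else "{u%X}" % ord(ch))
    return "".join(out)


def unescape_filename(name):
    """Reverse escape_filename; relies on the regexp limitation above."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), name)


windows_forbidden = '<>:"/\\|?*'
escaped = escape_filename("hello<>world", "cp1252", windows_forbidden)
print(escaped)
print(unescape_filename(escaped))
```

The round trip only works because repository filenames are not allowed to
contain anything that already looks like an escape sequence.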
Whenever a filename or path is used in a URI the UTF-8 bytes should be
properly URI-encoded.
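For the URI case, Python's urllib.parse.quote already percent-encodes the
UTF-8 bytes of a string. The wrapper name uri_path below is made up for this
example.

```python
from urllib.parse import quote


def uri_path(path):
    """Percent-encode the UTF-8 bytes of a repository path, keeping "/"."""
    return quote(path, safe="/", encoding="utf-8")


print(uri_path("dir/h\u00e9llo.txt"))
```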
Often the proper local file system encoding can be detected automatically,
but the user should be able to override the auto-detection in his settings
and/or by a parameter to the CM client.
Internally the server might store the strings in any encoding it wants in
the repository, but I'd recommend keeping them UTF-8 encoded for simplicity
and consistency.
Notice that there is no distinction between "text files" and "binary files".
The same system that converts between different text encodings might just
as well be used to convert between different "raw" audio formats. Just add
the appropriate plugin/filter and you're set.
- Marcus Sundman