[Monotone-devel] sketch of i18n specification


From: graydon hoare
Subject: [Monotone-devel] sketch of i18n specification
Date: 18 Nov 2003 11:58:30 -0500
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2

hi,

I've been digging through the various documents on the matter of
larger names and different character sets, and think I've got a rough
transition plan in mind for i18n support. it only breaks one thing,
and it's very minor (explicit rename certs need to be
reissued). otherwise it's just a bunch of hammering out code. the
following document will, modulo editing and corrections, be README.i18n
in a near-future version of monotone.

please let me know what's wrong with it.

-graydon


monotone internationalization specification
===========================================

0. general terms:

  character set conversion: the process of mapping a string of bytes
  representing wide characters (under the encoding described in the
  LC_CTYPE locale category) to or from an "internal form", which is a
  sequence of bytes in the UTF-8 encoding.

  line ending conversion: the process of converting platform-dependent
  end-of-line UTF-8 codes (0x0D, 0x0A, or the pair 0x0D 0x0A) to or
  from an "internal form", which represents end-of-line using only
  0x0A.
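
  a minimal sketch of the two conversions defined above, in Python. the
  function names are hypothetical and not part of monotone; the external
  charset is passed in explicitly rather than read from LC_CTYPE, and
  the outbound line ending is a parameter rather than a platform lookup.

```python
def charset_to_internal(raw: bytes, external_charset: str) -> bytes:
    """Character set conversion: external encoding -> UTF-8 internal form."""
    return raw.decode(external_charset).encode("utf-8")


def charset_from_internal(internal: bytes, external_charset: str) -> bytes:
    """Character set conversion: UTF-8 internal form -> external encoding."""
    return internal.decode("utf-8").encode(external_charset)


def lineend_to_internal(internal: bytes) -> bytes:
    """Line ending conversion on UTF-8 bytes: CRLF, CR or LF -> LF.

    CRLF is replaced first so that the pair becomes one 0x0A, not two.
    Working on bytes is safe because 0x0D and 0x0A never occur inside a
    UTF-8 multi-byte sequence.
    """
    return internal.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def lineend_from_internal(internal: bytes, line_end: bytes) -> bytes:
    """Line ending conversion: LF -> the requested external line ending."""
    return internal.replace(b"\n", line_end)
```

  in this sketch the line ending step always runs on text that is
  already in UTF-8, matching the wording of the definition.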

  stringprep: RFC 3454, a general framework for mapping, normalizing,
  prohibiting and bidirectionality checking for international names
  prior to use in public network protocols.

  nameprep: RFC 3491, a specific profile of stringprep, used for
  preparing international domain names (IDNs).

  punycode: RFC 3492, an ASCII-compatible encoding (ACE) of unicode,
  used to transmit unicode values as an "unlikely" subset of ASCII, in
  legacy applications which prefer to see ASCII input. not acceptable
  for general binary data, but generally considered acceptable for
  human-consumed names.
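
  to make the punycode definition concrete, here is a small sketch
  using the "punycode" codec that ships with Python. the helper names
  to_ace and from_ace are hypothetical. note that this is the bare RFC
  3492 encoding, without the "xn--" prefix that IDNA adds for host names.

```python
def to_ace(name: str) -> bytes:
    """Encode a unicode name as punycode (an ASCII-only byte string)."""
    return name.encode("punycode")


def from_ace(ace: bytes) -> str:
    """Decode a punycode byte string back to the unicode name."""
    return ace.decode("punycode")


if __name__ == "__main__":
    encoded = to_ace("bücher")
    print(encoded)            # b'bcher-kva'
    print(from_ace(encoded))  # bücher
```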

  
1. filenames:

  - filenames are subject to character set conversion. note that the
    LC_CTYPE locale category may be insufficient to determine the
    native encoding, as some filesystems embed the encoding in the
    filesystem itself. for example, older DOS filesystems have NLS
    tags, and Windows 95 (and newer) use UCS-2 everywhere.

  - filenames are subject to an additional processing stage which
    normalizes platform name semantics, for example changing the
    Windows 0x5C '\' path separator to 0x2F '/'.

  - FIXME: what do we do about case sensitivity on Windows?

  - the internal form of filenames has additional structural
    restrictions:

    - a filename is a sequence of nonempty path components, separated
      by byte 0x2F (ASCII / ), and without a leading or trailing 0x2F

    - a path component is a sequence of any UTF-8 character codes
      except: 
        all codes less than 0x20 (ASCII SPACE)
        0x22 (ASCII " )
        0x2A (ASCII * )
        0x2F (ASCII / )
        0x3A (ASCII : )
        0x3C (ASCII < )
        0x3E (ASCII > )
        0x3F (ASCII ? )
        0x5C (ASCII \ )
        0x7C (ASCII | )
        0x7F (ASCII DEL)

  - manifests are constructed from the internal form (UTF-8). the
    LC_COLLATE category is *not* used to sort manifest entries.
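
  the filename rules above can be sketched as follows. all names here
  are hypothetical. the separator rewrite only shows the Windows case
  mentioned above, and sorting by raw UTF-8 byte value is an assumption:
  the text only says LC_COLLATE is not used, not which order replaces it.

```python
# Byte values that may not appear inside a path component.
FORBIDDEN = set(range(0x20)) | {
    0x22, 0x2A, 0x2F, 0x3A, 0x3C, 0x3E, 0x3F, 0x5C, 0x7C, 0x7F,
}


def normalize_windows_path(path: str) -> bytes:
    """Rewrite the 0x5C separator to 0x2F and return UTF-8 bytes."""
    return path.replace("\\", "/").encode("utf-8")


def is_valid_internal_filename(name: bytes) -> bool:
    """Check the structural restrictions on the internal form."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # An empty component means an empty name, a leading or trailing
    # slash, or two slashes in a row; all of these are rejected.
    return all(
        component and not (FORBIDDEN & set(component))
        for component in name.split(b"/")
    )


def sort_manifest_names(names):
    """Sort by byte value, which does not depend on the locale (assumed order)."""
    return sorted(names)
```

  0x20 itself is not in the forbidden set, so spaces are allowed inside
  path components.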



2. file contents:

  - files are subject to character set conversion if they have a 
    persistent attribute "charconv" set on them, with value "true".

  - if a file has the persistent attribute "charset", its value will
    be used instead of the LC_CTYPE locale setting.

  - files are subject to line ending conversion, in the internal form,
    if they have a persistent attribute "lineconv" set on them, with
    value "true".

  - if a file has the persistent attribute "lineend", its value will
    be used instead of the platform specific line ending value.

  - as an abbreviation, setting the persistent attribute "text" with
    value "true" will enable both character and line ending conversion.

  - file SHA1 values are calculated from the internal form.
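
  as a rough illustration of how the attributes above could combine on
  the way in, here is a sketch in Python. the attribute dictionary, the
  function names and the default_charset parameter (standing in for the
  LC_CTYPE setting) are assumptions. only the inbound direction is
  shown, so "lineend" is not consulted.

```python
import hashlib


def file_to_internal(raw: bytes, attrs: dict, default_charset: str) -> bytes:
    """Apply the conversions that the file's attributes ask for."""
    text_mode = attrs.get("text") == "true"
    data = raw
    if text_mode or attrs.get("charconv") == "true":
        # "charset" overrides the locale-derived default when present.
        charset = attrs.get("charset", default_charset)
        data = data.decode(charset).encode("utf-8")
    if text_mode or attrs.get("lineconv") == "true":
        # CRLF first, so the pair collapses to a single 0x0A.
        data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return data


def file_sha1(raw: bytes, attrs: dict, default_charset: str) -> str:
    """SHA1 of the internal form, as a hex string."""
    return hashlib.sha1(file_to_internal(raw, attrs, default_charset)).hexdigest()
```

  with no attributes set the bytes pass through untouched, so binary
  files hash exactly as they are stored on disk.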


3. UI messages:

  - UI messages are displayed via calls to gettext(). 


4. URLs:

  - URLs are subject to character set conversion and nameprep. each
    component of a URL may also be subject to a different external
    form when interacting with network services:

    - host names: converted to punycode before DNS lookup, as
      described by the IDNA working group.

    - URL in HTTP protocol request: URL-encoded (%xx) for characters
      outside the ASCII characters that RFC 2396 allows unescaped.

    - URL in SMTP / NNTP protocol request: group names and mail
      addresses split at the following UTF-8 delimiters:
 
      SP / %x00-1F / "." / "@" / "+" / "%" / "=" / "/" / "," / ";" / ":"
      / "!" / "(" / ")" / "[" / "]" / "<" / ">"

      transformed to punycode, and re-joined, as described in
      draft-faerber-i18n-email-netnews-names-00.txt
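
  the host name and HTTP cases above can be sketched with the Python
  standard library. the function names are hypothetical. two caveats:
  the "idna" codec implements nameprep plus punycode for host labels,
  and urllib.parse.quote leaves the RFC 3986 unreserved set unescaped,
  which is close to but not identical with the RFC 2396 set. the SMTP /
  NNTP case is not covered here.

```python
from urllib.parse import quote


def host_to_dns(host: str) -> str:
    """Convert a unicode host name to the ASCII form used for DNS lookup."""
    return host.encode("idna").decode("ascii")


def path_to_http(path: str) -> str:
    """Percent-encode a URL path as UTF-8, keeping '/' as the separator."""
    return quote(path, safe="/")


if __name__ == "__main__":
    print(host_to_dns("bücher.example"))  # xn--bcher-kva.example
    print(path_to_http("/dépôt/x y"))     # /d%C3%A9p%C3%B4t/x%20y
```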

5. cert names:

  - subject to character set conversion and nameprep, with the
    additional prohibition on whitespace (stringprep table C.1.1,
    consisting of only UTF-8 code 0x20). cert names are encoded in
    packets as punycode.
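
  a sketch of the cert name pipeline, assuming the name has already
  been through character set conversion and is a Python str. it uses
  nameprep from the standard encodings.idna module and table C.1.1 from
  the standard stringprep module. cert_name_to_packet is a hypothetical
  name, and the real packet format may wrap the result differently.

```python
import stringprep
from encodings import idna


def cert_name_to_packet(name: str) -> bytes:
    """nameprep the name, reject table C.1.1 characters, emit punycode."""
    prepared = idna.nameprep(name)  # raises UnicodeError on prohibited input
    # nameprep itself does not prohibit U+0020, so check it separately.
    if any(stringprep.in_table_c11(ch) for ch in prepared):
        raise ValueError("cert names may not contain whitespace")
    return prepared.encode("punycode")


def cert_name_from_packet(packet: bytes) -> str:
    """Decode the punycode form back to the prepared unicode name."""
    return packet.decode("punycode")
```

  because nameprep case-folds, two names that differ only in case map to
  the same packet form under this sketch.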


6. cert values:

  - subject to character set and line ending conversion unless
    overridden by a hook.


7. key names:

  - same rules as cert names.


8. explicit rename certs:

  - incompatible change: whitespace delimiter changed to UTF-8 code
    0x0A, to permit UTF-8 code 0x20 in filenames.
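
  a sketch of why 0x0A works as the delimiter: the filename rules
  forbid every code below 0x20, so a newline can never occur inside a
  name, while 0x20 can. the layout assumed here (old name, one 0x0A,
  new name) and the function names are hypothetical; the actual rename
  cert value may carry more fields.

```python
def pack_rename(old: bytes, new: bytes) -> bytes:
    """Join two internal-form filenames with a single 0x0A."""
    for name in (old, new):
        if b"\n" in name:
            raise ValueError("filenames may not contain 0x0A")
    return old + b"\n" + new


def unpack_rename(value: bytes) -> tuple:
    """Split a packed value back into (old, new)."""
    old, new = value.split(b"\n")
    return old, new
```

  names containing spaces survive the round trip, which the previous
  whitespace delimiter could not guarantee.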




