Re: Cut from xterm (iso-8859-{2,15}) and paste into buffer
From: Kenichi Handa
Subject: Re: Cut from xterm (iso-8859-{2,15}) and paste into buffer
Date: Mon, 19 Nov 2001 11:47:56 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.1.30 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)
Eli Zaretskii <eliz@is.elta.co.il> writes:
> On Sat, 17 Nov 2001, Karl Eichwalder wrote:
>> > Do you happen to know what exactly does Emacs get as the raw string
>> > from the X selection, before it decodes it?
>>
>> I set it to "raw-text" and Emacs sees:
>>
>> %/1iso8859-15
> And what does it get in the latin-2 case?
As Latin-2 (i.e. ISO 8859-2) is one of the approved charsets in
the spec of Compound-Text, it can't be encoded in the above
format, but has to be encoded by a proper designation
sequence conforming to ISO 2022.
> I don't have the ICCCM spec handy. Do you (or someone else) know if
> what xterm sends is a valid compound-text format?
I'll attach the spec included in X.V11R6.6 here.
---
Ken'ichi HANDA
handa@etl.go.jp
Compound Text Encoding
Version 1.1
X Consortium Standard
X Version 11, Release 6.4
Robert W. Scheifler
Copyright (c) 1989 by X Consortium
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the ``Software''), to deal in the Software
without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to
whom the Software is furnished to do so, subject to the following
conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED ``AS IS'', WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE X CONSORTIUM BE LIABLE FOR
ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Except as contained in this notice, the name of the X
Consortium shall not be used in advertising or otherwise to
promote the sale, use or other dealings in this Software
without prior written authorization from the X Consortium.
1. Overview
Compound Text is a format for multiple character set data, such
as multi-lingual text. The format is based on ISO standards for
encoding and combining character sets. Compound Text is
intended to be used in three main contexts: inter-client
communication using selections, as defined in the Inter-Client
Communication Conventions Manual (ICCCM); window properties
(e.g., window manager hints as defined in the ICCCM);
and resources (e.g., as defined in Xlib and the Xt Intrinsics).
Compound Text is intended as an external representation, or
interchange format, not as an internal representation. It is
expected (but not required) that clients will convert
Compound Text to some internal representation for processing and
rendering, and convert from that internal representation to
Compound Text when providing textual data to another client.
2. Values
The name of this encoding is ``COMPOUND_TEXT''. When text values
are used in the ICCCM-compliant selection mechanism or are stored
as window properties in the server, the type used should be the
atom for ``COMPOUND_TEXT''.
Octet values are represented in this document as two decimal
numbers in the form col/row. This means the value (col * 16) +
row. For example, 02/01 means the value 33.
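The col/row notation can be checked mechanically. The following is a minimal Python sketch, not part of the spec; the helper names `octet` and `col_row` are hypothetical and exist only for illustration.

```python
def octet(col: int, row: int) -> int:
    """Return the octet value for the spec's col/row notation: (col * 16) + row."""
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError("col and row must each be in the range 0..15")
    return col * 16 + row


def col_row(value: int) -> str:
    """Format an octet value back into two-digit decimal col/row notation."""
    if not 0 <= value <= 255:
        raise ValueError("an octet must be in the range 0..255")
    return f"{value >> 4:02d}/{value & 0x0F:02d}"
```

With these helpers, `octet(2, 1)` gives 33 as in the example above, and `col_row(0x1B)` gives the string used for ESC in the rest of the document.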
For our purposes, the octet encoding space is divided into four
ranges:
C0 octets from 00/00 to 01/15
GL octets from 02/00 to 07/15
C1 octets from 08/00 to 09/15
GR octets from 10/00 to 15/15
C0 and C1 are ``control character'' sets, while GL and GR are
``graphic character'' sets. Only a subset of C0 and C1 octets
are used in the encoding, and depending on the character set
encoding defined as GL or GR, a subset of GL and GR octets may be
used; see below for details. All octets (00/00 to 15/15) may
appear inside the text of extended segments (defined below).
[For those familiar with ISO 2022, we will use only an 8-bit
environment, and we will always use G0 for GL and G1 for GR.]
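As an illustration of the four ranges, here is a small Python sketch that classifies a single octet. The function name `octet_range` is an assumption made for this example; the boundaries are the ones listed above, written in hexadecimal.

```python
def octet_range(value: int) -> str:
    """Return which of the four ranges (C0, GL, C1, GR) an octet falls in."""
    if not 0 <= value <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    if value <= 0x1F:   # 00/00 to 01/15
        return "C0"
    if value <= 0x7F:   # 02/00 to 07/15
        return "GL"
    if value <= 0x9F:   # 08/00 to 09/15
        return "C1"
    return "GR"         # 10/00 to 15/15
```

Note that this only names the range; it does not say whether the octet is actually permitted at that point in a Compound Text string.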
3. Control Characters
In C0, only the following values will be used:
00/09 HT HORIZONTAL TABULATION
00/10 NL NEW LINE
01/11 ESC (ESCAPE)
In C1, only the following value will be used:
09/11 CSI CONTROL SEQUENCE INTRODUCER
[The alternate 7-bit CSI encoding 01/11 05/11 is not used in
Compound Text.]
No control sequences are defined in Compound Text for changing
the C0 and C1 sets.
A horizontal tab can be represented with the octet 00/09.
Specification of tabulation width settings is not part of
Compound Text and must be obtained from context (in an unspecified
manner).
[Inclusion of horizontal tab is for consistency with the STRING
type currently defined in the ICCCM.]
A newline (line separator/terminator) can be represented with the
octet 00/10.
[Note that 00/10 is normally LINEFEED, but is being interpreted
as NEWLINE. This can be thought of as using the (deprecated) NEW
LINE mode, E.1.3, in ISO 6429. Use of this value instead of
08/05 (NEL, NEXT LINE) is for consistency with the STRING type
currently defined in the ICCCM.]
The remaining C0 and C1 values (01/11 and 09/11) are only used
in the control sequences defined below.
4. Standard Character Set Encodings
The default GL and GR sets in Compound Text correspond to the
left and right halves of ISO 8859-1 (Latin 1). As such, any
legal instance of a STRING type (as defined in the ICCCM) is also
a legal instance of type COMPOUND_TEXT.
[The implied initial state in ISO 2022 is defined with the
sequence:
01/11 02/00 04/03   G0 and G1 in an 8-bit environment only.
                    Designation also invokes.
01/11 02/00 04/07   In an 8-bit environment, C1 represented as
                    8 bits.
01/11 02/00 04/09   Graphic character sets can be 94 or 96.
01/11 02/00 04/11   8-bit code is used.
01/11 02/08 04/02   Designate ASCII into G0.
01/11 02/13 04/01   Designate right-hand part of ISO Latin-1 into
                    G1. ]
To define one of the approved standard character set encodings
to be the GL set, one of the following control sequences is used:
01/11 02/08 {I} F         94 character set
01/11 02/04 02/08 {I} F   94^N character set
To define one of the approved standard character set encodings
to be the GR set, one of the following control sequences is used:
01/11 02/09 {I} F         94 character set
01/11 02/13 {I} F         96 character set
01/11 02/04 02/09 {I} F   94^N character set
The ``F'' in the control sequences above stands for ``Final
character'', which is always in the range 04/00 to 07/14. The
``{I}'' stands for zero or more ``intermediate characters'',
which are always in the range 02/00 to 02/15, with the
first intermediate character always in the range 02/01 to 02/03.
The registration authority has defined an ``{I} F'' sequence for
each registered character set encoding.
[Final characters for private encodings (in the range 03/00 to
03/15) are not permitted here in Compound Text.]
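To make the designation sequences concrete, here is a hedged Python sketch that assembles one from its parts. The function `designate`, its parameters, and the size labels "94", "96" and "94N" are names invented for this example; only the octet values come from the lists above.

```python
ESC = 0x1B  # 01/11

# Octets that follow ESC, keyed by (target half, set size).
_PREFIX = {
    ("GL", "94"): bytes([0x28]),         # 02/08
    ("GL", "94N"): bytes([0x24, 0x28]),  # 02/04 02/08 (multi-octet set)
    ("GR", "94"): bytes([0x29]),         # 02/09
    ("GR", "96"): bytes([0x2D]),         # 02/13
    ("GR", "94N"): bytes([0x24, 0x29]),  # 02/04 02/09 (multi-octet set)
}


def designate(half: str, size: str, final: int, intermediates: bytes = b"") -> bytes:
    """Build the control sequence that defines a standard set as GL or GR."""
    if (half, size) not in _PREFIX:
        raise ValueError(f"no designation sequence for {size} set in {half}")
    if not 0x40 <= final <= 0x7E:  # F must be in 04/00 to 07/14
        raise ValueError("final character out of range")
    if any(not 0x20 <= i <= 0x2F for i in intermediates):
        raise ValueError("intermediate character out of range")
    if intermediates and not 0x21 <= intermediates[0] <= 0x23:
        raise ValueError("first intermediate must be in 02/01 to 02/03")
    return bytes([ESC]) + _PREFIX[(half, size)] + intermediates + bytes([final])
```

For the Latin-2 case discussed in this thread, `designate("GR", "96", 0x42)` yields the three octets ESC, 02/13, 04/02, which is the kind of proper designation sequence the reply above refers to.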
For GL, octet 02/00 is always defined as SPACE, and octet 07/15
(normally DELETE) is never used. For a 94-character set defined
as GR, octets 10/00 and 15/15 are never used.
[This is consistent with ISO 2022.]
A 94^N character set uses N octets (N > 1) for each character.
The value of N is derived from the column value for F:
column 04 or 05 2 octets
column 06 3 octets
column 07 4 or more octets
In a 94^N encoding, the octet values 02/00 and 07/15 (in GL) and
10/00 and 15/15 (in GR) are never used.
[The column definitions come from ISO 2022.]
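The column rule can be sketched in Python as follows. The function name `min_octets_per_character` is hypothetical, and because column 07 means "4 or more", the sketch returns only the minimum count in that case; that simplification is mine, not the spec's.

```python
def min_octets_per_character(final: int) -> int:
    """Return the (minimum) octets per character implied by the Final character F
    of a multi-octet set designation."""
    if not 0x40 <= final <= 0x7E:  # F is always in 04/00 to 07/14
        raise ValueError("final character out of range")
    column = final >> 4
    if column in (4, 5):
        return 2
    if column == 6:
        return 3
    return 4  # column 07: four or more; the exact count is not given here
```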
Once a GL or GR set has been defined, all further octets in that
range (except within control sequences and extended segments) are
interpreted with respect to that character set encoding,
until the GL or GR set is redefined. GL and GR sets can be
defined independently; they do not have to be defined in pairs.
Note that when actually using a character set encoding as the GR
set, you must force the most significant bit (08/00) of
each octet to be a one, so that it falls in the range 10/00 to
15/15.
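A minimal Python sketch of the two points above: forcing the high bit when a set is used as GR, and prefixing Latin-2 text with its GR designation. The helper names `as_gr` and `latin2_compound_text` are assumptions for illustration, and the sketch relies on Python's `iso8859_2` codec already producing octets with the high bit set for the right half.

```python
GR_LATIN2 = b"\x1b-B"  # 01/11 02/13 04/02: right half of ISO 8859-2 as GR


def as_gr(octets: bytes) -> bytes:
    """Force the most significant bit of each octet so it lands in 10/00..15/15."""
    return bytes(b | 0x80 for b in octets)


def latin2_compound_text(text: str) -> bytes:
    """Encode text as Compound Text with ASCII in GL and Latin-2 in GR."""
    body = text.encode("iso8859_2")
    for b in body:
        is_bad_c0 = b < 0x20 and b not in (0x09, 0x0A)  # only HT and NL allowed
        is_del_or_c1 = 0x7F <= b <= 0x9F                # DELETE and C1 not allowed
        if is_bad_c0 or is_del_or_c1:
            raise ValueError(f"octet {b:#04x} is not allowed in this text")
    return GR_LATIN2 + body
```

`as_gr` is mainly useful for 7-bit code tables such as the two-octet sets, where the registered code points are given in the GL range and have to be shifted when the set is defined as GR.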
[Control sequences to specify character set encoding revisions
(as in section 6.3.13 of ISO 2022) are not used in Compound
Text. Revision indicators do not appear to provide useful
information in the context of Compound Text. The most recent revision
can always be assumed, since revisions are upward compatible.]
5. Approved Standard Encodings
The following are the approved standard encodings to be used with
Compound Text. Note that none have Intermediate characters;
however, a good parser will still deal with
Intermediate characters in the event that additional encodings
are later added to this list.
_________________________________________________________________
{I} F    94/96   Description
_________________________________________________________________
04/02    94      7-bit ASCII graphics (ANSI X3.4-1968),
                 Left half of ISO 8859 sets
04/09    94      Right half of JIS X0201-1976 (reaffirmed
                 1984), 8-Bit Alphanumeric-Katakana Code
04/10    94      Left half of JIS X0201-1976 (reaffirmed
                 1984), 8-Bit Alphanumeric-Katakana Code
04/01    96      Right half of ISO 8859-1, Latin alphabet No. 1
04/02    96      Right half of ISO 8859-2, Latin alphabet No. 2
04/03    96      Right half of ISO 8859-3, Latin alphabet No. 3
04/04    96      Right half of ISO 8859-4, Latin alphabet No. 4
04/06    96      Right half of ISO 8859-7, Latin/Greek alphabet
04/07    96      Right half of ISO 8859-6, Latin/Arabic alphabet
04/08    96      Right half of ISO 8859-8, Latin/Hebrew alphabet
04/12    96      Right half of ISO 8859-5, Latin/Cyrillic alphabet
04/13    96      Right half of ISO 8859-9, Latin alphabet No. 5
04/01    94^2    GB2312-1980, China (PRC) Hanzi
04/02    94^2    JIS X0208-1983, Japanese Graphic Character Set
04/03    94^2    KS C5601-1987, Korean Graphic Character Set
_________________________________________________________________
The sets listed as ``Left half of ...'' should always be
defined as GL. The sets listed as ``Right half of ...'' should
always be defined as GR. Other sets can be defined either as GL
or GR.
6. Non-Standard Character Set Encodings
Character set encodings that are not in the list of approved
standard encodings can be included using ``extended segments''.
An extended segment begins with one of the following sequences:
01/11 02/05 02/15 03/00 M L   variable number of octets per character
01/11 02/05 02/15 03/01 M L   1 octet per character
01/11 02/05 02/15 03/02 M L   2 octets per character
01/11 02/05 02/15 03/03 M L   3 octets per character
01/11 02/05 02/15 03/04 M L   4 octets per character
[This uses the ``other coding system'' of ISO 2022, using
private Final characters.]
The ``M'' and ``L'' octets represent a 14-bit unsigned value
giving the number of octets that appear in the remainder of the
segment. The number is computed as ((M - 128) * 128) + (L - 128).
The most significant bit of M and of L is always set to one. The
remainder of the segment consists of two parts,
the name of the character set encoding and the actual text. The
name of the encoding comes first and is separated from the text
by the octet 00/02 (STX, START OF TEXT). Note that the length
defined by M and L includes the encoding name and separator.
[The encoding of the length is chosen to avoid having zero octets
in Compound Text when possible, because embedded NUL values are
problematic in many C language routines. The use of zero octets
cannot be ruled out entirely however, since some octets in the
actual text of the extended segment may have to be zero.]
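The raw "%/1iso8859-15" string quoted at the top of this message is the visible start of such an extended segment (ESC, M, L and STX are not printable). The following Python sketch builds and parses one under the rules above; the function names are hypothetical and the error handling is deliberately minimal.

```python
EXT_PREFIX = b"\x1b%/"  # 01/11 02/05 02/15
STX = b"\x02"           # 00/02, separates the encoding name from the text


def build_extended_segment(name: str, text: bytes, octets_per_char: int = 1) -> bytes:
    """Build an extended segment; octets_per_char 0 means a variable number."""
    if not 0 <= octets_per_char <= 4:
        raise ValueError("octets_per_char must be 0 (variable) to 4")
    payload = name.encode("latin-1") + STX + text
    length = len(payload)  # the length covers the name, the separator and the text
    if length >= 1 << 14:
        raise ValueError("segment too long for a 14-bit length")
    m = (length >> 7) | 0x80    # high 7 bits, most significant bit forced to one
    low = (length & 0x7F) | 0x80  # low 7 bits, most significant bit forced to one
    return EXT_PREFIX + bytes([0x30 + octets_per_char, m, low]) + payload


def parse_extended_segment(data: bytes):
    """Return (name, text, rest) for an extended segment at the start of data."""
    if len(data) < 6 or data[:3] != EXT_PREFIX or not 0x30 <= data[3] <= 0x34:
        raise ValueError("not an extended segment")
    m, low = data[4], data[5]
    if m < 0x80 or low < 0x80:
        raise ValueError("M and L must have their most significant bit set")
    length = (m - 128) * 128 + (low - 128)
    payload = data[6:6 + length]
    if len(payload) != length:
        raise ValueError("truncated extended segment")
    name, sep, text = payload.partition(STX)
    if not sep:
        raise ValueError("missing STX separator")
    return name.decode("latin-1"), text, data[6 + length:]
```

Splitting on the first STX is safe for the name, since the name is Latin-1 text; the text part is returned untouched because it may contain any octet values.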
The name of the encoding should be registered with the X
Consortium to avoid conflicts and should, when appropriate, match the
CharSet Registry and Encoding registration used in the X Logical
Font Description. The name itself should be encoded using ISO
8859-1 (Latin 1), should not use question mark (03/15)
or asterisk (02/10), and should use hyphen (02/13) only
in accordance with the X Logical Font Description.
Extended segments are not to be used for any character set
encoding that can be constructed from a GL/GR pair of approved
standard encodings. For example, it is incorrect to use an
extended segment for any of the ISO 8859 family of encodings.
It should be noted that the contents of an extended segment are
arbitrary; for example, they may contain octets in the C0 and C1
ranges, including 00/00, and octets comprising a given
character may differ in their most significant bit.
[ISO-registered ``other coding systems'' are not used in Compound
Text; extended segments are the only mechanism for non-2022
encodings.]
7. Directionality
If desired, horizontal text direction can be indicated using the
following control sequences:
09/11 03/01 05/13 begin left-to-right text
09/11 03/02 05/13 begin right-to-left text
09/11 05/13 end of string
[This is a subset of the SDS (START DIRECTED STRING) control in
the Draft Bidirectional Addendum to ISO 6429.]
Directionality can be nested. Logically, a stack of directions
is maintained. Each of the first two control sequences pushes a
new direction on the stack, and the third sequence (revert) pops
a direction from the stack. The stack starts out empty at the
beginning of a Compound Text string. When the stack is empty,
the directionality of the text is unspecified.
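The stack behaviour can be sketched in Python as below. The constant and function names are invented for this example, and treating a revert on an empty stack as an error is my reading rather than something the text states outright.

```python
LTR = b"\x9b1]"    # 09/11 03/01 05/13: begin left-to-right text
RTL = b"\x9b2]"    # 09/11 03/02 05/13: begin right-to-left text
REVERT = b"\x9b]"  # 09/11 05/13: end of string (revert)


def apply_direction(stack: list, sequence: bytes) -> None:
    """Update the direction stack in place for one directionality control sequence."""
    if sequence == LTR:
        stack.append("ltr")
    elif sequence == RTL:
        stack.append("rtl")
    elif sequence == REVERT:
        if not stack:
            raise ValueError("revert with an empty direction stack")
        stack.pop()
    else:
        raise ValueError("not a directionality control sequence")


def current_direction(stack: list):
    """Return the direction in effect, or None when it is unspecified."""
    return stack[-1] if stack else None
```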
Directionality applies to all subsequent text, whether in GL,
GR, or an extended segment. If the desired directionality of
GL, GR, or extended segments differs, then directionality
control sequences must be inserted when switching between them.
Note that definition of GL and GR sets is independent of
directionality; defining a new GL or GR set does not change the
current directionality, and pushing or popping a directionality
does not change the current GL and GR definitions.
Specification of directionality is entirely optional; text
direction should be clear from context in most cases. However, it
must be the case that either all characters in a Compound Text
string have explicitly specified direction or that all characters
have unspecified direction. That is, if directionality control
sequences are used, the first such control sequence must precede
the first graphic character in a Compound Text string, and
graphic characters are not permitted whenever the directionality
stack is empty.
8. Resources
To use Compound Text in a resource, you can simply treat all
octets as if they were ASCII/Latin-1 and just replace all ``\''
octets (05/12) with the two octets ``\\'', all newline octets
(00/10) with the two octets ``\n'', and all zero octets with
the four octets ``\000''. It is up to the client making use of
the resource to interpret the data as Compound Text; the policy
by which this is ascertained is not constrained by the
Compound Text specification.
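A short Python sketch of the resource escaping, assuming the three replacements are: backslash (05/12) becomes two backslashes, newline (00/10) becomes backslash followed by "n", and a zero octet becomes backslash followed by "000". The function name `escape_for_resource` is hypothetical.

```python
def escape_for_resource(data: bytes) -> bytes:
    """Escape Compound Text octets so they can be written as a resource value."""
    out = bytearray()
    for b in data:
        if b == 0x5C:      # backslash -> two backslashes
            out += b"\\\\"
        elif b == 0x0A:    # newline -> backslash, 'n'
            out += b"\\n"
        elif b == 0x00:    # zero octet -> backslash, '0', '0', '0'
            out += b"\\000"
        else:              # every other octet is copied unchanged
            out.append(b)
    return bytes(out)
```

All other octets, including ESC and those in the GR range, pass through as they are, which matches treating the data as if it were ASCII/Latin-1.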
9. Font Names
The following CharSet names for the standard character set
encodings are registered for use in font names under the X
Logical Font Description:
_____________________________________________________________________________________
Name              Encoding Standard                   Description
_____________________________________________________________________________________
ISO8859-1         ISO 8859-1                          Latin alphabet No. 1
ISO8859-2         ISO 8859-2                          Latin alphabet No. 2
ISO8859-3         ISO 8859-3                          Latin alphabet No. 3
ISO8859-4         ISO 8859-4                          Latin alphabet No. 4
ISO8859-5         ISO 8859-5                          Latin/Cyrillic alphabet
ISO8859-6         ISO 8859-6                          Latin/Arabic alphabet
ISO8859-7         ISO 8859-7                          Latin/Greek alphabet
ISO8859-8         ISO 8859-8                          Latin/Hebrew alphabet
ISO8859-9         ISO 8859-9                          Latin alphabet No. 5
JISX0201.1976-0   JIS X0201-1976 (reaffirmed 1984)    8-bit Alphanumeric-Katakana Code
GB2312.1980-0     GB2312-1980, GL encoding            China (PRC) Hanzi
JISX0208.1983-0   JIS X0208-1983, GL encoding         Japanese Graphic Character Set
KSC5601.1987-0    KS C5601-1987, GL encoding          Korean Graphic Character Set
_____________________________________________________________________________________
10. Extensions
There is no absolute requirement for a parser to deal with
anything but the particular encoding syntax defined in this
specification. However, it is possible that Compound Text may be
extended in the future, and as such it may be desirable
to construct the parser to handle 2022/6429 syntax more
generally.
There are two general formats covering all control sequences that
are expected to appear in extensions:
01/11 {I} F
For this format, I is always in the range 02/00 to
02/15, and F is always in the range 03/00 to 07/14.
09/11 {P} {I} F
For this format, P is always in the range 03/00 to
03/15, I is always in the range 02/00 to 02/15, and F
is always in the range 04/00 to 07/14.
In addition, new (singleton) control characters (in the C0 and C1
ranges) might be defined in the future.
Finally, new kinds of ``segments'' might be defined in the future
using syntax similar to extended segments:
01/11 02/05 02/15 F M L
For this format, F is in the range 03/05 to 03/15. M
and L are as defined in extended segments. Such a segment
will always be followed by the number of octets
defined by M and L. These octets can have arbitrary
values and need not follow the internal structure
defined for current extended segments.
If extensions to this specification are defined in the
future, then any string incorporating instances of such
extensions must start with one of the following control sequences:
01/11 02/03 V 03/00 ignoring extensions is OK
01/11 02/03 V 03/01 ignoring extensions is not OK
In either case, V is in the range 02/00 to 02/15 and indicates
the major version minus one of the specification being used.
These version control sequences are for use by clients that
implement earlier versions, but have implemented a general
parser. The first control sequence indicates that it is
acceptable to ignore all extension control sequences; no
mandatory information will be lost in the process. The second
control sequence indicates that it is unacceptable to ignore any
extension control sequences; mandatory information would be
lost in the process. In general, it will be up to the client
generating the Compound Text to decide which control sequence to
use.
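A general parser might check for the version control sequence along these lines. This is a hedged Python sketch: the function name is hypothetical, and reading V as an offset from 02/00 (so that 02/00 means major version 1) is my interpretation of "major version minus one".

```python
def parse_version_prefix(data: bytes):
    """Return (major_version, may_ignore_extensions) if data starts with a
    version control sequence, otherwise None."""
    if len(data) < 4 or data[0] != 0x1B or data[1] != 0x23:  # 01/11 02/03
        return None
    v, final = data[2], data[3]
    if not 0x20 <= v <= 0x2F:        # V is in 02/00 to 02/15
        return None
    if final not in (0x30, 0x31):    # 03/00 or 03/01
        return None
    major_version = (v - 0x20) + 1   # assumed: V encodes the major version minus one
    return major_version, final == 0x30
```

A client that only implements the current syntax could then skip unknown control sequences when the second element is true, and reject the whole string when it is false.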
11. Errors
If a Compound Text string does not match the specification here
(e.g., uses undefined control characters, or undefined
control sequences, or incorrectly formatted extended
segments), it is best to treat the entire string as invalid, except
as indicated by a version control sequence.