Re: [Bug-gnulib] addition: linebreak.h, linebreak.c
From: Paul Eggert
Subject: Re: [Bug-gnulib] addition: linebreak.h, linebreak.c
Date: 05 Apr 2003 00:17:41 -0800
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3

Bruno Haible <address@hidden> writes:
> The file defines the functions in 3 variants, for UTF-8, UTF-16 and
> UCS-4 strings; I usually comment out two of them through "#if 0"
> in order to reduce the size of the generated executables.
> Furthermore a fourth variant, which works on multibyte strings
> (implemented via iconv on top of the UTF-8 code). It is the latter
> which most programs use.
> /* Determine number of column positions required for UC. */
> extern int uc_width (unsigned int uc, const char *encoding);
Is UC a Unicode code position?  Perhaps it should be typedefed, both for
clarity and for efficiency on weird hosts?  E.g.:

    typedef int_fast32_t unicode_char;
    int uc_width (unicode_char uc, const char *encoding);

Won't it be faster if we add an extra function that converts ENCODING
to a small integer or a pointer that represents the encoding, and pass
that small integer or pointer to uc_width instead of passing ENCODING?
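
To make the idea concrete, here is a minimal sketch of the two-step
interface.  Every name in it (encoding_id, encoding_lookup,
uc_width_by_id) is hypothetical and not part of the proposed
linebreak.h.  It also assumes, purely for illustration, that the only
thing the width computation needs from ENCODING is whether it is a
legacy CJK encoding; the list of names and the width rules are toy
stand-ins, not real tables.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; none are in the proposed linebreak.h.  */
typedef int_fast32_t unicode_char;

/* Small integer that stands for an encoding.  */
typedef enum { ENC_UNKNOWN, ENC_NON_CJK, ENC_CJK } encoding_id;

/* Convert an encoding name to a small integer.  The caller does this
   once, outside any per-character loop.  The name list is an
   illustrative assumption, not a complete table.  */
static encoding_id
encoding_lookup (const char *encoding)
{
  static const char *const cjk_names[] =
    { "EUC-JP", "EUC-KR", "GB2312", "BIG5" };
  size_t i;

  if (encoding == NULL)
    return ENC_UNKNOWN;
  for (i = 0; i < sizeof cjk_names / sizeof cjk_names[0]; i++)
    if (strcmp (encoding, cjk_names[i]) == 0)
      return ENC_CJK;
  return ENC_NON_CJK;
}

/* Toy width function that takes the precomputed id, so no string
   comparison happens per character.  */
static int
uc_width_by_id (unicode_char uc, encoding_id enc)
{
  if (uc >= 0x4E00 && uc <= 0x9FFF)  /* CJK Unified Ideographs */
    return 2;
  if (uc == 0x00A1)                  /* stand-in for an ambiguous-width char */
    return enc == ENC_CJK ? 2 : 1;
  return 1;
}
```

A caller would invoke encoding_lookup once per string (or once per
program run) and then pass the resulting id to each uc_width_by_id
call.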
> /* Determine number of column positions required for first N units
> (or fewer if S ends before this) in S. */
> extern int u8_width (const unsigned char *s, size_t n, const char *encoding);
> extern int u16_width (const unsigned short *s, size_t n, const char *encoding);
> extern int u32_width (const unsigned int *s, size_t n, const char *encoding);
I was confused by the prefixes u8, u16, and u32. At first I thought
they meant "unsigned integer of width 8 bits", etc. How about
changing the prefixes to utf8, utf16, and ucs4, respectively?
Also, how about replacing

    unsigned char  -> utf8_int
    unsigned short -> utf16_int
    unsigned int   -> ucs4_int

where we have:

    typedef uint_least8_t utf8_int;
    typedef uint_least16_t utf16_int;
    typedef uint_least32_t ucs4_int;

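
Combined with the renamed prefixes, the header fragment might read
roughly as follows.  This is only a sketch: utf8_width, utf16_width
and ucs4_width are the names I am suggesting, not what the proposed
file declares, and the three array typedefs at the end are a
hypothetical compile-time check that I added for illustration.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_least8_t utf8_int;
typedef uint_least16_t utf16_int;
typedef uint_least32_t ucs4_int;

/* Determine number of column positions required for first N units
   (or fewer if S ends before this) in S.  Suggested names only.  */
extern int utf8_width (const utf8_int *s, size_t n, const char *encoding);
extern int utf16_width (const utf16_int *s, size_t n, const char *encoding);
extern int ucs4_width (const ucs4_int *s, size_t n, const char *encoding);

/* Hypothetical compile-time checks: each array type has a negative
   size, and so fails to compile, if the unit type is too narrow.  */
typedef char utf8_int_is_wide_enough
  [sizeof (utf8_int) * CHAR_BIT >= 8 ? 1 : -1];
typedef char utf16_int_is_wide_enough
  [sizeof (utf16_int) * CHAR_BIT >= 16 ? 1 : -1];
typedef char ucs4_int_is_wide_enough
  [sizeof (ucs4_int) * CHAR_BIT >= 32 ? 1 : -1];
```

The "least" types are guaranteed only to be at least as wide as their
names say, so code using them should not assume that a utf16_int is
exactly 16 bits.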
What do the functions do if the input is invalid, e.g. an octet sequence
that is not valid UTF-8?
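
To pin down what I mean by "invalid", here is a sketch of a checker
for UTF-8.  The function name and its return convention are my own
assumptions, not anything in the proposed files; it only illustrates
the kinds of input (stray continuation octets, overlong forms,
truncated sequences, surrogates, values above U+10FFFF) whose handling
the documentation should specify.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the proposed linebreak.h.
   Return the index of the first unit in S[0..N-1] that does not begin
   a well-formed UTF-8 sequence, or N if the whole buffer is valid.  */
static size_t
utf8_valid_prefix (const unsigned char *s, size_t n)
{
  size_t i = 0;

  while (i < n)
    {
      unsigned char c = s[i];
      size_t len, k;
      unsigned long cp, min;

      if (c < 0x80)
        {
          i++;                  /* ASCII */
          continue;
        }
      else if (c >= 0xC2 && c <= 0xDF)
        { len = 2; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0)
        { len = 3; cp = c & 0x0F; min = 0x800; }
      else if (c >= 0xF0 && c <= 0xF4)
        { len = 4; cp = c & 0x07; min = 0x10000; }
      else
        return i;   /* stray continuation octet, 0xC0, 0xC1, or 0xF5..0xFF */

      if (n - i < len)
        return i;               /* sequence truncated by end of buffer */
      for (k = 1; k < len; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return i;           /* missing continuation octet */
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }
      /* Reject overlong forms, values above U+10FFFF, and surrogates.  */
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return i;
      i += len;
    }
  return n;
}
```

Whether the width functions should reject such input, skip it, or
count each bad octet as one column is exactly the open question.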