Re: [Bug-gnulib] addition: linebreak.h, linebreak.c

From: Paul Eggert
Subject: Re: [Bug-gnulib] addition: linebreak.h, linebreak.c
Date: 05 Apr 2003 00:17:41 -0800
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3

Bruno Haible <address@hidden> writes:

> The file defines the functions in 3 variants, for UTF-8, UTF-16 and
> UCS-4 strings; I usually comment out two of them through "#if 0"
> in order to reduce the size of the generated executables.

> Furthermore a fourth variant, which works on multibyte strings
> (implemented via iconv on top of the UTF-8 code). It is the latter
> which most programs use.

> /* Determine number of column positions required for UC. */
> extern int uc_width (unsigned int uc, const char *encoding);

Is UC a Unicode code position?  Perhaps it should be typedefed, both for
clarity and for efficiency on weird hosts?  E.g.:

  typedef int_fast32_t unicode_char;
  int uc_width (unicode_char uc, const char *encoding);

Won't it be faster if we add an extra function that converts ENCODING
to a small integer or a pointer that represents the encoding, and pass
that small integer or pointer to uc_width instead of passing ENCODING?
That way the encoding name would need to be examined only once, not on
every call.
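
A rough sketch of the idea follows.  All names here (enc_id, enc_lookup,
uc_width_id) are hypothetical and not part of the proposed linebreak.h.
The list of encoding names and the toy width rule are assumptions made
only so the example is complete.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical typedef, as suggested above. */
typedef int_fast32_t unicode_char;

/* Hypothetical small-integer representation of an encoding. */
typedef enum { ENC_OTHER = 0, ENC_CJK = 1 } enc_id;

/* Resolve the encoding name once, outside any per-character loop.
   The set of names treated as CJK is an assumption for illustration. */
enc_id enc_lookup (const char *encoding)
{
  static const char *const cjk_names[] =
    { "EUC-JP", "EUC-KR", "GB2312", "BIG5" };
  size_t i;

  assert (encoding != NULL);
  for (i = 0; i < sizeof cjk_names / sizeof cjk_names[0]; i++)
    if (strcmp (encoding, cjk_names[i]) == 0)
      return ENC_CJK;
  return ENC_OTHER;
}

/* Toy width function: NOT the real width tables.  It shows only that
   the per-character call needs an integer compare, not a strcmp.  */
int uc_width_id (unicode_char uc, enc_id enc)
{
  if (uc >= 0x4E00 && uc <= 0x9FFF)   /* CJK Unified Ideographs */
    return 2;
  if (uc == 0x00A1)                   /* one ambiguous-width character */
    return enc == ENC_CJK ? 2 : 1;
  return 1;
}
```

A caller would invoke enc_lookup once per string (or once per program
run) and then pass the resulting enc_id to every uc_width_id call.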

> /* Determine number of column positions required for first N units
>    (or fewer if S ends before this) in S.  */
> extern int u8_width (const unsigned char *s, size_t n, const char *encoding);
> extern int u16_width (const unsigned short *s, size_t n,
>                       const char *encoding);
> extern int u32_width (const unsigned int *s, size_t n, const char *encoding);

I was confused by the prefixes u8, u16, and u32.  At first I thought
they meant "unsigned integer of width 8 bits", etc.  How about
changing the prefixes to utf8, utf16, and ucs4, respectively?

Also, how about replacing

unsigned char  -> utf8_int
unsigned short -> utf16_int
unsigned int   -> ucs4_int

where we have:

typedef uint_least8_t utf8_int;
typedef uint_least16_t utf16_int;
typedef uint_least32_t ucs4_int;

What do the functions do if the input is invalid, e.g. an octet sequence
that is not valid UTF-8?
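
To make the question concrete, here is one possible policy, sketched
with the typedefs proposed above.  The function name utf8_decode_one
and the choice to substitute U+FFFD and skip one unit are assumptions
for illustration; I do not know what the submitted code actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint_least8_t utf8_int;
typedef uint_least32_t ucs4_int;

/* Hypothetical helper: decode one character from S, of which N units
   are available (N > 0).  Store the character in *PUC and return the
   number of units consumed.  On an invalid or truncated sequence,
   store U+FFFD and return 1 so that the caller can resynchronize.  */
size_t utf8_decode_one (const utf8_int *s, size_t n, ucs4_int *puc)
{
  ucs4_int c, min;
  size_t len, i;

  assert (s != NULL && puc != NULL && n > 0);
  c = s[0];
  if (c < 0x80)
    {
      *puc = c;
      return 1;
    }
  else if (c >= 0xC2 && c <= 0xDF)
    { len = 2; min = 0x80; c &= 0x1F; }
  else if (c >= 0xE0 && c <= 0xEF)
    { len = 3; min = 0x800; c &= 0x0F; }
  else if (c >= 0xF0 && c <= 0xF4)
    { len = 4; min = 0x10000; c &= 0x07; }
  else
    goto invalid;   /* stray continuation unit, 0xC0, 0xC1, or > 0xF4 */

  if (len > n)
    goto invalid;   /* sequence truncated by end of input */
  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        goto invalid;
      c = (c << 6) | (s[i] & 0x3F);
    }
  /* Reject overlong forms, surrogates, and values beyond U+10FFFF.  */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    goto invalid;
  *puc = c;
  return len;

 invalid:
  *puc = 0xFFFD;
  return 1;
}
```

Other policies are equally plausible, e.g. returning an error code or
treating each invalid unit as having width 1; whichever is chosen, it
would be good to document it in linebreak.h.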
