bug-gnulib

Re: [PATCH] mbswidth: add new functions to handle tabs


From: Joel E. Denny
Subject: Re: [PATCH] mbswidth: add new functions to handle tabs
Date: Thu, 14 Jan 2010 02:29:07 -0500 (EST)
User-agent: Alpine 1.00 (DEB 882 2007-12-20)

Hi Bruno,

Thanks for your response.

On Thu, 14 Jan 2010, Bruno Haible wrote:

> This functionality is not yet in gnulib. But I don't think it's general
> enough: Today you want to support tabs. Tomorrow you'll want to support
> line numbers and '\v' characters. Next week someone will want to support
> paragraph separator characters.

Those extensions all seem beyond the concept of a "width", so are we 
moving into a new module?

Before I get to your proposal, I want to understand one issue from Paul's 
original code.  Can the byte \t ever occur within a multibyte sequence?  
For that matter, can the byte \n or \r?  That is, is there anything 
incorrect about looking for \t, \n, and \r in a simple loop over bytes and 
then calling mbsnwidth for the bytes in between?
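
For illustration, the byte loop described above might look like the following sketch. It is only valid if the answer to the question is "no", that is, if \t, \n, and \r never occur inside a multibyte sequence in the encodings of interest. The names segment_width and column_after are hypothetical; segment_width is a stand-in that counts one column per byte where a real version would call mbsnwidth, and tab stops every 8 columns are an assumption.

```c
#include <stddef.h>

/* Hypothetical stand-in for gnulib's mbsnwidth: one column per byte.
   A real version would measure the multibyte segment instead.  */
int
segment_width (const char *buf, size_t nbytes)
{
  (void) buf;
  return (int) nbytes;
}

/* Return the column (origin 0) reached after BUF[0..NBYTES-1], given
   that BUF[0] sits at COLUMN.  Scans bytes for \t, \n and \r and
   measures the segments in between.  Tab stops every 8 columns are
   an assumption of this sketch.  */
int
column_after (const char *buf, size_t nbytes, int column)
{
  size_t start = 0;             /* start of the pending segment */

  for (size_t i = 0; i < nbytes; i++)
    {
      char c = buf[i];
      if (c != '\t' && c != '\n' && c != '\r')
        continue;

      /* Account for the ordinary bytes seen since the last control.  */
      column += segment_width (buf + start, i - start);

      if (c == '\t')
        column += 8 - column % 8;   /* advance to the next tab stop */
      else
        column = 0;                 /* \n and \r return to column 0 */

      start = i + 1;
    }

  /* Trailing segment after the last control byte, if any.  */
  return column + segment_width (buf + start, nbytes - start);
}
```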

> Instead of adding more and more variants of mbswidth, I think we should
> make a bigger step and offer a customizable variant. It will take a
> function pointer as argument, that gets passed a control character.
> People will not want to handle many encodings within this function;
> therefore its argument should be a Unicode character. This leads to a
> function like this:

Sounds reasonable.

> > I also considered providing a means to compute line numbers at the same 
> > time.
> 
> With a little change of the interface, it can accommodate this use-case too:
> 
>   /* Compute and store in *COLUMN_P the current column at the end of the
>      given STRING, assuming it starts at the initial value of *COLUMN_P.
>      FUNC handles control characters.  */
>   extern void mbs_update_column (const char *string, int *column_p,
>                                  void (*func) (ucs4_t uc, int *column_p));
> 
> For computing line numbers, one would pass an int(*)[2] as column_p.
>
> The implementation of this function should walk across the string until
> it finds the first non-ASCII control character.

Are you saying that mbs_update_column would basically be mbsnwidth but 
would invoke func whenever the existing iscntrl or iswcntrl invocation 
returns true?  Why do you say non-ASCII?  For example, \n is ASCII 10, and 
mbs_update_column should invoke func upon encountering \n.

I'm now thinking I can use mbs_update_column to help me with issuing 
complaints about NUL bytes.  Other users might wish to issue complaints 
about other unhandled control characters.  However, I'd want to be able to 
pass the current file name to func, and your prototype doesn't give me a 
way to do this.  This prototype is a bit more flexible:

  void mbs_update_column (const char *buf, size_t nbytes,
                          void *data, int *column_p,
                          void (*func) (ucs4_t uc, void *data));

In my case, I'd invoke:

  mbs_update_column (string, n, &loc, &loc->end.column, func);

Someone interested only in columns could invoke:

  mbs_update_column (string, n, &column, &column, func);

For just rows and columns:

  mbs_update_column (string, n, row_col_array, &row_col_array[1], func);
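
To make the rows-and-columns call concrete, here is a minimal sketch of the proposed interface. It is not an implementation proposal: it assumes the input is pure ASCII (no multibyte decoding and no Unicode conversion), treats every non-control byte as one column wide, and uses a hypothetical callback track_row_col with tab stops every 8 columns. The ucs4_t typedef mirrors the one in libunistring's unitypes.h so that the block is self-contained.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;        /* same definition as in unitypes.h */

/* Simplified sketch of the proposed prototype.  Assumes BUF is pure
   ASCII: each control byte is handed to FUNC together with DATA, and
   every other byte advances *COLUMN_P by one.  */
void
mbs_update_column (const char *buf, size_t nbytes,
                   void *data, int *column_p,
                   void (*func) (ucs4_t uc, void *data))
{
  for (size_t i = 0; i < nbytes; i++)
    {
      unsigned char c = (unsigned char) buf[i];
      if (c < 0x20 || c == 0x7f)
        func (c, data);         /* control character: caller decides */
      else
        ++*column_p;            /* ordinary character: width 1 here */
    }
}

/* Hypothetical callback for the rows-and-columns case.  DATA points
   to an int[2] holding { row, column }, both origin 0.  */
void
track_row_col (ucs4_t uc, void *data)
{
  int *row_col = data;

  if (uc == '\n')
    {
      row_col[0]++;             /* next row */
      row_col[1] = 0;           /* back to the first column */
    }
  else if (uc == '\t')
    row_col[1] += 8 - row_col[1] % 8;   /* assumed 8-column tab stops */
  /* Any other control character is treated as zero width.  */
}
```

Because func receives only data, the callback reaches the column through data; in the rows-and-columns call above, column_p and data point into the same array, so both the callback and mbs_update_column update the same column.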

> At this point it converts to Unicode using the u32_conv_from_encoding 
> function, so that it gets a correspondence between multibyte characters 
> and Unicode characters.

I've never used u32_conv_from_encoding.  Should we make the value that 
mbsn_update_column specifies for u32_conv_from_encoding's fromcode 
argument configurable?  Is there a chance we'll encounter unconvertible 
characters?  If so, should we set the iconv_ilseq_handler argument to 
iconveh_error and skip them?  Or should this be configurable?

mbs_update_column doesn't seem like the best name when it's capable of 
handling more than columns.  What about something like mbs_locate and 
mbs_locate_mem?

> > +   BUF[0] is assumed to appear at screen column COLUMN_INIT (origin 1).
> 
> In an API, column numbers should start with 0. Origin-1 column numbers
> can be implemented by adding 1 just before printing the column number.
> Rationale: Half of the editors use origin-0 column numbers and half of
> the software uses origin-1 column numbers. Therefore you need to
> accommodate both conventions.

The GNU coding standards recommend origin 1 for error messages, and I 
didn't realize it was customary to make APIs the opposite.  Of course, 
either origin can be converted to the other, so I can live with either.
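
As a small illustration of the conversion, the sketch below keeps the column origin 0 internally and adds 1 only when formatting the location. The helper name format_location is hypothetical, and the file:line:column layout is only meant to resemble the style the GNU coding standards describe for error messages.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write "FILE:LINE:COLUMN" into OUT, converting
   the origin-0 COLUMN0 to origin 1 at the last moment.  Returns the
   value of snprintf.  */
int
format_location (char *out, size_t outsize,
                 const char *file, int line, int column0)
{
  return snprintf (out, outsize, "%s:%d:%d", file, line, column0 + 1);
}
```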



