Re: string types

bug-gnulib

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: string types

From:	Tim Rühsen
Subject:	Re: string types
Date:	Mon, 6 Jan 2020 17:08:32 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.3.1

Hi Bruno,

On 1/6/20 1:46 PM, Bruno Haible wrote:
>>>>>   - providing primitives for string allocation reduces the amount of 
>>>>> buffer
>>>>>     overflow bugs that otherwise occur in this area. [1]
>>>>> [1] https://lists.gnu.org/archive/html/bug-gnulib/2019-09/msg00031.html
>>> ...
>> We created a 'catch them all' string/buffer type plus API. It is a good
>> compromise for all kinds of situations, works like a memory buffer but
>> is guaranteed 0-terminated, allows custom stack buffers with fallback to
>> heap if to small.
>>
>> https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer.c
>>
>>
>> There also is a sprintf functionality (glibc compatible) using these
>> buffers - and the operation is well faster than glibc's sprintf-like
>> functions for all format strings tested (tested back a few years). The
>> code is also much smaller (380 C code lines), the return values are
>> size_t. It doesn't support float/double.
>>
>> https://gitlab.com/gnuwget/wget2/blob/master/libwget/buffer_printf.c
>>
>> If there is serious interest, I could prepare modules for gnulib.
> 
> It is interesting that your solution does not only cover the simple cases
> (string concatenation, etc.), but also the more complex one, possibly
> with if()s in the generation logic, and all this without blatant potential
> for buffer overflow bugs.
> 
> So, the solution would consists of the following parts:
>   (A) A growable buffer type, with up to N (128 or 1024 or so) bytes on
>       the stack.

Preferable, the initial size and if starting with heap or stack buffer
should be (runtime) configurable.
- initial size because it allows fine-tuning to better avoid reallocations
- initial stack if used as local / temporary buffer
- initial heap when you already know that the resulting string has to
persist the function return

Currently there is are two init functions (I leave away the wget namespace):
int buffer_init(buffer *buf, char *data, size_t size);
buffer *buffer_alloc(size_t size);

buffer_alloc creates a buffer instance on the heap and initializes it
with a heap buffer of size.

buffer_init(buf, data, date_size) initializes 'buf' with the given data
and data_size. data will not be free'd, so stack data can be used here.

buffer_init(buf, NULL, date_size) initializes 'buf' with freshly
allocated heap data of size 'data_size'.

buffer_init(buf, NULL, 0) initializes 'buf' with freshly allocated heap
data of size 128. We could leave this out - it's a currently unused
special case to avoid error handling.

Then there is
int buffer_ensure_capacity(buffer *buf, size_t size);

>   (B) A set of functions for appending to such a growable buffer.

To copy a number of bytes to the beginning (effectively dropping the
previous content):
size_t buffer_memcpy(buffer *buf, const void *data, size_t length);

To append a number of bytes:
size_t buffer_memcat(buffer *buf, const void *data, size_t length);

To copy a string to the beginning (effectively dropping the previous
content):
size_t buffer_strcpy(buffer *buf, const char *s);

To append a string:
size_t buffer_strcat(buffer *buf, const char *s);

To set a number of bytes at the beginning (effectively dropping the
previous content):
size_t buffer_memset(buffer *buf, char c, size_t length);

To append a number of the same bytes:
size_t buffer_memset_append(buffer *buf, char c, size_t length);

>   (C) A function for creating a heap-allocated 'char *' from a growable
>       buffer.

Currently we do:
buffer buf;
buffer_init(&buf, NULL, date_size); // allocate buf.data on heap
... add stuff to buf ...
mydata = buf.data; buf.data = NULL;
buffer_deinit(&buf);

We could make up a (static inline) for this, named
void *buffer_deinit_transfer(buffer *buf);

This function could also call realloc() to shrink 'data' to it's
occupied length.

>   (D) Short-hand functions for the simple cases (like string concatenation).

See above, e.g.
buffer_strcpy(buf, scheme);
buffer_strcat(buf, "://");
buffer_strcat(buf, domain);
buffer_memcat(buf, ":", 1);
buffer_strcat(buf, port_s);
buffer_memcat(buf, "/", 1);
buffer_strcat(buf, path);

But I prefer the slightly slower but better readable form
buffer_printf("%s://%s:%d/%s", scheme, domain, port, path);

Since our printf-like functions directly write into a buffer, there is
no overhead for copying data.

> It would be good to have all these well integrated (in terms of function
> names and calling conventions). So far, in gnulib, we have only pieces of
> it:
>   - Module 'scratch_buffer' is (A) without (B), (C), (D).
>   - Modules 'vasnprintf', 'asprintf' are (B), (C), (D) but without (A).
> 
> Before you start writing the code, it's worth looking at the following
> questions:
>   * Should the module 'scratch_buffer' be reused for (A)? Or is this
>     not possible? If not, can it still have a memory-leak prevention
>     like described in lib/malloc/scratch_buffer.h?

I don't see the advantage of the described memory-leak prevention. On
memory error scratch_buffer_grow() releases all memory, so you can
simply return ? Most functions have a 'cleanup' part anyways, which
needs to be jumped to before return. Why not add a scratch_buffer_free()
there ?

Also, scratch_buffer has IMO these flaws.
(1) The fixed initial size of 1024 on the stack. This should be tunable.
(2) If you need a scratch_buffer itself on the heap, you can't free
those initial 1024 bytes.
(3) The 'expected usage' as described in scratch_buffer.h expects the
developer to explicitly design his functions in a certain way.

>   * What about reusing the complete vasnprintf.c for (B), rather than
>     adding another, limited printf-like implementation?

Yes, would be nice. At least for appending we need an extra
malloc/malloc/memcpy/free.

vasnprintf reallocates RESULTBUF that means we can't use a stack
allocated buffer - thus we lose the performance advantage. Or should we
try snprintf first and fallback to vasnprintf in case of truncation ?

We want another module e.g. buffer-printf to not pull in vasnprintf when
not needing printf-like buffer functions.

Once the buffer module is done, we could think of amending vasnprintf to
better play with the buffer type.

>   * Is it best to implement (D) based on (A), (B), (C), or directly
>     from scratch?

Better from scratch, but the code already exists (with FSF copyright)
and has been used a lot. So it's not really from scratch, though we can
amend/optimize a few things.

Regards, Tim

signature.asc
Description: OpenPGP digital signature

[Prev in Thread]

Current Thread

[Next in Thread]

Re: string types, Tim Rühsen, 2020/01/06
- Re: string types, Bruno Haible, 2020/01/06
  - Re: string types, Tim Rühsen <=
    - Re: string types, Tim Rühsen, 2020/01/06

Prev by Date: [bug #57277] FAIL: test-canonicalize-lgpl with GCC 10
Next by Date: Re: string types
Previous by thread: Re: string types
Next by thread: Re: string types
Index(es):
- Date
- Thread