speeding up `wc -m`

From: Bruno Haible
Subject: speeding up `wc -m`
Date: Mon, 21 May 2018 20:00:03 +0200
User-agent: KMail/5.1.3 (Linux/4.4.0-124-generic; KDE/5.18.0; x86_64; ; )

Hi Pádraig,

> $ yes áááááááááááááááááááá | head -n100000 > mbc.txt
> $ yes 12345678901234567890 | head -n100000 > num.txt
> ===== Before ====
> $ time src/wc -m < mbc.txt
> 2100000
> real    0m0.186s
> $ time src/wc -m < num.txt
> 2100000
> real    0m0.056s

Here's my take on improving this. I'm attaching draft patches that have
this effect on the timings:

* On glibc:

             num     mbc
  Before    0.056   0.152
  After     0.057   0.089
  Speedup    1.0     1.7

* On macOS 10.13:

             num     mbc
  Before    0.153   0.229
  After     0.042   0.112
  Speedup    3.6     2.0

Basically, the two problems that the profiling found were:

  * It is pointless to call locale_charset repeatedly, because the
    locale won't change while 'wc' is running.

  * glibc has a slow mbrtowc() implementation for UTF-8 locales.

Both problems can be addressed with the "abstract factory" design pattern.
Namely, instead of using the generic 'wcwidth'/'mbrtowc' function each
time, let the program produce an optimized 'wcwidth'/'mbrtowc' function
[pointer] once, and then call this optimized function pointer repeatedly
for each character.
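
To illustrate the idea (this is only a sketch, not the code from the
attached patches): the factory below inspects a charset name once and
returns a function pointer with the same signature as mbrtowc(). The names
mbrtowc_fn, utf8_mbrtowc and make_mbrtowc are hypothetical. The UTF-8
variant assumes that, on the platform in question, an ASCII byte converts
to the wchar_t of the same value; it short-cuts only that case and
delegates everything else to the generic mbrtowc().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Same signature as the standard mbrtowc().  */
typedef size_t (*mbrtowc_fn) (wchar_t *pwc, const char *s, size_t n,
                              mbstate_t *ps);

/* Hypothetical UTF-8 fast path: a non-NUL ASCII byte seen in the initial
   conversion state is a complete character, so it is returned directly.
   Everything else is delegated to the generic mbrtowc().  */
static size_t
utf8_mbrtowc (wchar_t *pwc, const char *s, size_t n, mbstate_t *ps)
{
  if (s != NULL && n > 0 && ps != NULL && mbsinit (ps))
    {
      unsigned char c = (unsigned char) s[0];
      if (c != 0 && c < 0x80)
        {
          if (pwc != NULL)
            *pwc = (wchar_t) c;
          return 1;
        }
    }
  return mbrtowc (pwc, s, n, ps);
}

/* Hypothetical factory: look at the charset name once and hand back the
   function that should be called for every character afterwards.  */
mbrtowc_fn
make_mbrtowc (const char *charset)
{
  assert (charset != NULL);
  if (strcmp (charset, "UTF-8") == 0)
    return utf8_mbrtowc;
  return mbrtowc;
}
```

A caller would obtain the charset name once at startup (for example from
nl_langinfo (CODESET) after setlocale), call make_mbrtowc once, and then
use the returned pointer inside the per-character loop.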

While at it, let me also do the same for the initialization of an mbstate_t,
because on macOS the mbstate_t is 128 bytes long but only the first 12 bytes
actually matter.
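
The same approach could look roughly like this for the mbstate_t
initialization. Again a sketch with hypothetical names (mbstate_init_fn,
make_mbstate_init, MBSTATE_LIVE_BYTES); the 12-byte figure is simply the
one quoted above for macOS and is not verified here. On every other
platform the sketch falls back to zeroing the whole object.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Pointer to a function that puts an mbstate_t into the initial state.  */
typedef void (*mbstate_init_fn) (mbstate_t *ps);

/* Portable variant: zero the entire object.  */
static void
init_full (mbstate_t *ps)
{
  assert (ps != NULL);
  memset (ps, 0, sizeof *ps);
}

#if defined __APPLE__ && defined __MACH__
/* Assumption taken from the text above: only the first 12 bytes of the
   128-byte mbstate_t matter on macOS.  */
enum { MBSTATE_LIVE_BYTES = 12 };

static void
init_prefix (mbstate_t *ps)
{
  assert (ps != NULL);
  memset (ps, 0, MBSTATE_LIVE_BYTES);
}
#endif

/* Hypothetical factory: choose the initializer once.  */
mbstate_init_fn
make_mbstate_init (void)
{
#if defined __APPLE__ && defined __MACH__
  return init_prefix;
#else
  return init_full;
#endif
}
```

The returned pointer would be fetched once and then called wherever the
program currently clears an mbstate_t with memset.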

This factory of function pointers side-steps the portability problems of

  - When you use these new gnulib modules, you are programming against an API
    that is very similar to POSIX, but not exactly POSIX.
  - The platform-specific #ifs have to be adjusted, with the help of configure
  - mbrtowc-factory needs a unit test (for which I have a draft).

I'm presenting the effect on the profiling in separate mails.


Attachment: gnulib-factories.diff
Description: Text Data

Attachment: coreutils-use-factories.diff
Description: Text Data
