Re: bash 5.1 heredoc pipes problematic, shopt needed

From: Greg Wooledge
Subject: Re: bash 5.1 heredoc pipes problematic, shopt needed
Date: Thu, 28 Apr 2022 12:56:35 -0400

On Thu, Apr 28, 2022 at 07:26:19PM +0400, Alexey via Bug reports for the GNU 
Bourne Again SHell wrote:
> Hello.
> I promised you more examples, and here they are:
> Very common case to build a list of files for further processing:
>   declare -a FILES
>   #1
>   FILES=(); time readarray -t FILES <<<"$(find "$d" -xdev -maxdepth 5 -type f)"
>   #2
>   # <<< act as a tmp file (due to result bigger than 64K)
>   FILES=(); time while read -r f; do FILES+=("$f"); done <<<"$(find / -xdev -maxdepth 5 -type f)"
>   #3
>   FILES=(); time while read -r f; do FILES+=("$f"); done < <(find / -xdev -maxdepth 5 -type f)
> From these examples we can see that:
>   - example #1 approximately 2 times faster than example #2, and 4 times
> faster than example #3.
>   - to be more honest, first example should be appended with at least empty
> loop: for f in "${FILES[@]}"; do :; done
>     after such modification example #2 became comparable with example #1

I have a few comments about these examples.

First, when it comes to performance, it should be expected that the
main time bottleneck will be the find command, as it searches through
the file system.  Not the shell reading its results.

Your first two examples both run the find command first, wait for it to
finish, and then dump its results into a temp file for reading by the
shell loop.  The total time required will be the time spent doing the find,
plus the time spent populating the array in the shell.

Your third example is the only one which runs the two processes
simultaneously.  The shell loop will populate the array while the find
command is still running.  In a typical case, I would expect them both
to terminate at about the same moment.
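
The difference in ordering can be made visible with a small sketch.  The
producer function here is a hypothetical stand-in for find (it is not from
the original examples), and the 0.2 second sleep assumes a sleep that
accepts fractional seconds.  Each side appends to a log, so the log order
shows who ran when.

```shell
#!/usr/bin/env bash
# Hypothetical slow producer standing in for find: logs each line as it emits it.
log_cs=$(mktemp)
log_ps=$(mktemp)
trap 'rm -f "$log_cs" "$log_ps"' EXIT

producer() {
  local log=$1 i
  for i in 1 2 3; do
    echo "produced $i" >>"$log"
    echo "$i"
    sleep 0.2
  done
}

# Command substitution + here-string: the producer finishes before the loop starts.
while read -r n; do
  echo "consumed $n" >>"$log_cs"
done <<<"$(producer "$log_cs")"

# Process substitution: the loop reads while the producer is still running.
while read -r n; do
  echo "consumed $n" >>"$log_ps"
done < <(producer "$log_ps")

echo "command substitution:"; cat "$log_cs"
echo "process substitution:"; cat "$log_ps"
```

In the first log all three "produced" lines should come before any
"consumed" line; in the second they should interleave.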

(In your benchmarking, you ran the find command multiple times, which
would have allowed the kernel to read a bunch of file system metadata
into memory.  That artificially lowers the time used by find in your
testing, compared to how such a script would work in real life.)

Second, none of your examples work with arbitrary filenames, which may
contain newline characters.  The solution to that is to use find -print0
and to read the NUL-delimited stream in the shell.

In your first two examples, this is not possible.  The command substitution
will discard the NUL bytes from the stream (with or without a warning,
depending on the bash version).
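
A minimal way to see this, using printf as a stand-in for find -print0
(the two record names are made up for illustration):

```shell
#!/usr/bin/env bash
# Two NUL-terminated records, as find -print0 would emit them.
# Newer bash versions warn about the ignored null byte on stderr;
# the braces let us silence that warning for the demo.
{ joined=$(printf 'a\0b\0'); } 2>/dev/null

# The NULs are gone, so the record boundary is lost.
printf '%s\n' "$joined"    # prints: ab
```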

Your third example can easily be extended to support NULs.  That makes it
the best choice in terms of correctness.
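
As a sketch of that extension, here is the loop from example #3 reading a
NUL-delimited stream.  The temporary directory and the file names are
assumptions made up for the demonstration; one name contains a newline on
purpose.

```shell
#!/usr/bin/env bash
# Scratch directory with one ordinary name and one name containing a newline.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
touch "$dir/plain" "$dir/with"$'\n'"newline"

# Newline-delimited read: the second file is split into two bogus entries.
lines=()
while IFS= read -r f; do
  lines+=("$f")
done < <(find "$dir" -type f)

# NUL-delimited read: one array element per file, whatever the name contains.
files=()
while IFS= read -r -d '' f; do
  files+=("$f")
done < <(find "$dir" -type f -print0)

echo "newline-delimited: ${#lines[@]} entries"
echo "NUL-delimited:     ${#files[@]} entries"
```

IFS= and -r keep leading/trailing whitespace and backslashes in the names
intact.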

Finally, I'm a little bit surprised that you omitted the obvious fourth
example, readarray < <(find).  You've already observed that readarray
is faster than a while read loop (comparing #1 to #2), so why are you
intentionally crippling the process substitution variant (#3) by forcing
it to use the slower loop?

In bash 4.4 or newer, readarray can also take a -d '' option to read
a NUL-delimited stream.  So, unless you're supporting older bashes, the
best choice in terms of correctness *and* speed should be:

  time readarray -d '' files < <(find / -xdev -maxdepth 5 -type f -print0)

(Obviously this depends on GNU/BSD find with its -print0 option, but since
you're using -maxdepth which is *also* nonstandard, -print0 should be
available on your platform.)
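
For anyone who wants to check this without scanning /, a small sketch
against a scratch directory (the directory and file names are invented for
the test, and bash 4.4 or newer is assumed):

```shell
#!/usr/bin/env bash
# Requires bash 4.4+ for readarray -d; the directory and file names are made up.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
touch "$dir/plain" "$dir/two"$'\n'"lines"

# Same shape as the command above, pointed at the scratch directory.
readarray -d '' files < <(find "$dir" -type f -print0)

echo "${#files[@]} entries"
# %q shows each element shell-quoted, so the embedded newline is visible.
printf '<%q>\n' "${files[@]}"
```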

> Also there is a problem that we can't use `mapfile -t <<<"$()"' as
> equivalent to `mapfile -t < <()', because
> here-string appends a newline, so MAPFILE will have one empty element
> instead of no elements in case of empty subshell result.

But the command substitution *removed* the newline first.  In the case
where no filenames contain newlines, you're removing one newline and
adding one newline, so the stream remains unchanged.


unicorn:~$ mapfile -t f < <(find .profile .bashrc -print); echo "${#f[@]}"
2
unicorn:~$ mapfile -t f <<<"$(find .profile .bashrc -print)"; echo "${#f[@]}"
2

Of course, they stop being equivalent when you switch to -print0.
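
Both the newline-delimited equivalence and the empty-output case from the
quoted text can be checked without find.  In this sketch printf and true
stand in for a command that produces two lines and no lines, respectively:

```shell
#!/usr/bin/env bash
# Non-empty output: one trailing newline removed, one added back, so both agree.
mapfile -t a < <(printf 'x\ny\n')
mapfile -t b <<<"$(printf 'x\ny\n')"
echo "non-empty: ${#a[@]} vs ${#b[@]}"    # prints: non-empty: 2 vs 2

# Empty output: the here-string still supplies a newline, i.e. one empty line.
mapfile -t c < <(true)
mapfile -t d <<<"$(true)"
echo "empty: ${#c[@]} vs ${#d[@]}"        # prints: empty: 0 vs 1
```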

>    Bash could do 4096b read() to some internal buffer related to
> file-descriptor and have an emulated lseek()
>    within that buffer.

That would only fix the case where the rest of the input is supposed to
be processed by bash.  It would *not* fix the common case where bash is
reading a little bit of the stream, and then executing another program
to read the remainder of the stream.  For the second program, some of
its data has already been consumed.
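
A sketch of that common case, with made-up input lines.  It works because
bash reads from a pipe without reading ahead past the newline, leaving the
rest for the next program; with a block-sized internal buffer, cat would
find nothing left for input this short.

```shell
#!/usr/bin/env bash
out=$(printf 'header\nrow1\nrow2\n' | {
  IFS= read -r first           # bash consumes exactly one line from the pipe
  echo "bash read: $first"
  cat                          # an external program reads the remainder
})
printf '%s\n' "$out"
# prints:
#   bash read: header
#   row1
#   row2
```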
