Re: type errors, command length limits, and Awk
From: Jacob Bachmeyer
Subject: Re: type errors, command length limits, and Awk
Date: Wed, 16 Feb 2022 00:04:40 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.22) Gecko/20090807 MultiZilla/1.8.3.4e SeaMonkey/1.1.17 Mnenhy/0.7.6.0
Mike Frysinger wrote:
On 15 Feb 2022 21:17, Jacob Bachmeyer wrote:
Mike Frysinger wrote:
context: https://bugs.gnu.org/53340
Looking at the highlighted line in the context:
thanks for getting into the weeds with me
You are welcome.
echo "$$py_files" | $(am__pep3147_tweak) | $(am__base_list) | \
It seems that the problem is that am__base_list expects ListOf/File (and
produces ChunkedListOf/File) but am__pep3147_tweak emits ListOf/Glob.
This works in the usual case because the shell implicitly converts Glob
-> ListOf/File and implicitly flattens argument lists, but results in
the overall command line being longer than expected if the globs expand
to more filenames than expected, as described there.
It seems that the proper solution to the problem at hand is to have
am__pep3147_tweak expand globs itself somehow and thus provide
ListOf/File as am__base_list expects.
Do I misunderstand? Is there some other use for xargs?
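The implicit Glob -> ListOf/File conversion described above can be seen in isolation with a minimal shell sketch (the `__pycache__` layout and file names here are hypothetical, chosen only to mirror the pipeline's situation):

```shell
# One glob word becomes several arguments at expansion time.
demo=$(mktemp -d)
mkdir "$demo/__pycache__"
for i in 1 2 3 4 5; do : > "$demo/__pycache__/bar$i.pyc"; done
cd "$demo"
set -- '__pycache__/bar*.pyc'   # quoted: stays a single argument
echo "before expansion: $# argument(s)"
set -- __pycache__/bar*.pyc     # unquoted: the shell expands the glob
echo "after expansion: $# argument(s)"
```

The command line that the later stages see is as long as the expansion, not as long as the glob, which is exactly how the length estimate goes wrong.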
if i did not care about double expansion, this might work. the pipeline
quoted here handles the arguments correctly (other than whitespace splitting
on the initial input, but that's a much bigger task) before passing them to
the rest of the pipeline. so the full context:
echo "$$py_files" | $(am__pep3147_tweak) | $(am__base_list) | \
while read files; do \
$(am__uninstall_files_from_dir) || st=$$?; \
done || exit $$?; \
...
am__uninstall_files_from_dir = { \
test -z "$$files" \
|| { test ! -d "$$dir" && test ! -f "$$dir" && test ! -r "$$dir"; } \
|| { echo " ( cd '$$dir' && rm -f" $$files ")"; \
$(am__cd) "$$dir" && rm -f $$files; }; \
}
leveraging xargs would allow me to maintain a single shell expansion.
the pathological situation being:
bar.py
__pycache__/
bar.pyc
bar*.pyc
bar**.pyc
py_files="bar.py" which turns into "__pycache__/bar*.pyc" by the pipeline,
and then am__uninstall_files_from_dir will expand it when calling `rm -f`.
if the pipeline expanded the glob, it would be:
__pycache__/bar.pyc __pycache__/bar*.pyc __pycache__/bar**.pyc
and then when calling rm, those would expand a 2nd time.
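The double expansion can be reproduced directly; this sketch builds the pathological layout above (literal `*` characters in file names) and counts the arguments after each expansion:

```shell
# Build the pathological __pycache__ layout from the message.
demo=$(mktemp -d)
mkdir "$demo/__pycache__"
cd "$demo"
: > '__pycache__/bar.pyc'
: > '__pycache__/bar*.pyc'
: > '__pycache__/bar**.pyc'
# First expansion, as if the pipeline expanded the glob itself:
set -- __pycache__/bar*.pyc
first=$#
echo "first expansion: $first names"
# Second, unwanted expansion when those names are later used unquoted
# (as in `rm -f $files`): bar*.pyc and bar**.pyc each match again.
set -- $(printf '%s\n' "$@")
second=$#
echo "second expansion: $second names"
```

Three names after the first expansion become seven after the second, because the names that contain `*` match themselves and their siblings again.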
If we know that there will be _exactly_ one additional shell expansion,
why not simply filter the glob results through `sed 's/[?*]/\\&/g'` to
escape potential glob metacharacters before emitting them from
am__pep3147_tweak? (Or is that not portable sed?)
Back to the pseudo-type model I used earlier, the difference between
File and Glob is that Glob contains unescaped glob metacharacters, so
escaping them should solve the problem, no? (Or is there another thorn
nearby?)
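The escaping idea can also be checked in isolation. This uses exactly the sed expression proposed above (a fuller version might escape `[` as well, which also begins a glob construct):

```shell
# Escape glob metacharacters so the one later expansion is a no-op.
name='__pycache__/bar*.pyc'
escaped=$(printf '%s\n' "$name" | sed 's/[?*]/\\&/g')
echo "$escaped"
```

The output is `__pycache__/bar\*.pyc`, which the second expansion turns back into the literal file name rather than a pattern.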
[...]
which at this point i've written `xargs -n40`, but not as fast :p.
Not as fast, yes, but certainly portable! :p
The real question would be if it is faster than simply running rm once
per file. I would guess probably _so_ on MinGW (bash on Windows, where
that logic would use shell builtins but running a new process is
extremely slow) and probably _not_ on an archaic Unix system where
"test" is not a shell builtin so saving the overhead and just running rm
once per file would be faster.
automake jumps through some hoops to try and limit the length of generated
command lines, like deleting output objects in a non-recursive build. it's
not perfect -- it breaks arguments up into 40 at a time (akin to xargs -n40)
and assumes that it won't have 40 paths with long enough names to exceed the
command line length. it also has some logic where it's deleting paths by
globs, but the process to partition the file list into groups of 40 happens
before the glob is expanded, so there are cases where it's 40 globs that can
expand into many many more files and then exceed the command line length.
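The groups-of-40 partitioning Mike describes behaves like `xargs -n`; a small sketch with groups of 3 (so the output stays short) shows the shape of it:

```shell
# Partition a word list into groups of at most 3 (Automake uses 40).
# With no utility named, xargs runs echo, one line per group.
printf '%s\n' a b c d e f g | xargs -n 3
groups=$(printf '%s\n' a b c d e f g | xargs -n 3 | wc -l | tr -d ' ')
echo "groups: $groups"
```

Seven words become three groups; the problem described above is that when a "word" is an unexpanded glob, a group of 40 words can still expand to arbitrarily many file names.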
First, I thought that GNU-ish systems were not supposed to have such
arbitrary limits,
one person's "arbitrary limits" is another person's "too small limit" :).
i'm most familiar with Linux, so i'll focus on that.
[...]
plus, backing up, Automake can't assume Linux. so i think we have to
proceed as if there is a command line limit we need to respect.
So then the answer to my next question is that it is still an issue,
even if the GNU system were to allow arguments up to available memory.
and this issue (the context) originated from Gentoo
GNU/Linux. Is this a more fundamental bug in Gentoo or still an issue
because Automake build scripts are supposed to be portable to foreign
systems that do have those limits?
to be clear, what's failing is an Automake test. it sets the `rm` limit to
an artificially low one. [...]
Gentoo happened to find this error before Automake because Gentoo also found
and fixed a Python 3.5+ problem -- https://bugs.gnu.org/38043. once we fix
that in Automake too, we see this same problem. i'll remove "Gentoo" from
the bug title to avoid further confusion.
The bug still originated on Gentoo, but the fact that the test sets an
artificially low limit is new information to me. In other words,
eliminating the limit is not a solution here: this is specifically
about a feature for working around those limits.
I note that the current version of standards.texi also allows configure
and make rules to use awk(1); could that be useful here instead? (see below)
[...]
i noticed that autoconf uses awk. i haven't dug deeper though to see what
language restrictions are there. GNU awk is obviously out, and POSIX awk
isn't so bad, but do autotools target lower?
In my experience thus far, "lower" than POSIX Awk is almost unusable or
at least very tedious to use. There are some significant convolutions
in that script in DejaGnu to work around the limitations of the
non-POSIX "awk" on Solaris 10 at a point before POSIX awk has been
found. I would recommend directly targeting POSIX Awk, since GNU Awk
has a POSIX mode that inhibits the GNU extensions (`gawk --posix`) to
ease testing, and by the time Automake rules are running, configure
should have already located a POSIX Awk on the system.
If you still want to support very old systems without POSIX Awk at all,
I would consider the simple, slow, but safe approach of running rm once
per file appropriate, unless you simply keep the current version of
am__base_list for that case.
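That slow-but-safe fallback amounts to nothing more than a loop (file names here are hypothetical):

```shell
# One rm invocation per file: slow, but immune to command-line
# length limits and to glob re-expansion (note the quoted "$f").
demo=$(mktemp -d)
cd "$demo"
files='a.pyc b.pyc c.pyc'
for f in $files; do : > "$f"; done
for f in $files; do rm -f "$f"; done
echo "remaining: $(ls | wc -l | tr -d ' ')"
```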
it doesn't quite solve the
problem though as the biggest issue is interacting with the filesystem via
globs and quoting. awk doesn't have a glob(). it has a system() which is
just arbitrary shell code which is what i already have :(.
The main advantage I see awk providing for am__base_list is the
"length()" builtin, so you could both count how many entries have been
placed on the list (using an ordinary awk variable) and keep track of
the overall length of the list itself (using length() on the variable
where you build up the list); something like:
8<------
awk 'BEGIN { maxlen = @maxlen@ ; maxarg = @maxarg@ }
{ for ( i = 1; i <= NF; i++ )
    if ( out != "" && ( length(out) + 1 + length($i) > maxlen ||
                        args >= maxarg ) ) {
      print out ; out = $i ; args = 1
    } else { out = ( out == "" ) ? $i : out " " $i ; args++ }
} END { if ( out != "" ) print out }'
8<------
where @maxlen@ and @maxarg@ would be determined and substituted by
configure. (Feel free to use or adapt that code under GPL3+, by the
way, if it is enough to not simply be inherently public domain.)
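To make the chunking idea concrete, here is a run of an awk program along those lines with maxlen=10 and maxarg=3 substituted by hand in place of the configure-provided @maxlen@/@maxarg@ (a word that would overflow a chunk is held over to start the next one):

```shell
# Chunk a whitespace-separated word list by total length and word count.
chunks=$(printf '%s\n' 'a b c d e' 'ffff gggg' |
  awk 'BEGIN { maxlen = 10 ; maxarg = 3 }
       { for ( i = 1; i <= NF; i++ )
           if ( out != "" && ( length(out) + 1 + length($i) > maxlen ||
                               args >= maxarg ) ) {
             print out ; out = $i ; args = 1
           } else { out = ( out == "" ) ? $i : out " " $i ; args++ }
       } END { if ( out != "" ) print out }')
echo "$chunks"
```

Seven words come out as three chunks, each at most 10 characters and 3 words: `a b c`, `d e ffff`, `gggg`.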
i think if we're at the point where we have to probe the functionality of
tools, i think probing for xargs (or find) is simpler. we can leverage it
if available, otherwise fallback to doing one `rm` per file. i think that
will make it perform well on the vast majority of systems while not breaking
anyone anywhere.
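The probe-and-fallback Mike proposes might look something like this sketch (the probe method and variable names are hypothetical, and `echo` stands in for `rm -f` so the commands are visible rather than executed):

```shell
# Use xargs when a probe found it, otherwise one command per file.
list=$(mktemp)
printf '%s\n' a.pyc b.pyc > "$list"
if command -v xargs >/dev/null 2>&1; then
  result=$(xargs echo rm -f < "$list")     # batched
else
  result=$(while read -r f; do echo rm -f "$f"; done < "$list")  # per-file
fi
echo "$result"
```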
Probing the functionality of tools is why configure exists in the first
place, is it not? :-)
-- Jacob