
Re: type errors, command length limits, and Awk


From: Jacob Bachmeyer
Subject: Re: type errors, command length limits, and Awk
Date: Wed, 16 Feb 2022 00:04:40 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.22) Gecko/20090807 MultiZilla/1.8.3.4e SeaMonkey/1.1.17 Mnenhy/0.7.6.0

Mike Frysinger wrote:
On 15 Feb 2022 21:17, Jacob Bachmeyer wrote:
Mike Frysinger wrote:
context: https://bugs.gnu.org/53340
Looking at the highlighted line in the context:

thanks for getting into the weeds with me

You are welcome.

  echo "$$py_files" | $(am__pep3147_tweak) | $(am__base_list) | \
It seems that the problem is that am__base_list expects ListOf/File (and produces ChunkedListOf/File) but am__pep3147_tweak emits ListOf/Glob. This works in the usual case because the shell implicitly converts Glob -> ListOf/File and implicitly flattens argument lists, but results in the overall command line being longer than expected if the globs expand to more filenames than expected, as described there.

It seems that the proper solution to the problem at hand is to have am__pep3147_tweak expand globs itself somehow and thus provide ListOf/File as am__base_list expects.

Do I misunderstand?  Is there some other use for xargs?

if i did not care about double expansion, this might work.  the pipeline
quoted here handles the arguments correctly (other than whitespace splitting
on the initial input, but that's a much bigger task) before passing them to
the rest of the pipeline.  so the full context:

  echo "$$py_files" | $(am__pep3147_tweak) | $(am__base_list) | \
  while read files; do \
    $(am__uninstall_files_from_dir) || st=$$?; \
  done || exit $$?; \
...
am__uninstall_files_from_dir = { \
  test -z "$$files" \
    || { test ! -d "$$dir" && test ! -f "$$dir" && test ! -r "$$dir"; } \
    || { echo " ( cd '$$dir' && rm -f" $$files ")"; \
         $(am__cd) "$$dir" && rm -f $$files; }; \
  }

leveraging xargs would allow me to maintain a single shell expansion.
the pathological situation being:
  bar.py
  __pycache__/
    bar.pyc
    bar*.pyc
    bar**.pyc

py_files="bar.py", which the pipeline turns into "__pycache__/bar*.pyc",
and then am__uninstall_files_from_dir will expand that glob when calling `rm -f`.

if the pipeline expanded the glob, it would be:
  __pycache__/bar.pyc __pycache__/bar*.pyc __pycache__/bar**.pyc
and then when calling rm, those would expand a 2nd time.

If we know that there will be _exactly_ one additional shell expansion, why not simply filter the glob results through `sed 's/[?*]/\\&/g'` to escape potential glob metacharacters before emitting them from am__pep3147_tweak? (Or is that not portable sed?)

Back to the pseudo-type model I used earlier, the difference between File and Glob is that Glob contains unescaped glob metacharacters, so escaping them should solve the problem, no? (Or is there another thorn nearby?)
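
To make that concrete with the pathological __pycache__ layout above, here is a sketch of what the escaping step would do; this is just the idea, not the actual am__pep3147_tweak code:

8<------
# the directory from the example contains:  bar.pyc  bar*.pyc  bar**.pyc
# exactly one glob expansion happens here, inside the pipeline itself:
printf '%s\n' __pycache__/bar*.pyc | sed 's/[?*]/\\&/g'
# prints the three names with their metacharacters neutralized
# (in whatever order the shell sorts the glob):
#   __pycache__/bar.pyc
#   __pycache__/bar\*.pyc
#   __pycache__/bar\*\*.pyc
# so the second, unquoted expansion in `rm -f $files` matches each name
# literally instead of expanding the globs again.
8<------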

[...]

which at this point i've written `xargs -n40`, but not as fast :p.

Not as fast, yes, but certainly portable!  :p

The real question would be whether it is faster than simply running rm once per file. I would guess probably _so_ on MinGW (bash on Windows, where that logic would use shell builtins but starting a new process is extremely slow) and probably _not_ on an archaic Unix system where "test" is not a shell builtin, so that skipping that overhead and just running rm once per file would be faster.

automake jumps through some hoops to try and limit the length of generated
command lines, like deleting output objects in a non-recursive build.  it's
not perfect -- it breaks arguments up into 40 at a time (akin to xargs -n40)
and assumes that it won't have 40 paths with long enough names to exceed the
command line length.  it also has some logic where it's deleting paths by
globs, but the process to partition the file list into groups of 40 happens
before the glob is expanded, so there are cases where it's 40 globs that can
expand into many many more files and then exceed the command line length.
First, I thought that GNU-ish systems were not supposed to have such arbitrary limits,

one person's "arbitrary limits" is another person's "too small limit" :).
i'm most familiar with Linux, so i'll focus on that.

[...]

plus, backing up, Automake can't assume Linux.  so i think we have to
proceed as if there is a command line limit we need to respect.

So then the answer to my next question is that it is still an issue, even if the GNU system were to allow arguments up to available memory.

and this issue (the context) originated from Gentoo GNU/Linux. Is this a more fundamental bug in Gentoo, or still an issue because Automake build scripts are supposed to be portable to foreign systems that do have those limits?

to be clear, what's failing is an Automake test.  it sets the `rm` limit to
an artificially low one.  [...]

Gentoo happened to find this error before Automake because Gentoo also found
and fixed a Python 3.5+ problem -- https://bugs.gnu.org/38043.  once we fix
that in Automake too, we see this same problem.  i'll remove "Gentoo" from
the bug title to avoid further confusion.

The bug still originated on Gentoo, but that the test is run using an artificially low limit is new information to me. In other words, eliminating the limit is not a solution here: this is specifically about a feature of working around those limits.

I note that the current version of standards.texi also allows configure and make rules to use awk(1); could that be useful here instead? (see below)
[...]
i noticed that autoconf uses awk.  i haven't dug deeper though to see what
language restrictions are there.  GNU awk is obviously out, and POSIX awk
isn't so bad, but do autotools target lower?

In my experience thus far, "lower" than POSIX Awk is almost unusable or at least very tedious to use. There are some significant convolutions in that script in DejaGnu to work around the limitations of the non-POSIX "awk" on Solaris 10 at a point before POSIX awk has been found. I would recommend directly targeting POSIX Awk, since GNU Awk has a POSIX mode that inhibits the GNU extensions (`gawk --posix`) to ease testing, and by the time Automake rules are running, configure should have already located a POSIX Awk on the system.

If you still want to support very old systems without POSIX Awk at all, I would consider the simple, slow, but safe approach of running rm once per file appropriate, unless you simply keep the current version of am__base_list for that case.

it doesn't quite solve the
problem though as the biggest issue is interacting with the filesystem via
globs and quoting.  awk doesn't have a glob().  it has a system() which is
just arbitrary shell code which is what i already have :(.

The main advantage I see awk providing for am__base_list is the "length()" builtin, so you could both count how many entries have been placed on the list (using an ordinary awk variable) and keep track of the overall length of the list itself (using length() on the variable where you build up the list); something like:

8<------
awk 'BEGIN { maxlen = @maxlen@ ; maxarg = @maxarg@ }
{ for ( i = 1; i <= NF; i++ )
    if ( length(out) + length($i) >= maxlen || args >= maxarg ) {
      # flush the current chunk, then start the next chunk with this field
      print out ; out = $i ; args = 1
    } else { out = out " " $i ; args++ }
} END { if ( length(out) ) print out }'
8<------

where @maxlen@ and @maxarg@ would be determined and substituted by configure. (Feel free to use or adapt that code under GPL3+, by the way, if it is even substantial enough not to be inherently public domain.)
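
If something along those lines were adopted, it would slot in where am__base_list sits today; roughly like this, with am__awk_base_list as a hypothetical variable holding the awk program after configure substitutes @maxlen@ and @maxarg@:

8<------
  echo "$$py_files" | $(am__pep3147_tweak) | $(am__awk_base_list) | \
  while read files; do \
    $(am__uninstall_files_from_dir) || st=$$?; \
  done || exit $$?; \
8<------

Running the program under `gawk --posix` while developing it would catch any accidental GNU extensions before it has to face a vendor awk.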

if we're at the point where we have to probe the functionality of tools,
i think probing for xargs (or find) is simpler.  we can leverage it if
available, otherwise fall back to doing one `rm` per file.  i think that
will make it perform well on the vast majority of systems while not breaking
anyone anywhere.

Probing the functionality of tools is why configure exists in the first place, is it not? :-)
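
As a rough illustration of that probe-and-fallback idea (a sketch only, with made-up names: AM_HAVE_XARGS would come from something like AC_CHECK_PROG at configure time, and list.txt stands in for the pipeline output, assumed to be one literal, already-expanded file name per line):

8<------
if test "$AM_HAVE_XARGS" = yes; then
  # xargs builds rm command lines that fit the system limit and execs rm
  # directly, so the names are not re-parsed (or re-globbed) by a shell
  xargs rm -f < list.txt
else
  # fallback: one rm per file -- slow, but cannot exceed any length limit
  while read f; do rm -f "$f"; done < list.txt
fi
8<------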


-- Jacob



