help-bash

Re: [Help-bash] Internal parsing & flow


From: Chet Ramey
Subject: Re: [Help-bash] Internal parsing & flow
Date: Wed, 08 Jan 2014 21:54:44 -0500

On 1/8/14, 12:28 AM, address@hidden wrote:

> The documents I've read indicate the string read from the prompt or file
> is passed through quoting, brace-, tilde-, parameter-, command-,
> arithmetic-expansion processors before being passed to the lex/yacc
> parser. 

This is pretty much the exact opposite of what happens.  You should call BS
on the referenced documents.

> It would seem, based on the rules of the grammar, the input is
> broken into WORDs, and collected into higher-order constructs (like the
> flavors of command).

The basic process is that the shell reads lines of input from the terminal
or a file and breaks those lines into tokens (I use the term `syntactic unit'
later) using shell metacharacters as delimiters.  It classifies those tokens
appropriately (operators, reserved words, words) and continues until an
entire command has been read.  During lexical analysis, quotes are honored
and preserved, allowing them to prevent tokens from being recognized as
reserved words or operators.  The command is then executed.  As part of the
execution process, the words undergo the various word expansions as
appropriate for the command being executed.
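
A small sketch of why the ordering matters, using only standard shell
behavior (the variable names are just for illustration): operators are
recognized when the command is parsed, so a `;' that only appears after
parameter expansion is never treated as a command separator.

```shell
# Operators are recognized by the parser before any expansion happens,
# so a ';' that arrives via parameter expansion is just ordinary text.
cmd='echo one; echo two'
out=$($cmd)          # runs a single echo with the arguments: one; echo two
printf '%s\n' "$out"

# Quoting an operator character keeps the lexer from treating it as one.
quoted=$(echo one ';' echo two)
printf '%s\n' "$quoted"
```

In both cases only one echo runs; the second `echo' is just an argument.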

There is a pretty decent description of the bash lexical analysis and
parsing process in a paper I wrote for the Architecture of Open Source
Applications, available at http://aosabook.org/en/bash.html .

The POSIX standard also contains a description of the basic steps:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_01


> Question 1: The code would seem to indicate the various expansion
> processors are performed on WORD_LISTs (I'm looking at
> subst.c:expand_word_list_internal) -- how did the string read from the
> user/script get broken into WORDs? -- I thought this happened *after* the
> expansion processors in the 'word splitting' phase...

Input is broken into tokens, which are then classified as words, reserved
words, or operators, using shell metacharacters as delimiters.  The tokens
that are not operators or reserved words are WORDs.
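
As a rough illustration of that classification (a sketch, not a view into
the parser itself): the same two characters are a reserved word when
unquoted and an ordinary WORD when quoted.  The subshell with PATH pointed
at a nonexistent directory is only there to make the failed command lookup
predictable; `/nonexistent' is an assumed path.

```shell
# Unquoted, 'if' is classified as a reserved word and starts a
# compound command.
if true; then plain=compound; fi

# Quoted, the same characters are an ordinary WORD, so the shell tries
# to run a command named "if". PATH points at a nonexistent directory
# inside the subshell so the lookup fails predictably (status 127).
status=0
(PATH=/nonexistent; 'if' true) 2>/dev/null || status=$?

echo "plain=$plain status=$status"
```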

> Along those same lines: Question 2: What is the definition of a WORD from
> the lexer's perspective? I've gotten turned around a few times trying to
> find the spec for WORD within the lexer [this may be a function of minimal
> Lex experience]...

Lex experience would not help; bash doesn't use lex.  The lexical analyzer
is hand-written and lives in parse.y alongside the yacc grammar.  A WORD is
basically a syntactic unit that is not an operator or a reserved word.  The
function parse.y:read_token_word() reads and constructs a WORD_DESC.
read_token() calls it after checking for operators.

> Question 3: Where are the flags of a WORD_DESC assigned? My understanding
> is that these flags guide the semantics of later processing, so it is
> imperative these flags be accurate...

At various places during lexical analysis and parsing.  There are some
flags that are set and used only during word expansion.

> Question 4: Is 'word-splitting' done by the lex/yacc parser? If no, where
> is that implemented?

No, not as such.  It's one of the word expansions.  Look at the functions
word_split and list_string in subst.c.
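
Seen from the command line, word splitting applies to the results of
unquoted expansions and is controlled by IFS.  A minimal sketch follows;
count_args is a hypothetical helper defined here, not something bash
provides.

```shell
# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

value='alpha beta  gamma'

# Unquoted: the expansion result is split on IFS whitespace.
unquoted=$(count_args $value)

# Quoted: word splitting is suppressed, so one argument arrives.
quoted=$(count_args "$value")

# Splitting follows IFS, not just whitespace.
path_like='usr:local:bin'
colon_split=$(IFS=:; count_args $path_like)

echo "$unquoted $quoted $colon_split"
```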

The parser breaks the input into tokens using shell metacharacters as
delimiters, as described above.

> Question 5: Is expand_word_list_internal the entry-point to the various
> expansion routines? Where are  escaped characters and quoting done, as
> this function seems to cover the other expansions...

Yes.  expand_word_list_internal calls brace_expand_word_list,
shell_expand_word_list, and glob_expand_word_list to perform the basic set
of word expansions.  Since quote characters are preserved in the words
sent through expansion, and affect word expansion, shell_expand_word_list
handles quoted characters and quote removal.
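
The ordering can be observed from the outside.  The sketch below assumes
mktemp -d is available and uses a hypothetical count_args helper; it shows
that a tilde produced by parameter expansion is not expanded (tilde
expansion has already happened), while a pattern produced by parameter
expansion is still globbed (pathname expansion comes later) unless the
expansion is quoted.

```shell
# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

# Tilde expansion happens before parameter expansion, so a '~' that
# comes out of a variable is left alone.
tilde='~'
from_var=$(echo $tilde)

# Pathname expansion happens after parameter expansion, so a pattern
# held in a variable still matches files unless the expansion is quoted.
dir=$(mktemp -d)
: > "$dir/a.txt"
: > "$dir/b.txt"
pattern="$dir/*.txt"
globbed=$(count_args $pattern)
literal=$(count_args "$pattern")
rm -r "$dir"

echo "$from_var $globbed $literal"
```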

> Question 6: (Related to Q1) If, for example, the input command line is
> 'echo "Hello World"', I'm expecting two WORD (or WORD_DESC) objects -- one
> for 'echo', the other for the de-quoted 'Hello World'. If the original
> line was split on whitespace, there were, at one point then, three WORD
> (or WORD_DESC) objects -- where did [1] and [2] get merged?

They are not merged; they are never split.  The lexical analyzer honors
and preserves quotes.  The words handed to the word expansion code as the
simple command to be expanded are 'echo' and '"Hello World"'.
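
A quick way to check this from the shell, again with a hypothetical
count_args helper: quotes seen by the lexer keep "Hello World" together,
quotes that only show up after an expansion are plain data, and eval makes
the lexer see the text again.

```shell
# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

# The lexer sees the quotes, so "Hello World" is a single word.
direct=$(count_args echo "Hello World")

# Here the quotes come from an expansion; they are data, and word
# splitting breaks the result at the spaces.
line='echo "Hello World"'
resplit=$(count_args $line)

# eval hands the text back to the parser, which honors the quotes.
via_eval=$(eval "count_args $line")

echo "$direct $resplit $via_eval"
```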

Chet

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    address@hidden    http://cnswww.cns.cwru.edu/~chet/


