lilypond-devel

Re: [GLISS] Existing syntax abominations


From: David Kastrup
Subject: Re: [GLISS] Existing syntax abominations
Date: Sat, 22 Sep 2012 06:13:59 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.2.50 (gnu/linux)

Janek Warchoł <address@hidden> writes:

> Hi David, James & all,
>
> On Fri, Sep 21, 2012 at 7:41 PM, James <address@hidden> wrote:
>> About the only part of the thread I could follow [...]
>
> Indeed, David's message was really technical.
> I'll try to "translate" his email.  Apart from checking whether i
> understood everything correctly myself, i hope that this will benefit
> you as well.
>
>
> On Fri, Sep 21, 2012 at 6:46 PM, David Kastrup <address@hidden> wrote:
>> Sometimes it is important to be able to parse some expression without
>> further lookahead,
>
> When Lily reads a .ly file, she sees just a load of characters.  She
> needs to use two subprograms, the parser and the lexer, to understand
> the file (translate the text into meaningful objects, like a NoteHead,
> Stem, Accidental etc.).
> Lexer reads the .ly file letter-by-letter.  Lexer's problem is "is
> this letter a continuation of the previous 'word', or the beginning of
> a new one?"  For example, lexer's job is to divide this:
> cis4\fermata-.
> into this:
> cis  4  \fermata  -.
> (a pitch, a duration, a postevent (postevent is something attached to
> a note, like an articulation), another postevent)

Actually, it is - . rather: even articulation shorthands are recognized
in the parser.

> After that, the parser's job is to group these 'words' into meaningful
> 'sentences'.  For example,
> c4 g \f d8-.
> becomes
> c4
> g \f
> d8-.
> (i.e., all things that go with the pitch - a duration, articulations
> etc - are merged together).

The "merging" is hierarchical: - and . are merged to -., d and 8 are
merged to d8, then d8 and -. are merged to d8-. and so on.  In fact, the
whole input is finally merged into "start_symbol", and then the parser
is done.
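
To make the hierarchy concrete, here is a toy Python sketch (not
LilyPond's code; the node names "articulation", "timed_pitch" and "note"
are invented for the illustration and do not correspond to rules in the
real grammar):

```python
# Toy illustration only: the node names are invented, not LilyPond's grammar.
def merge_note(tokens):
    """Merge the four tokens of one note hierarchically: d 8 - . -> d8-."""
    pitch, duration, dash, dot = tokens
    articulation = ("articulation", dash, dot)      # - and . are merged to -.
    timed_pitch = ("timed_pitch", pitch, duration)  # d and 8 are merged to d8
    return ("note", timed_pitch, articulation)      # d8 and -. are merged to d8-.


def flatten(tree):
    """Recover the source text covered by a merged node."""
    if isinstance(tree, str):
        return tree
    return "".join(flatten(child) for child in tree[1:])


tree = merge_note(["d", "8", "-", "."])
print(flatten(tree))
```

The real parser keeps merging like this until only the start symbol is
left.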

> The problem is that sometimes it's impossible to tell what something
> is without looking at next thing.  For example, when reading this
> \markup " \ bla"
> letter-by-letter, Lily sees
> \  <= a beginning of a command
> m  <= first letter of the command name
> a  <= second letter of the command name
> r   etc.
> k
> u
> p
>    <= whitespace - this means command name ended
> "  <= beginning of a string
>    <= space in the string
> \  <= another character in the string
>
> b
> l
> a
> "  <= end of the string.
>
> That was easy.  Now, take this:

Bad example: as far as the _parser_ is concerned, a string is just a
single entity.  That's one reason quoted strings can contain spaces: the
lexer never passes them as a _kind_ of token by themselves, but it _can_
pass them inside of the _value_ of a token of kind STRING.
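
A minimal Python sketch of that division of labour, assuming a much
simplified lexer (the token kinds COMMAND and WORD and the exact escape
rules are assumptions made for the example, not LilyPond's real lexer
rules):

```python
def lex(text):
    """Split text into (kind, value) tokens; a quoted string is ONE token."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1                  # whitespace separates tokens, it is not one
        elif ch == '"':
            i += 1
            value = []
            while i < len(text) and text[i] != '"':
                if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in '"\\':
                    value.append(text[i + 1])   # \" and \\ are resolved right here
                    i += 2
                else:
                    value.append(text[i])       # spaces, lone backslashes: kept
                    i += 1
            i += 1                  # skip the closing quote
            tokens.append(("STRING", "".join(value)))
        elif ch == "\\":
            j = i + 1
            while j < len(text) and text[j].isalpha():
                j += 1
            tokens.append(("COMMAND", text[i:j]))
            i = j
        else:
            j = i
            while j < len(text) and not text[j].isspace() and text[j] not in '"\\':
                j += 1
            tokens.append(("WORD", text[i:j]))
            i = j
    return tokens


print(lex(r'\markup " \ bla"'))
print(lex(r'\markup " \" bla"'))
```

In both cases the parser receives just two tokens, a command and a
STRING; the backslash handling never leaves the lexer.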

> \markup " \" bla"
> The tricky part is that \" means a double quote char inside the string
> (as opposed to " without a backslash, which means the beginning/end of
> a string).  How should Lily know that the second backslash doesn't
> mean a backslash char inside the string, like in the previous example,
> but rather something special?  That's what we need lookahead for.

No, string recognition is done all in the lexer.  Lookahead is needed in
cases like detecting the end of a music event.  A music event can be all
of the following:

c c'' c''8 c''8-. c''8-.-^  c

How do we recognize when the music event ends?  By taking a look at the
_next_ token and seeing whether we can make it part of the current music
event.  So the decision what the current music event is depends on what
appears next in the input.
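
As a rough Python sketch of this (the pre-split token list and the rule
"anything that is not a pitch still belongs to the current event" are
simplifying assumptions for the example, not the real grammar):

```python
PITCHES = set("abcdefg")


def group_events(tokens):
    """Group tokens into music events using one token of lookahead."""
    events = []
    pos = 0
    while pos < len(tokens):
        event = [tokens[pos]]       # a pitch starts a new event
        pos += 1
        # Peek at the next token: if it is not a pitch, it can still become
        # part of the current event, so absorb it.  A pitch (or the end of
        # input) tells us that the current event is complete.
        while pos < len(tokens) and tokens[pos] not in PITCHES:
            event.append(tokens[pos])
            pos += 1
        events.append("".join(event))
    return events


tokens = ["c", "c", "''", "c", "''", "8", "c", "''", "8", "-", ".",
          "c", "''", "8", "-", ".", "-", "^", "c"]
print(group_events(tokens))
```

Every event is only closed after the token following it has been looked
at.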

Usually, something like { .... } does not require lookahead to form
units since there is a closing delimiter.  Unfortunately, { ... } is not
a complete unit until we have checked that no \addlyrics is trailing
it, which _still_ can become part of the expression.

> Lookahead means that before deciding what current letter in input
> means, we look at the next one.

Not "letter", but "token".

> So, every time Lily sees a backslash inside a string (inside " "), she
> looks at the next letter in input to know whether the backslash is
> just another char or has a special meaning.

The lexer does not really work with a fixed "lookahead" as a rule: it can
read further ahead and back up again in order to make more complex
decisions (we take some pains to avoid this "backing up" for performance
reasons, but it is not an inherent restriction).

The parser making _hierarchical_ decisions according to a syntax has
just a single lookahead token it ever consults.

> Now, that was lookahead at lexer level (when gluing letters into
> words).  If i understand correctly, we have similar lookahead in
> parser, i.e. when we have the words already separated and we analyze
> their meaning.  (unfortunately i don't have any example handy).
>
> I hope that it's now clear what a lookahead is.
>
>> for example because lexer modes need switching.
>
> I'm not sure what lexer modes are, but i suppose that it's about
> different rules in different contexts.  For example, when you're
> inside a string you have to do a lookahead when you encounter a
> backslash, but you don't have to do this when you're not inside
> string.

Strings are internal to the lexer: the parser never gets to see or
influence string start and end.  There are other modes like lyricmode,
markupmode, musicmode, chordmode and so on in which the tokens are being
formed according to different rules.

>> I am just now experimenting with code where the _lexer_ will
>> transparently call music functions in its own parser copy, inserting
>> the result back into LilyPond.
>
> I think that David checked what would happen if the lexer "calculated"
> music functions on its own (using a privately run parser) and just
> inserted the results back into LilyPond.
> It's like changing this (not Lily code here, just math)
> a + 2*3
> into this:
> a + 6
> (i.e. calculating 2*3 and inserting the result in place of the 
> multiplication).

More or less.

>> Now I get the following output for cue-clef-new-line.ly:
>>
>> input/regression/cue-clef-new-line.ly:14:20: error: unknown escaped
>> string: `\vI'
>> \addQuote vIQuote {
>>                     \vI }
>> input/regression/cue-clef-new-line.ly:14:20: error: syntax error,
>> unexpected STRING
>> \addQuote vIQuote {
>>                     \vI }
>>
>> The input is
>>
>> vI = \relative c'' { \clef "treble" \repeat unfold 40 g4 }
>> \addQuote vIQuote { \vI }
>
> LilyPond says "i don't know what a \vI is.  \vI looks like a string,
> and i don't want a string here"

No, it does not look like a string.  The lexer sees \vI, recognizes it
as a command and looks up its meaning.  It has no meaning, so it
complains, and to pass anything at all to the parser, it passes the
thing as a STRING to the parser, in the hope that this backslash might
just have been part of something intended as a word.  It wasn't, and so
the parser is the next one to complain that it has no idea what to do
with a STRING in this context.
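
The two complaints can be imitated with a small Python sketch (function
names, the token kind MUSIC_IDENTIFIER as used here, and the value
stored in the STRING token are assumptions for the illustration; only
the two error texts are taken from the output above):

```python
def lex_command(name, definitions, errors):
    """Look up \\name; if it is undefined, complain and fall back to STRING."""
    if name in definitions:
        return ("MUSIC_IDENTIFIER", definitions[name])
    errors.append("unknown escaped string: `\\" + name + "'")
    return ("STRING", "\\" + name)      # pass at least something to the parser


def parse_braced_item(token, errors):
    """Stand-in for the parser: inside { } it wants music, not a STRING."""
    kind, value = token
    if kind != "MUSIC_IDENTIFIER":
        errors.append("syntax error, unexpected " + kind)
        return None
    return value


errors = []
token = lex_command("vI", {}, errors)   # vI has not been defined (yet)
result = parse_braced_item(token, errors)
print(errors)
```

So one undefined command produces two messages: the first from the
lexer, the second from the parser.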

>> Huh?  Why is \vI undefined at the time \addQuote is called?  Now since
>> \addQuote is called in the lexer in this LilyPond version,
>
> David's experimental change resulted in \addQuote being called and
> "calculated" during lexer phase.  This didn't happen before.

That is not the actual problem.  The problem is that it is being called
while the assignment has not yet been completed.  Previously, music
functions were calculated in the parser, so the parser would have looked
at the next token MUSIC_FUNCTION (for \addQuote) and would have decided
that it does not match ADDLYRICS, then it would have completed the
assignment with MUSIC_FUNCTION as the lookahead, and only _then_ would
have continued with the following music expression.
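
The difference in ordering can be shown with a deliberately tiny Python
model (everything here is invented for the illustration: the braced
music is treated as one ready-made token, and run() stands in for both
lexer and parser):

```python
def run(tokens, call_in_lexer):
    """Log what happens for:  vI = {music} \\addQuote  (toy model)."""
    log = []
    variables = {}
    pos = 0

    def call_add_quote():
        state = "defined" if "vI" in variables else "undefined"
        log.append("addQuote sees vI " + state)

    def next_token():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "\\addQuote" and call_in_lexer:
            call_add_quote()        # experimental: the lexer runs the function
            tok = "MUSIC"           # and hands only its result to the parser
        return tok

    name = next_token()             # vI
    next_token()                    # =
    value = next_token()            # the braced music
    lookahead = next_token()        # required: is an \addlyrics trailing?
    if lookahead != "\\addlyrics":
        variables[name] = value     # only now is the assignment completed
        log.append("assigned " + name)
    if lookahead == "\\addQuote":
        call_add_quote()            # classic: the parser runs the function
    return log


tokens = ["vI", "=", "{music}", "\\addQuote"]
print(run(tokens, call_in_lexer=False))
print(run(tokens, call_in_lexer=True))
```

With the call in the parser, the assignment is logged first; with the
call in the lexer, fetching the lookahead token already runs \addQuote
while vI is still undefined.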

>> it is called when the preceding code is asking for a lookahead token.
>
> If \addQuote was called during lexer phase, it means that it happened
> when a previous element of the code asked the lexer to lookahead to
> check something.
>
>> Why on Earth
>> would the preceding code ask for a lookahead token to finish that
>> assignment?
>
> Preceding code is a closing brace.  Closing brace generally means end
> of music expression.  It's strange that a closing brace says "please
> check what happens after me, because i'm not sure what i'm doing
> here".  The strange thing is: why does the brace ask for this, instead
> of just saying "hey, i closed a music expression now!"?
>
>> Calling lilypond with -ddebug-parser tells us:
>>
>> Entering state 55
>> Reducing stack by rule 134 (line 1007):
>>    $1 = nterm braced_music_list (: )
>> -> $$ = nterm sequential_music (: )
>> Stack now 0 2 6 168 296
>> Entering state 57
>> Reducing stack by rule 157 (line 1093):
>>    $1 = nterm sequential_music (: )
>> -> $$ = nterm grouped_music_list (: )
>> Stack now 0 2 6 168 296
>> Entering state 61
>> Reducing stack by rule 155 (line 1088):
>>    $1 = nterm grouped_music_list (: )
>> -> $$ = nterm music_bare (: )
>> Stack now 0 2 6 168 296
>> Entering state 60
>> Reducing stack by rule 152 (line 1082):
>>    $1 = nterm music_bare (: )
>> -> $$ = nterm composite_music (: )
>> Stack now 0 2 6 168 296
>> Entering state 136
>> Reading a token: Starting parse
>> Entering state 0
>> Reading a token: Next token is token "(music-function-call)" (: #<Music
>> function
>>
>> So here is where we have "composite_music", and the following \addQuote
>> is called prematurely in the search of a lookahead token.  Why?  Let's
>> look at state 136 in the parser:
>>
>> state 136
>>
>>   130 music_assign: composite_music .  ["end of input", error, "\\repeat", 
>> "\\alternative", "\\default", ':', '(', ')', '[', ']', '~', '^', '_', "--", 
>> "__", "\\!", EVENT_IDENTIFIER, E_UNSIGNED, "\\[", "\\]", "\\(", "\\)", 
>> "\\<", "\\>", DURATION_IDENTIFIER, REAL, UNSIGNED, NUMBER_IDENTIFIER, 
>> "\\accepts", "\\alias", "\\book", "\\bookpart", "\\change", "\\chordmode", 
>> "\\chords", "\\consists", "\\context", "\\defaultchild", "\\denies", 
>> "\\description", "\\drummode", "\\drums", "\\figuremode", "\\figures", 
>> "\\header", "\\version-error", "\\layout", "\\lyricmode", "\\lyrics", 
>> "\\lyricsto", "\\markup", "\\markuplist", "\\midi", "\\name", "\\notemode", 
>> "\\override", "\\paper", "\\remove", "\\revert", "\\score", "\\sequential", 
>> "\\set", "\\simultaneous", "\\tempo", "\\type", "\\unset", "\\with", 
>> "\\new", "<", "<<", ">>", "\\", "\\~", FIGURE_OPEN, LYRIC_MARKUP, 
>> MULTI_MEASURE_REST, "(backed-up?)", "(reparsed?)", CHORD_REPETITION, 
>> CONTEXT_MOD_IDENTIFIER, DRUM_PITCH, PITCH_IDENTIFIER, FRACTION, 
>> LYRICS_STRING, LYRIC_MARKUP_IDENTIFIER, MARKUP_IDENTIFIER, 
>> MARKUPLIST_IDENTIFIER, MUSIC_IDENTIFIER, NOTENAME_PITCH, RESTNAME, 
>> SCM_IDENTIFIER, SCM_TOKEN, STRING, STRING_IDENTIFIER, TONICNAME_PITCH, '-', 
>> '{', '}', '|']
>>   227 new_lyrics: . "\\addlyrics" address@hidden composite_music
>>   229           | . new_lyrics "\\addlyrics" address@hidden composite_music
>>   230 re_rhythmed_music: composite_music . new_lyrics
>>
>>     "\\addlyrics"  shift, and go to state 202
>>
>>     $default  reduce using rule 130 (music_assign)
>>
>>     new_lyrics  go to state 203
>>
>>     Conflict between rule 130 and token "\\addlyrics" resolved as shift 
>> (COMPOSITE < "\\addlyrics").
>
> This is very technical and i can only guess what it means based on
> what David writes below.
> Basically i think David provided this for people who can understand it
> - if you cannot, skip it as there is a written explanation.
>
>> Look and behold: after the closing brace of the sequential music, the
>> expression is not finished because LilyPond has to see whether there is
>> an \addlyrics after that, as it would become part of the expression.
>
> When there's an \addlyrics present, the music expression doesn't end
> where it normally does.  That's why there was a lookahead from } ,
> which caused \addQuote to be evaluated before the assignment of the
> music expression to \vI was finished.
>
>> Well, it seems my "stealthy" music function call in the lexer can't work
>> just as stealthily as that since a brace-enclosed music expression is
>> potentially incomplete.  That's actually rather bad news for other
>> potentially mode-switching commands as well.
>
> I'm not sure about mode-switching commands.
> But generally, having to do excessive lookahead is bad.  You prefer to
> know what's happening without looking ahead.

Well, our syntax can't get along without lookahead.  But we have
different modes, like lyrics mode, music mode, markup mode etc in which
tokens are recognized differently.  It is the parser's job to switch
between those modes, and if it makes this decision based on lookahead,
the lookahead is still recognized in the previous mode and can't be
reinterpreted in its "proper" mode.

This is actually the reason I recently made recognition of commands and
strings the same in the various modes: previously line-width was a
single lexical unit in INITIAL mode (which is used inside of context
definitions and output definitions), but was three units, line - width
in most other modes.  Now if you had music interspersed in INITIAL mode,
this might have looked like
{ ... } line-width = ...
and since } needed a lookahead token to be complete, the lookahead
token, still scanned in music mode, would have been just line, and there
would have been no way to get to the single STRING line-width later.

line-width is still being scanned in the wrong mode, but at least for
this STRING, the rules are now similar enough that it does not matter.
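
The old behaviour can be imitated with a short Python sketch (the two
regular expressions are rough stand-ins for the former per-mode word
rules, and next_token() is a made-up helper, not the real lexer
interface):

```python
import re


def next_token(text, pos, mode):
    """Return (token, new_pos); pos must lie before some remaining token.

    How far a word extends depends on the lexer mode.
    """
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if mode == "initial":
        pattern = r"[A-Za-z]+(-[A-Za-z]+)*|."   # hyphens join words
    else:
        pattern = r"[A-Za-z]+|."                # a hyphen is its own token
    match = re.compile(pattern).match(text, pos)
    return match.group(0), match.end()


text = "} line-width = 10"
after_brace = 1                                 # position just behind the }

# The lookahead for } is fetched while the lexer is still in music mode:
print(next_token(text, after_brace, "music"))
# Had the mode switch come first, the same spot would have given:
print(next_token(text, after_brace, "initial"))
```

Once "line" has been delivered as the lookahead token, the parser has no
means of asking for the text to be scanned again under the other rule.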

-- 
David Kastrup


