bug-lilypond


From: David Kastrup
Subject: Re: Issue 2159 in lilypond: Patch: lexer.ll: Warn about non-UTF-8 characters
Date: Mon, 02 Jan 2012 10:32:31 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.92 (gnu/linux)

Hans Aberg <address@hidden> writes:

> On 1 Jan 2012, at 21:06, David Kastrup wrote:
>
>>>> Updates:
>>>>    Labels: Patch-new
>>>> 
>>>> Comment #2 on issue 2159 by address@hidden: Patch: lexer.ll: Warn
>>>> about non-UTF-8 characters
>>>> http://code.google.com/p/lilypond/issues/detail?id=2159#c2
>>>> 
>>>> lexer.ll: Warn about non-UTF-8 characters
>>>> 
>>>> Making the warnings point to the exact bad byte rather than the
>>>> enclosing construct would be nice.
>>> 
>>> One way to implement this might be to use the Haskell program I made
>>> for Flex-like UTF-8 regular expressions:
>>>  http://xcybercloud.blogspot.com/2009/04/unicode-support-in-flex.html
>>> 
>>> First make rules for the Unicode characters you want to admit,
>>> followed by a '.' rule which picks up single excluded bytes.
>> 
>> The "Unicode characters we want to admit" are not single characters,
>> but part of things like identifiers, strings and other stuff.  Cf.
>> <URL:http://codereview.appspot.com/5505090#msg5>
>> for the reasoning behind the current approach for this patch.
>
> I translate Unicode character classes into Flex UTF-8 regular
> expressions, so you can apply the other Flex regex operators to get
> that stuff.

What makes you think I did not get that?  Did you actually _read_ the
reasoning I linked to above?  You don't get a single error path in that
case, and doing a catchall with . requires _backing_ _up_ in the lexer
for every non-UTF-8 byte sequence that does not already start with an
invalid byte.

We use uncompressed tables in the lexer and make it a point to have _no_
expressions backing up.  So you need to provide expressions matching any
_bad_ UTF-8 sequence even if its first bytes are identical to those of a
good UTF-8 sequence.
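
To illustrate which bad sequences are the problematic ones, here is a
rough Python sketch (not LilyPond's lexer code; the function names are
made up for this example, and the check is simplified in that it ignores
the narrower second-byte ranges after E0, ED, F0 and F4).  It separates
bad bytes that are invalid on their own from bad sequences that begin
like a good one, which are the ones a catchall . rule could only reach by
backing up:

```python
def utf8_expected_length(lead: int) -> int:
    """Length announced by a lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # Lone continuation byte, overlong lead (C0, C1), or F5..FF.
    return 0


def bad_sequences(data: bytes):
    """Yield (offset, bad_bytes, shares_good_prefix) for malformed input.

    shares_good_prefix is True when the bad sequence starts with bytes
    that could also start a good sequence, so a scanner has already read
    past the first byte before it can tell that the input is bad.
    """
    i = 0
    while i < len(data):
        n = utf8_expected_length(data[i])
        if n == 0:
            # Invalid first byte: detectable without reading further.
            yield (i, data[i:i + 1], False)
            i += 1
            continue
        j = i + 1
        # Consume as many continuation bytes as the lead byte announced.
        while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        if j < i + n:
            # Plausible start, but the sequence is cut short.
            yield (i, data[i:j], True)
        i = j


if __name__ == "__main__":
    for item in bad_sequences(b"a\xc3(b\xff"):
        print(item)
```

The entries flagged True are the ones that need their own explicit
expressions if the lexer is to stay free of backup states.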

Please try understanding this problem before suggesting a non-fitting
solution again.  I have spent days doing analysis and trying
alternative approaches, and it is somewhat aggravating if somebody just
goes on assuming that I don't know what I am talking about and that
showing me a simplistic solution often enough will make me realize my
stupidity.

Please run
lex -b
on a flex file of yours that checks for UTF-8 identifiers and check
whether you get any backup states in the resulting lex.backup file.  I
should be quite surprised if you didn't.

-- 
David Kastrup
