help-source-highlight
Re: [Help-source-highlight] Order of definitions in source-highlight 2.1


From: Lorenzo Bettini
Subject: Re: [Help-source-highlight] Order of definitions in source-highlight 2.10
Date: Fri, 05 Sep 2008 01:28:29 +0200
User-agent: Thunderbird 2.0.0.16 (X11/20080724)

address@hidden wrote:
I just upgraded source-highlight to 2.10 and I am noticing some strange behavior.

Suppose we have the file foo.lang:

symbol = "/"
comment start "//"

And the file test.foo:

// foo

The language definition is taken from the source-highlight manual, section 7.4: "Order of definitions". Note that the definitions are in the wrong order, according to the manual: "The first expression will always be matched first, and the second expression will never be matched." And yet:

$ source-highlight --lang-def=foo.lang -c foo.css --no-doc -i test.foo
<!-- Generator: GNU source-highlight 2.10
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
<pre><tt><span class="comment">// foo</span>
</tt></pre>

This was different with version 2.9:

$ source-highlight --lang-def=foo.lang -c foo.css --no-doc -i test.foo
<!-- Generator: GNU source-highlight 2.9
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
<pre><tt><span class="symbol">//</span><span class="normal"> foo</span>
</tt></pre>

What has changed between version 2.9 and 2.10?

Hi there

As I wrote in my previous email, the matching strategy changed between 2.9 and 2.10:

"The strategy used by source-highlight is to select the first rule that matches the longest part of the text with the smallest prefix (i.e., the initial part of the line that contains no language element). (Thus, as already noted in the previous sections, the order of language definitions is crucial.)"

However, while working on the documentation, I realized that this strategy is too involved and somewhat confusing. It also has a lot of overhead, since it tests ALL the rules in a state.
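To make the quoted strategy concrete, here is a minimal Python sketch of an exhaustive selection. It is not the source-highlight implementation: the rule tables, the regular expressions, and the name select_exhaustive are hypothetical, and the tie-breaking order (smallest prefix first, then longest match, then definition order) is one reading of the quoted sentence.

```python
import re

# Hypothetical rule tables: (element name, regular expression), in definition order.
IF_RULES = [
    ("keyword", r"\b(?:if|class)\b"),
    ("function", r"[A-Za-z_]\w*[ \t]*\([^)]*\)"),
]
SLASH_RULES = [
    ("symbol", r"/"),
    ("comment", r"//.*"),
]

def select_exhaustive(rules, line):
    """Test every rule; prefer the smallest prefix, then the longest match.

    Assumption: ties go to the rule defined first, because a later
    candidate must be strictly better to replace the current best.
    """
    best = None
    for name, pattern in rules:
        m = re.search(pattern, line)
        if m is None:
            continue
        # Smaller start = smaller prefix; negated length so longer sorts first.
        key = (m.start(), -len(m.group()))
        if best is None or key < best[0]:
            best = (key, name, m.group())
    return None if best is None else (best[1], best[2])

print(select_exhaustive(IF_RULES, "  if (exp)"))
print(select_exhaustive(SLASH_RULES, "// foo"))
```

Under these assumptions the function rule wins on "  if (exp)" and the comment rule wins on "// foo", even though each is defined second, and every rule is tested on every call.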

Then I realized that the rule to select is basically the one with the smallest prefix, and that we can stop testing rules as soon as we find one that matches and whose prefix (i.e., the part of the string before the match) is empty or contains only spaces. I think this is also the strategy used by standard regular expression engines; at the least, it seems to be enough for programming languages.

Thus, for instance, if I have

i = null;

and I match null as a keyword, its prefix is "i = ", so I should not stop testing other rules; otherwise I would never test the symbol rule (which is defined later).

While, if I have

   if (exp)

as soon as I match "if" as a keyword, since its prefix contains only spaces, I can stop testing other rules. This way, I don't even risk matching "if (exp)" as a function call (note that with the previous strategy the function rule would win, since it matches more characters).

I think this is the right strategy, and it makes the example in the documentation work again as described.
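The two examples above can be replayed with a small Python sketch of the early-stop selection. Again, this is only an illustration under stated assumptions, not the source-highlight code: the rule tables, the patterns, and the name select_early_stop are hypothetical.

```python
import re

# Hypothetical rule tables: (element name, regular expression), in definition order.
RULES = [
    ("keyword", r"\b(?:if|class|null)\b"),
    ("function", r"[A-Za-z_]\w*[ \t]*\([^)]*\)"),
    ("symbol", r"[=;/]"),
]
SLASH_RULES = [
    ("symbol", r"/"),
    ("comment", r"//.*"),
]

def select_early_stop(rules, line):
    """Pick the rule with the smallest prefix, stopping early when possible.

    As soon as a rule matches with an empty or whitespace-only prefix,
    it is returned and the remaining rules are not tested.
    """
    best = None
    for name, pattern in rules:
        m = re.search(pattern, line)
        if m is None:
            continue
        prefix = line[:m.start()]
        if prefix.strip() == "":
            return name, m.group()  # nothing but spaces before the match: stop
        if best is None or m.start() < best[0]:
            best = (m.start(), name, m.group())
    return None if best is None else (best[1], best[2])

print(select_early_stop(RULES, "   if (exp)"))
print(select_early_stop(RULES, "i = null;"))
print(select_early_stop(SLASH_RULES, "// foo"))
```

With these hypothetical rules, "   if (exp)" stops at the keyword rule without ever testing the function rule, "i = null;" keeps going past the keyword match and ends up selecting the symbol "=", and "// foo" selects the symbol "/" because that rule is defined before the comment rule.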

I've uploaded a temporary version that uses this strategy (and, as expected, it also runs faster) here:

http://gdn.dsi.unifi.it/~bettini/source-highlight-2.10.1.tar.gz

I'd really appreciate some feedback; in particular, do you think this new strategy makes sense?

There's also a new test in the tests directory: test_string_stop.lang:

keyword = "if|class"

type = 'int'

comment delim "/*" "*/"

# thus this won't catch "/* */ /" as a regexp,
# since comment elem definition comes first
regexp = '/.*/.*/'

# this won't match if ( ) as a function,
# since keyword elem definition comes first
function = '([[:alpha:]]|_)[[:word:]]*[[:blank:]]*\(*[[:blank:]]*\)'

# the following order is conceptually wrong,
# since "//" won't be highlighted as a comment, but as two symbols
symbol = "/"
comment start "//"

It can be used with the input file test_string_stop.java; the attached output is the one expected with the new strategy.

cheers
        Lorenzo

--
Lorenzo Bettini, PhD in Computer Science, DI, Univ. Torino
ICQ# lbetto, 16080134     (GNU/Linux User # 158233)
HOME: http://www.lorenzobettini.it MUSIC: http://www.purplesucker.com
http://www.myspace.com/supertrouperabba
BLOGS: http://tronprog.blogspot.com  http://longlivemusic.blogspot.com
http://www.gnu.org/software/src-highlite
http://www.gnu.org/software/gengetopt
http://www.gnu.org/software/gengen http://doublecpp.sourceforge.net
/* comment */ final /
/my/regexp/
  if ( ) {
    class;
    myfun ( );
  }
  int i;
  int ( );
// comment? or two symbols?

