emacs-orgmode
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [O] full parser implementation for tag queries (parentheses, fast he


From: Christopher Genovese
Subject: Re: [O] full parser implementation for tag queries (parentheses, fast heading match, and more)
Date: Sat, 4 Aug 2012 14:32:35 -0400

I have attached an updated version of the files org-tag-query-parse.el
and tag-query-tests.el that I posted about last night, and I have
updated them at http://www.stat.cmu.edu/~genovese/emacs/ as well. I've
also appended an updated version of my original note below for
reference.

The issue had to do with the /!? todo-matches in the queries. The bug in
the 7.8 code that I described in Note h of my original post had to do
with properly detecting when a /!? todo match is present. The
approximate fix I suggested was incorrect; the complete fix was
unnecessarily complicated. I realized all this a few minutes after the
post (d'oh) and have now simplified things accordingly.

The new version unifies the parsing of both kinds of queries. I
eliminated a few functions as a result, and the new org-make-tags-matcher
is even simpler. As a result, todo matching deals with all those edge
cases correctly (like "PROP={^.*//.*$}") for free, also allows the {m,n}
construct in todo regexp matches, just like it does in the tag matches.
(Braces are escaped by doubling, e.g. {^.*B{{1,7}}} as I mentioned last
time.) All the tests pass in a fresh session, both with compiled and
non-compiled code, on 23.2 and 24.1.

I'd appreciate any comments, suggestions, or other feedback you have
on all this.

Thanks for your patience,

  Christopher


=== Updated version of original post follows for reference

I am writing an application layer on top of org that uses the
entry mapping API, but I needed both negation of complex
selections and heading searches. Because the current tag query
parser does not handle parenthesized expressions, it does not
allow negating complex queries. At first, I wrote a workaround
solution that mimics/specializes the mapping API, but that
approach seemed inelegant and harder to maintain.

So instead I implemented a full parser for tag queries with a
number of useful features (see the labeled Notes at the bottom
for further comments on these features):

  1. Parenthesized expressions to arbitrary depth are allowed.
  2. A '-' can be used to negate a parenthesized term.             [Note a]
  3. Regex's in {} can contain braces escaped by doubling: {{ }}.  [Note b]
  4. Supports fast property search on HEADING and PRIORITY.        [Note c]
  5. Handles hyphens in property names properly.                   [Note d,h]
  6. Allows only the proper comparison operators, including ==.    [Note e,h]
  7. Allows spaces around operators and terms for readability.     [Note f]
  8. Matchers use the original _expression_ order; not a big
     deal, but free.
  9. The error messages during parsing are reasonably helpful.
  10. Several bug fixes and a cleaner `org-make-tags-matcher'.     [Note h]

I'm submitting the code for your consideration, with the
goal of eventually incorporating this into org.el. I would
be happy to hear any comments or suggestions you have. As
I'll describe below, this involves relatively minor changes
to two existing functions and adding a few new support
functions. I've attached two files org-tag-query-parse.el
(the code) and tag-query-tests.el (a collection of tests
built on a simple framework). I've also put the files in
http://www.stat.cmu.edu/~genovese/emacs/. The comments in
both files will I hope be helpful.

At the risk of going on too long, I'd like to add a few comments
about the code and tests. First, the two existing functions that
are affected in the code are `org-make-tags-matcher' and
`org-scan-tags'. In the new version of the former, I've unified
both kinds of query parsing, leading to a shorter and cleaner
function.. The new version of the latter differs in only a
couple *very minor* places that capture two values that were
already being computed anyway (see the diff reproduced in the
comments). By the way, I'm working from the 7.8.11 code.

Loading org-tag-query-parse.el does not change the original
functions. Instead, I've added a `-NEW' to the names of these
functions and saved the originals also with a `-ORIGINAL' added.
After loading the file, you can choose a version to try by doing

    (org-tmp-use-tag-parser 'new)
and
    (org-tmp-use-tag-parser 'original)

or do (org-tmp-use-tag-parser) to toggle between versions.
You can also just use the names with suffixes directly.
I'd also suggest byte-compiling the file.

I think the place to start looking at the code is the new
version of `org-make-tags-matcher'. The main workhorse for the
new parser is `org-tag-query-parse', with new function
`org-todo-query-parse' handling the /!? todo parsing
at the appropriate point..

The other substantial piece (in terms of lines of code) is a
utility macro `org-match-cond' that is used throughout and makes
the main parser much more readable IMHO. Admittedly, I went a
bit overboard in optimizing it; the first version worked fine
but this one produces really nice code. I'd suggest ignoring
this code (in section "Parsing utility for readable matchers")
on first pass. The docstring is pretty complete, and its use is
more or less self-explanatory.

To run the tests, load org-tag-query-parse.el and tag-query-tests.el
and do

   (tag-test-run :results) ; use :summary for a brief summary of all runs
   (tag-test-other-tests)  ; miscellaneous other tests, including scanning

or name individual suites. They are at the moment:

   (tag-test-run :results 'org-comparison-1)  ; or use :summary
   (tag-test-run :results 'org-comparison-2)
   (tag-test-run :results 'match-results-1)
   (tag-test-run :results 'match-results-2)
   (tag-test-run :results 'should-error-1)

If you have other ideas for tests or find any bugs, please let me
know. Sorry for the homegrown framework; it just sort of grew and
then I was too tired to rewrite the tests. One complication here
is that the original and new algorithms produce different term
orders and use a few different functions. The function
tag-test-transform transforms original results to the new
algorithms conventions, but it does not handle PRIORITY or
HEADING matches at the moment. Use the tree form of the tess (see
match-results-1 for example) on these. Btw, I've run the tests on
GNU Emacs 23.2 and 24.1 (running on OS X lion).

Notes:
   a. There is no need to introduce a new character such as ! for
      negation because the semantics of the - are clear and are
      consistent with its use for tags. A - binds more tightly
      than & which in turn binds more tightly than |. A +
      selector can also be used for positive selection of a
      parenthesized term but it is equivalent to using no
      selector, just as for tags.
     
   b. Because \'s are so heavily used in regex's and because they
      have to be doubled in strings, using \'s for an additional
      escape layer would be messy, ambiguous, and hard to read.
      Only the {}'s need to be escaped and the doubling escapes
      {{ -> { and }} -> } are simple, readable, and fast to
      parse. For example: "+{abc\\{{3,7\\}}}" gives the regex
      "abc\\{3,7\\}". Parity makes correctness clear at a glance.
     
   c. Because headline (and priority) searches can be useful and
      powerful, and because the information on those fields is
      *already processed* in `org-scan-tags', we get those
      special searches *essentially for free*, requiring only two
      minor changes to `org-scan-tags'. See the unified diff in
      comments. The special PRIORITY property already exists; I
      added the special HEADING property for these purposes. I'm
      open to changing the name of course, but I do think the
      feature is both useful and elegant. (I'm using it in my
      application, for instance.)

   d. I did not see it in the manual, but I think that property names
      with hyphens should have these \-escaped -'s in the query
      string, with the escaping slashes removed in the produced
      matcher. This is not currently done, but the new version does.
      See Note h for details.

   e. It seems desirable to support both = and == as equality operators
      since the latter is so common by habit. The new version allows
      this explicitly. The original version does as well, but the
      regex for the comparison operator also allows other operators
      <<, ><, >>, =>, and >= as well, which can produce bad matchers.
      See Note h for details.

   f. Currently, spaces are ignored around &, |, the implicit & between
      terms, around the comparison operators in property searches,
      and around +/- selectors. Spaces are not ignored inside {}'s
      for a regexp match.

   g. The current code also allows +/- selectors before property
      comparisons. I don't really like this because
      +PROP<>"something" and -PROP="something" have the same
      meaning but look very different. But the new code does
      support this. As a side note, there's really no need for
      the & characters as +/- serve the and/and-not function
      completely. But again, no prob.

   h. A few bugs detected in the 7.8.11 code:

      + Faulty test for todo matcher in org-make-tags-matcher
        (string-match "/+" match)

        Ex: (org-make-tags-matcher "PROP={^\\s-*// .*$}") produces
        an erroneous matcher:

            ("PROP={^\\s-*// .*$}" progn
             (setq org-cached-props nil)
             (member "PROP" tags-list))

        Because {}'s can appear in /!? todo-match expressions, and
        becauseA arbitrary strings can be TODO keywords, a simple
        pattern match will not give a complete solution. For
        instance, both PROP={/!} and PROP="/!{/!}" are valid TODO
        keywords (it works!) *and* valid property comparisons. We
        want to find the first slash that is not enclosed in {}'s or
        ""'s; if found, a todo match is needed. The new parser
        handles the two types of matching together and so
        automatically detects the todo matches correctly.
       
      + Property names with -'s are not handled properly (cf. Note d)
       
        Specifically, the escapes are not removed. Example:
        (org-make-tags-matcher "PROP\\-WITH\\-HYPHENS=2")
        produces
       
        ("PROP\\-WITH\\-HYPHENS=2" and
         (progn
         (setq org-cached-props nil)
         (=
          (string-to-number
           (or (org-cached-entry-get nil "PROP\\-WITH\\-HYPHENS")
           ""))
          2))
         t)
       
        The original code /does/ instead remove -'s from tag
        names, which shouldn't have them anyway. I suspect that
        this was intended for property names rather than tag
        names. The new version fixes up property names but does
        not allow -'s in tags.

      + Incorrect comparison operators allowed (cf. Note e)
       
        The regular _expression_ used is "[<=>]\\{1,2\\}" is used to
        detect the comparison operators. But this can produce bad
        matchers that fail opaquely at match time rather than
        giving an appropriate error message at parse time.

        Ex: (org-make-tags-matcher "P<<2") produces

         ("P<<2" and
          (progn
            (setq org-cached-props nil)
            (nil
             (string-to-number (or (org-cached-entry-get nil "P") "")) 2))
          t)

        This is fixed in the new version and delivers an error
        message at parse time.

      + missing org-re (line 7179 in org.el) with posix classes
       
        Minor consistency issue.  This line does not occur in the new
        code.


Thanks and regards,

   Christopher Genovese




On Sat, Aug 4, 2012 at 6:07 AM, Christopher Genovese <address@hidden> wrote:
A small addendum:

Right after I posted (of course!), I noticed both a small mistake and an opportunity for
simplification, relating to detecting and processing the todo expressions after a /.
Specifically, the approximate fix I proposed to the bug in the 7.8 code is insufficient
to handle regexp matching in todo expressions. The full solution using the
new function org-find-todo-query is needed in the code as I have it, and that can just
be plugged in instead of the string-match. (See Note h in the original post, specifically
the string-match given there, and org-find-todo-query.)

But I realized that all that is unnecessary as the todo processing can be easily built
into the new parser. I've mostly done that just now and will test and post an update
Saturday, er...today.

Sorry to muddy the waters by not catching that earlier, and sorrier still if I'm rambling. 
Off to bed...

 -- Christopher



On Sat, Aug 4, 2012 at 3:50 AM, Christopher Genovese <address@hidden> wrote:
I am writing an application layer on top of org that uses the
entry mapping API, but I needed both negation of complex
selections and heading searches. Because the current tag query
parser does not handle parenthesized expressions, it does not
allow negating complex queries. At first, I wrote a workaround
solution that mimics/specializes the mapping API, but that
approach seemed inelegant and harder to maintain.

So instead I implemented a full parser for tag queries with a
number of useful features (see the labeled Notes at the bottom
for further comments on these features):

  1. Parenthesized expressions to arbitrary depth are allowed.
  2. A '-' can be used to negate a parenthesized term.             [Note a]
  3. Regex's in {} can contain braces escaped by doubling: {{ }}.  [Note b]
  4. Supports fast property search on HEADING and PRIORITY.        [Note c]
  5. Handles hyphens in property names properly.                   [Note d,h]
  6. Allows only the proper comparison operators, including ==.    [Note e,h]
  7. Allows spaces around operators and terms for readability.     [Note f]
  8. Matchers use the original _expression_ order; not a big
     deal, but free.
  9. The error messages during parsing are reasonably helpful.
  10. Several bug fixes and a cleaner `org-make-tags-matcher'.     [Note h]

I'm submitting the code for your consideration, with the
goal of eventually incorporating this into org.el. I would be

happy to hear any comments or suggestions you have. As I'll describe
below, this involves relatively minor changes to two existing
functions and adding a few new support functions. I've attached two
files org-tag-query-parse.el (the code) and tag-query-tests.el (a
collection of tests built on a simple framework). I've also
put the files in http://www.stat.cmu.edu/~genovese/emacs/. The
comments in both files will I hope be helpful.

At the risk of going on too long, I'd like to add a few comments
about the code and tests. First, the two existing functions that
are affected in the code are `org-make-tags-matcher' and
`org-scan-tags'. In the new version of the former, I've extracted
out both kinds of query parsing, leading to a shorter and cleaner
function. The new version of the latter differs in only a couple
*very minor* places that capture two values that were already
being computed anyway (see the diff reproduced in the comments).
Btw, I'm working from the 7.8.11 code.

Loading org-tag-query-parse.el does not change the original
functions. Instead, I've added a `-NEW' to the names of these
functions and saved the originals also with a `-ORIGINAL' added.
After loading the file, you can choose a version to try by doing

    (org-tmp-use-tag-parser 'new)
and
    (org-tmp-use-tag-parser 'original)

or do (org-tmp-use-tag-parser) to toggle between versions.
You can also just use the names with suffixes directly.
I'd also suggest byte-compiling the file.

I think the place to start looking at the code is the new version
of `org-make-tags-matcher'. The main entry function for the new
parser is `org-tag-query-parse', though the real workhorse is
actually the function `org-tag-query-parse-1'. There is also a
new function `org-todo-query-parse' which just extracts the
existing todo matching method. (I didn't do anything with that
method as the manual makes it clear that it is of secondary
importance.) I think the modularity here makes
`org-make-tags-matcher' and each separate parser easier to read
and understand.

The other substantial piece (in terms of lines of code) is a utility
macro `org-match-cond' that is used throughout and makes the main
parser much more readable IMHO. Admittedly, I went a bit
overboard in optimizing it; the first version worked fine
but this one produces really nice code. I'd suggest ignoring this
code (in section "Parsing utility for readable matchers") on
first pass. The docstring is pretty complete, and its use is more
or less self-explanatory. Most of its work is done at compile time.

To run the tests, load org-tag-query-parse.el and tag-query-tests.el
and do

   (tag-test-run :results) ; use :summary for a brief summary of all runs
   (tag-test-other-tests)  ; miscellaneous other tests, including scanning

or name individual suites. They are at the moment:

   (tag-test-run :results 'org-comparison-1)  ; or use :summary
   (tag-test-run :results 'org-comparison-2)
   (tag-test-run :results 'match-results-1)
   (tag-test-run :results 'match-results-2)
   (tag-test-run :results 'should-error-1)

If you have other ideas for tests or find any bugs, please let me
know. Sorry for the homegrown framework; it just sort of grew and
then I was too tired to rewrite the tests. One complication here
is that the original and new algorithms produce different term
orders and use a few different functions. The function
tag-test-transform transforms original results to the new
algorithms conventions, but it does not handle PRIORITY or
HEADING matches at the moment. Use the tree form of the tess (see
match-results-1 for example) on these. Btw, I've run the tests on
GNU Emacs 23.2 and 24.1 (running on OS X lion).

Notes:
   a. There is no need to introduce a new character such as ! for
      negation because the semantics of the - are clear and are
      consistent with its use for tags. A - binds more tightly
      than & which in turn binds more tightly than |. A +
      selector can also be used for positive selection of a
      parenthesized term but it is equivalent to using no
      selector, just as for tags.
     
   b. Because \'s are so heavily used in regex's and because they
      have to be doubled in strings, using \'s for an additional
      escape layer would be messy, ambiguous, and hard to read.
      Only the {}'s need to be escaped and the doubling escapes
      {{ -> { and }} -> } are simple, readable, and fast to
      parse. For example: "+{abc\\{{3,7\\}}}" gives the regex
      "abc\\{3,7\\}". Parity makes correctness clear at a glance.
     
   c. Because headline (and priority) searches can be useful and
      powerful, and because the information on those fields is
      *already processed* in `org-scan-tags', we get those
      special searches *essentially for free*, requiring only two
      minor changes to `org-scan-tags'. See the unified diff in
      comments. The special PRIORITY property already exists; I
      added the special HEADING property for these purposes. I'm
      open to changing the name of course, but I do think the
      feature is both useful and elegant. (I'm using it in my
      application, for instance.)

   d. I did not see it in the manual, but I think that property names
      with hyphens should have these \-escaped -'s in the query
      string, with the escaping slashes removed in the produced
      matcher. This is not currently done, but the new version does.
      See Note h for details.

   e. It seems desirable to support both = and == as equality operators
      since the latter is so common by habit. The new version allows
      this explicitly. The original version does as well, but the
      regex for the comparison operator also allows other operators
      <<, ><, >>, =>, and >= as well, which can produce bad matchers.
      See Note h for details.

   f. Currently, spaces are ignored around &, |, the implicit & between
      terms, around the comparison operators in property searches,
      and around +/- selectors. Spaces are not ignored inside {}'s
      for a regexp match.

   g. The current code also allows +/- selectors before property
      comparisons. I don't really like this because
      +PROP<>"something" and -PROP="something" have the same
      meaning but look very different. But the new code does
      support this. As a side note, there's really no need for
      the & characters as +/- serve the and/and-not function
      completely. But again, no prob.

   h. A few bugs detected in the 7.8.11 code:

      + Faulty test for todo matcher in org-make-tags-matcher
        (string-match "/+" match)

        Ex: (org-make-tags-matcher "PROP={^\\s-*// .*$}") produces
        an erroneous matcher:

            ("PROP={^\\s-*// .*$}" progn
             (setq org-cached-props nil)
             (member "PROP" tags-list))

        For all practical purposes it will be enough to do:
       
         (string-match "\\(/\\(!\\)?\\s-*\\)[^{}\"]*$" match)
             
        instead of the current test in org-make-tags-matcher.
        This works as long as the TODO keywords do not contain a
        right brace or quotation marks. (In most other cases, the
        new parser should give an error message at parse time.)
       
        A technicality: this is /not/ a complete solution because
        arbitrary strings can be TODO keywords. For instance,
        both PROP={/!} and PROP="/!{/!}" are valid TODO keywords
        (it works!) *and* valid property comparisons. So, a pattern
        alone is insufficient. We want to find the first slash
        that is not enclosed in {}'s or ""'s; if found, a todo
        match is needed. The function `org-find-todo-query' does
        this and (org-find-todo-query match) can be plugged in
        directly replacing the above (string-match ...) in then
        new `org-make-tags-matcher'.
       
        But because the todo parsing uses {}'s for regex matches,
        TODO keywords with {}'s are ignored anyway. So there's
        no need to go beyond the fixed string-match above.
        The function `org-todo-query-parse', which handles todo
        parsing in the new version, makes this explicit.
       
      + Property names with -'s are not handled properly (cf. Note d)
       
        Specifically, the escapes are not removed. Example:
        (org-make-tags-matcher "PROP\\-WITH\\-HYPHENS=2")
        produces
       
        ("PROP\\-WITH\\-HYPHENS=2" and
         (progn
         (setq org-cached-props nil)
         (=
          (string-to-number
           (or (org-cached-entry-get nil "PROP\\-WITH\\-HYPHENS")
           ""))
          2))
         t)
       
        The original code /does/ instead remove -'s from tag
        names, which shouldn't have them anyway. I suspect that
        this was intended for property names rather than tag
        names. The new version fixes up property names but does
        not allow -'s in tags.

      + Incorrect comparison operators allowed (cf. Note e)
       
        The regular _expression_ used is "[<=>]\\{1,2\\}" is used to
        detect the comparison operators. But this can produce bad
        matchers that fail opaquely at match time rather than
        giving an appropriate error message at parse time.

        Ex: (org-make-tags-matcher "P<<2") produces

         ("P<<2" and
          (progn
            (setq org-cached-props nil)
            (nil
             (string-to-number (or (org-cached-entry-get nil "P") "")) 2))
          t)

        This is fixed in the new version and delivers an error
        message at parse time.

      + missing org-re (line 7179 in org.el) with posix classes
       
        Minor consistency issue.  This line does not occur in the new
        code.


Thanks and regards,

   Christopher Genovese




Attachment: org-tag-query-parse.el
Description: Binary data

Attachment: tag-query-tests.el
Description: Binary data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]