[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[nongnu] elpa/parseclj 0702332e5d 046/185: Merge branch 'master' of http
From: |
ELPA Syncer |
Subject: |
[nongnu] elpa/parseclj 0702332e5d 046/185: Merge branch 'master' of https://github.com/lambdaisland/clj-parse |
Date: |
Tue, 28 Dec 2021 14:05:14 -0500 (EST) |
branch: elpa/parseclj
commit 0702332e5d9fb7ea4caecea4fc18422eb666e799
Merge: 553f8618e1 29411e2103
Author: Daniel Barreto <dbarreto@talpor.com>
Commit: Daniel Barreto <dbarreto@talpor.com>
Merge branch 'master' of https://github.com/lambdaisland/clj-parse
---
.travis.yml | 4 +-
DESIGN.md | 299 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
clj-lex-test.el | 44 +++++++++
clj-lex.el | 53 ++++++----
4 files changed, 378 insertions(+), 22 deletions(-)
diff --git a/.travis.yml b/.travis.yml
index c6f2126964..a73fa2eddf 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -2,8 +2,8 @@ language: generic
dist: trusty
env:
- VERSION=24.3.1 EMACS=emacs PACKAGE=emacs24
- - VERSION=25.2.2 EMACS=emacs PACKAGE=emacs25
- - VERSION=25.2.2 EMACS=emacs PACKAGE=emacs25 CLJ_PARSE_LINT=true
+ - VERSION=25.2.2 EMACS=emacs-25.2 PACKAGE=emacs25
+ - VERSION=25.2.2 EMACS=emacs-25.2 PACKAGE=emacs25 CLJ_PARSE_LINT=true
matrix:
allow_failures:
diff --git a/DESIGN.md b/DESIGN.md
new file mode 100644
index 0000000000..f8a44511f5
--- /dev/null
+++ b/DESIGN.md
@@ -0,0 +1,299 @@
+# clj-parse Design Goals / Roadmap
+
+clj-parse is an Emacs Lisp library for parsing Clojure code and EDN data. It
+supports several input and output formats, all powered by the same shift-reduce
+parser function.
+
+This documents describes the design goals for clj-parse, and as such may
describe features which are not implemented yet.
+
+## Motivation
+
+People's parsing needs can differ for several reasons
+
+- parsing code (Clojure) vs parsing data (EDN, a subset of Clojure)
+- asserting valid input (fail fast) vs dealing gracefully with syntax errors
(editor integration)
+- analyzing code (whitespace can be ignored) vs doing transformations on code
(whitespace aware round-tripping)
+- parsing the contents of a buffer, or a string, or a file
+- strict parsing (all tagged literals must be understood) vs pass-through
(parsing/unparsing unknown tagged literals while ignoring semantics)
+
+This library aims to support all of these use cases, either directly, or by
providing the building blocks to do it yourself.
+
+## Prior art
+
+[edn.el](https://github.com/expez/edn.el) is an EDN-to-elisp parser based on
the PEG parser generator library.
+
+## Challenges
+
+The data structures available in Emacs are less rich than those used by
Clojure.
+
+- Clojure has `nil` and `false`, Emacs only has `nil`.
+- Emacs has no notion of sets
+- Emacs has no date/timestamp type
+- Emacs does not support custom records/types (there is a Common Lisp inspired
object system, but it implements types on top of regular lists and vectors).
+- Emacs does not support adding metadata to values
+
+On the other hand Emacs supports strings/buffers with arbitrary encoding, on
the JVM and on JavaScript strings are always UTF-16/UCS-2.
+
+## Architecture
+
+The implementation is implemented in three parts: a lexer, a parser, and
multiple reducers.
+
+### Lexer
+
+The *lexer* turns the input text, a buffer, into tokens, data structures
representing a single syntactical unit, such as a symbol, a number, or a
delimiter like "(", ")", "#{", or "#_".
+
+In clj-parse the lexer is a single function `clj-lex-next` which can be called
repeatedly to get a sequence of tokens. `clj-lex-next` returns the token at
"point" (i.e. the Emacs cursor position), and moves point to after the token.
+
+A *token* is a association list (list of cons cells), with keys `:token-type`,
`:form`, `:position`, and optionally `:error-type`.
+
+Note: we don't add line/column numbers to the token, the consumer can add
these if needed based on the position of point before calling `clj-lex-next`.
+
+Example:
+
+This Clojure/EDN input:
+
+``` clojure
+(42 "hello" #_ignored #{:a})
+```
+
+Results in these tokens
+
+``` emacs-lisp
+((:token-type . :lparen)
+ (:form . "(")
+ (:pos . 1))
+
+((:token-type . :number)
+ (:form . "42")
+ (:pos . 2))
+
+((:token-type . :whitespace)
+ (:form . " ")
+ (:pos . 4))
+
+((:token-type . :string)
+ (:form . "\"hello\"")
+ (:pos . 5))
+
+((:token-type . :whitespace)
+ (:form . " ")
+ (:pos . 12))
+
+((:token-type . :discard)
+ (:form . "#_")
+ (:pos . 13))
+
+((:token-type . :symbol)
+ (:form . "ignored")
+ (:pos . 15))
+
+((:token-type . :whitespace)
+ (:form . " ")
+ (:pos . 22))
+
+((:token-type . :set)
+ (:form . "#{")
+ (:pos . 23))
+
+((:token-type . :keyword)
+ (:form . ":a")
+ (:pos . 25))
+
+((:token-type . :rbrace)
+ (:form . "}")
+ (:pos . 27))
+
+((:token-type . :rparen)
+ (:form . ")")
+ (:pos . 28))
+
+((:token-type . :eof)
+ (:form . nil)
+ (:pos . 29))
+```
+
+Note that the lexer makes no attempt to "interpret" the tokens, it merely
finds their boundaries. Concatentating the `:form` values yields a string
identical to the original input.
+
+Tokens can be recognized by the `:token-type` key, which must always come
first in the association list.
+
+## Shift-reduce parser
+
+The parser is again a single function `clj-parse-reduce`. It is a higher order
function, with much of the final result determined by the `reduce-leaf` and
`reduce-node` functions passed in as arguments.
+
+`clj-parse-reduce` internally operates by using a stack. This stack contains
tokens (as returned by `clj-lex-next`), and reduced values.
+
+`reduce-leaf` is a two-argument function. It takes the current value of the
stack, and a token, and returns an updated stack, typically by parsing the
token to a value and pushing that value onto the stack.
+
+`reduce-node` is a three-argument function. It takes the current value of the
stack, a node type, and a list of children, and returns an updated stack.
+
+The parser reads consecutive input tokens. If the token represents a leaf node
(like a number, symbol, string), then it calls `reduce-leaf`, giving it a
chance to add a value to the stack. If the token is a non-leaf node (a
delimiter) it gets put on the stack as-is. This part is known as the "shift"
phase.
+
+After "shifting" a value on to the stack, the parser tries to "reduce", by
inspecting the top one, two, or three items on the stack.
+
+If the top item is a closing delimiter token, then the parser scans down the
stack until it finds a matching opening delimiter. It pops both delimiters and
everything in between them off the stack, and passes them to `reduce-node`,
which can "reduce" this sequence to a single value (say, a list), and push that
item onto the stack.
+
+The type of values pushed onto the stack depends on the reducing functions
used. The parser only distinguishes between tokens and non-tokens. It follows
that a reducing functions should not push raw tokens back onto the stack.
+
+When parsing finishes the stack will contain all parsed forms in reverse
order. It will call `reduce-node` once more with a node type of `:root`, to
give it a chance to finalize.
+
+### Example
+
+An example, when parsing the following EDN, with parsing done up to the
position of `|`, the stack looks as follows.
+
+``` clojure
+;; input
+(1 2 (3|))
+```
+
+``` emacs-lisp
+;; stack
+(3
+ ((:token-type . :lparen) (:form . "(") (:pos . 6)))
+ 2
+ 1
+ ((:token-type . :lparen) (:form . "(") (:pos . 1)))
+```
+
+Now the parser encounters the first closing parenthesis. It pops `3` and
`:lparen` off the stack, and passes `(((:token-type . :lparen) 3 ((:token-type
. :rparen)))` to `reduce-node`, which reduces this a single list, and pushes it
onto the stack.
+
+``` clojure
+;; input
+(1 2 (3)|)
+```
+
+``` emacs-lisp
+;; stack
+((3)
+ 2
+ 1
+ ((:token-type . :lparen) (:form . "(") (:pos . 1)))
+```
+
+Now the parser encounters the second closing parenthesis. It pops everything
until `:lparen` off the stack, and passes it to `reduce-node`, which turns the
result into a list and pushes it onto the stack.
+
+``` clojure
+;; input
+(1 2 (3))|
+```
+
+``` emacs-lisp
+;; stack
+((1 2 (3)))
+```
+
+### Dealing with parse errors
+
+`clj-parse-reduce` needs to be able to parse invalid input. Imagine analyzing
a user's buffer while they are editing, to provide contextual help or do
linting. Even when delimiters are unbalanced it should still be possible to get
a "best effort" parse result. It turns out the shift-reduce approach provides
that out of the box. The result of parsing invalid input is a stack which still
has unreduced tokens in it.
+
+Unmatched opening delimiter:
+
+``` clojure
+;; input
+(1 2 { 3)
+```
+
+``` emacs-lisp
+;; result
+((1 2 ((:token-type . :lbrace)) 3))
+```
+
+Unmatched closing delimiter:
+
+``` clojure
+;; input
+(1 2 3))
+```
+
+``` emacs-lisp
+;; result
+((1 2 3) ((:token-type . :lparen)))
+```
+
+In many cases it will be desirable to "fail fast", and raise an error as soon
as a syntax error is encountered. A `reduce-node` function can do so if it
wishes by checking its input sequence for raw tokens, and raising an error if
any are present.
+
+## EDN vs Clojure
+
+EDN syntax is a subset of Clojure syntax used for representing data (as
opposed to code).
+
+Several constructs that are common in Clojure code are not valid in EDN. This
includes
+
+- quoting, syntax quoting, syntax quote splicing, etc.
+- read time evaluation with `#=(,,,)`
+
+## Dealing with tagged literals
+
+EDN/Clojure syntax is extensible. Values can be prefixed with user-provided
tags, which indicates they need to be transformed after reading by a registered
handler function.
+
+In general a consumer of EDN needs to be aware of the possible tags that can
occur in the input, as well as their semantics, in order to preserve the data's
semantics. When a parser encounters an unknown tag then this is considered an
error.
+
+The EDN parser functions will take an optional tag handler function. This
function receives two arguments, the tag symbol, and the parsed form that
follows the tag. It should either return a correctly coerced value, or raise an
error. A default function that knows how to deal with the tags that are part of
the EDN spec will be provided.
+
+When parsing code to an AST the situation is different, here tags simply
become part of the AST, without caring about their semantics.
+
+## AST
+
+An AST (abstract syntax tree) is a tree structure made up of nodes. A node
looks like this
+
+``` emacs-lisp
+((:node-type . :root)
+ (:position . 0)
+ (:children . (((:node-type . :comment)
+ (:position . 0)
+ (:form . ";; cool stuff\n;; in here"))
+ ((:node-type . :list)
+ (:position . 26)
+ (:children . (((:node-type . :number)
+ (:position . 27)
+ (:form "123")
+ (:value 123))))))))
+```
+
+Nodes are an alist with `:node-type` as the first key and a `:position`. Leaf
nodes have `:form` and `:value` keys, for the unparsed and parsed form
respectively. `:whitespace` and `:comment` nodes only have a `:form`, not a
`:value`.
+
+Non-leaf nodes contain a list of `:children`.
+
+## Public API
+
+clj-parse provides three "parse modes"
+
+- `edn` meant for parsing data, it parses EDN to emacs lisp data
+- `ast` meant for analyzing source code, parses to a "semantic" AST, does not
preserve whitespace or comments
+- `source` meant for transforming/round-tripping source codes. Like `ast`, but
preserves whitespace and comments
+
+For each of these there can be the following functions
+
+- `clj-parse-{mode}` parse the current buffer starting at `point`, raise an
error when syntax/lexing errors are encountered
+- `clj-parse-{mode}-full` same as above but ignore syntax errors, returning a
partially parsed result
+- `clj-print-{mode}` turn the result of the corresponding parse function back
into Clojure/EDN, and insert it into the current buffer
+
+Each of these have `-str` variant which instead works on strings. This yields
a total potential API of:
+
+```
+(defun clj-parse-edn (&OPTIONAL tag-handler))
+(defun clj-parse-edn-full (&OPTIONAL tag-handler))
+(defun clj-print-edn (edn))
+(defun clj-parse-edn-str (string &OPTIONAL tag-handler))
+(defun clj-parse-edn-full-str (string &OPTIONAL tag-handler))
+(defun clj-print-edn-str (edn))
+(defun clj-parse-ast ())
+(defun clj-parse-ast-full ())
+(defun clj-print-ast (node))
+(defun clj-parse-ast-str ())
+(defun clj-parse-ast-full-str ())
+(defun clj-print-ast-str (node))
+(defun clj-parse-source ())
+(defun clj-parse-source-full ())
+(defun clj-print-source (node))
+(defun clj-parse-source-str ())
+(defun clj-parse-source-full-str ())
+(defun clj-print-source-str (node))
+```
+
+This may seem like a massive API surface, but these will all be only one or
two lines, combining the generic parsing function with the right reducers.
+
+Beyond that we provide two sets of functions for working with AST nodes.
+
+- zipper-like functions for navigating and transforming ASTs
+- functions for reading values out of ASTs, effectively treating them as the
data structure they represent.
+
+This second set of functions is important for applications that need to deal
with EDN data, but for which the lossy nature of EDN->Elisp transformation is
not an option. For instance, unrepl sends EDN messages, but these messages
contain code forms that we need to be able to reproduce. In this case
converting `false` to `nil` or a set to a list is not acceptable. Instead we
can parse the EDN to an AST, and deal with the AST directly.
diff --git a/clj-lex-test.el b/clj-lex-test.el
index 405420afcb..30d1082625 100644
--- a/clj-lex-test.el
+++ b/clj-lex-test.el
@@ -63,6 +63,21 @@
(goto-char 1)
(should (equal (clj-lex-next) '((type . :symbol) (form . "hello-world")
(pos . 1)))))
+ (with-temp-buffer
+ (insert "-hello-world")
+ (goto-char 1)
+ (should (equal (clj-lex-next) '((type . :symbol) (form . "-hello-world")
(pos . 1)))))
+
+ (with-temp-buffer
+ (insert "#inst")
+ (goto-char 1)
+ (should (equal (clj-lex-next) '((type . :tag) (form . "#inst") (pos .
1)))))
+
+ (with-temp-buffer
+ (insert "#qualified/tag")
+ (goto-char 1)
+ (should (equal (clj-lex-next) '((type . :tag) (form . "#qualified/tag")
(pos . 1)))))
+
(with-temp-buffer
(insert "\\newline\\return\\space\\tab\\a\\b\\c")
(goto-char 1)
@@ -101,6 +116,11 @@
(goto-char 1)
(should (equal (clj-lex-next) (clj-lex-token :keyword ":hello-world" 1))))
+ (with-temp-buffer
+ (insert ":hello/world")
+ (goto-char 1)
+ (should (equal (clj-lex-next) (clj-lex-token :keyword ":hello/world" 1))))
+
(with-temp-buffer
(insert "::hello-world")
(goto-char 1)
@@ -184,6 +204,7 @@
(should (equal (clj-lex-symbol-start? ?a) t))
(should (equal (clj-lex-symbol-start? ?A) t))
(should (equal (clj-lex-symbol-start? ?.) t))
+ (should (equal (clj-lex-symbol-start? ?. t) nil))
(should (equal (clj-lex-symbol-start? ?~) nil))
(should (equal (clj-lex-symbol-start? ? ) nil)))
@@ -195,6 +216,29 @@
(should (equal (clj-lex-symbol-rest? ?~) nil))
(should (equal (clj-lex-symbol-rest? ? ) nil)))
+(ert-deftest clj-lex-test-get-symbol-at-point ()
+ (with-temp-buffer
+ (insert "a-symbol")
+ (goto-char 1)
+ (should (equal (clj-lex-get-symbol-at-point 1) "a-symbol"))
+ (should (equal (point) 9))))
+
+(ert-deftest clj-lex-test-invalid-tag ()
+ (with-temp-buffer
+ (insert "#.not-a-tag")
+ (goto-char 1)
+ (should (equal (clj-lex-next) '((type . :lex-error) (form . "#.not-a-tag")
(pos . 1) (error-type . :invalid-hashtag-dispatcher)))))
+
+ (with-temp-buffer
+ (insert "#-not-a-tag")
+ (goto-char 1)
+ (should (equal (clj-lex-next) '((type . :lex-error) (form . "#-not-a-tag")
(pos . 1) (error-type . :invalid-hashtag-dispatcher)))))
+
+ (with-temp-buffer
+ (insert "#+not-a-tag")
+ (goto-char 1)
+ (should (equal (clj-lex-next) '((type . :lex-error) (form . "#+not-a-tag")
(pos . 1) (error-type . :invalid-hashtag-dispatcher))))))
+
(ert-deftest clj-lex-test-string ()
(with-temp-buffer
(insert "\"abc\"")
diff --git a/clj-lex.el b/clj-lex.el
index af6581ed01..af07cd40bd 100644
--- a/clj-lex.el
+++ b/clj-lex.el
@@ -25,7 +25,7 @@
(defun clj-lex-token (type form pos &rest args)
`((type . ,type)
(form . ,form)
- (pos . , pos)
+ (pos . ,pos)
,@(mapcar (lambda (pair)
(cons (car pair) (cadr pair)))
(-partition 2 args))))
@@ -81,26 +81,34 @@
(and (member char '(?- ?+ ?.))
(clj-lex-digit? (char-after (1+ (point))))))))
-(defun clj-lex-symbol-start? (char)
+(defun clj-lex-symbol-start? (char &optional alpha-only)
"Symbols begin with a non-numeric character and can contain
- alphanumeric characters and . * + ! - _ ? $ % & = < >. If -, +
- or . are the first character, the second character (if any)
- must be non-numeric."
+alphanumeric characters and . * + ! - _ ? $ % & = < >. If -, + or
+. are the first character, the second character (if any) must be
+non-numeric.
+
+In some cases, like in tagged elements, symbols are required to
+start with alphabetic characters only. ALPHA-ONLY ensures this
+behavior."
(not (not (and char
(or (and (<= ?a char) (<= char ?z))
(and (<= ?A char) (<= char ?Z))
- (member char '(?. ?* ?+ ?! ?- ?_ ?? ?$ ?% ?& ?= ?< ?>
?/)))))))
+ (and (not alpha-only) (member char '(?. ?* ?+ ?! ?- ?_ ??
?$ ?% ?& ?= ?< ?> ?/))))))))
(defun clj-lex-symbol-rest? (char)
(or (clj-lex-symbol-start? char)
(clj-lex-digit? char)))
+(defun clj-lex-get-symbol-at-point (pos)
+ "Return the symbol at point."
+ (while (clj-lex-symbol-rest? (char-after (point)))
+ (right-char))
+ (buffer-substring-no-properties pos (point)))
+
(defun clj-lex-symbol ()
(let ((pos (point)))
(right-char)
- (while (clj-lex-symbol-rest? (char-after (point)))
- (right-char))
- (let ((sym (buffer-substring-no-properties pos (point))))
+ (let ((sym (clj-lex-get-symbol-at-point pos)))
(cond
((equal sym "nil") (clj-lex-token :nil "nil" pos))
((equal sym "true") (clj-lex-token :true "true" pos))
@@ -162,10 +170,7 @@
(when (equal (char-after (point)) ?:)
(right-char))
(if (clj-lex-symbol-start? (char-after (point)))
- (progn
- (while (clj-lex-symbol-rest? (char-after (point)))
- (right-char))
- (clj-lex-token :keyword (buffer-substring-no-properties pos (point))
pos))
+ (clj-lex-token :keyword (clj-lex-get-symbol-at-point pos) pos)
(progn
(right-char)
(clj-lex-token :lex-error (buffer-substring-no-properties pos (point))
pos 'error-type :invalid-keyword)))))
@@ -221,13 +226,21 @@
((equal char ?#)
(right-char)
(let ((char (char-after (point))))
- (cl-case char
- (?{
- (right-char)
- (clj-lex-token :set "#{" pos))
- (?_
- (right-char)
- (clj-lex-token :discard "#_" pos)))))
+ (cond
+ ((equal char ?{)
+ (right-char)
+ (clj-lex-token :set "#{" pos))
+ ((equal char ?_)
+ (right-char)
+ (clj-lex-token :discard "#_" pos))
+ ((clj-lex-symbol-start? char t)
+ (right-char)
+ (clj-lex-token :tag (concat "#" (clj-lex-get-symbol-at-point (1+
pos))) pos))
+ (t
+ (while (not (or (clj-lex-at-whitespace?)
+ (clj-lex-at-eof?)))
+ (right-char))
+ (clj-lex-token :lex-error (buffer-substring-no-properties pos
(point)) pos 'error-type :invalid-hashtag-dispatcher)))))
":("))))
- [nongnu] elpa/parseclj 56bb020a9e 020/185: Fix the travis yaml?, (continued)
- [nongnu] elpa/parseclj 56bb020a9e 020/185: Fix the travis yaml?, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 972161dd23 022/185: Make sure we test against the version we think we are testing against, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 792ba04647 024/185: A more elaborate test, nesting works!, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 452fe7cc76 023/185: Implement nil, true, false, symbol, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 67171853f5 026/185: lex characters, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 4f54ba52fe 033/185: Add support for sets, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 4f647c8cff 036/185: Remove duplicate defination of clj-lex--token-token, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 1b2b221c98 038/185: Add AST to Elisp and Clojure/EDN printers, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 9404763e11 042/185: Make sure Travis uses the right Emacs, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 29411e2103 045/185: Add a design document to share vision and collect feedback, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 0702332e5d 046/185: Merge branch 'master' of https://github.com/lambdaisland/clj-parse,
ELPA Syncer <=
- [nongnu] elpa/parseclj 553f8618e1 044/185: Rework `clj-parse-deftest` macro, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj da89bf82a3 050/185: Allow colons inside symbols, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 81fe979450 051/185: Handle parsing of semicolon-based comments., ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj a424f87c1b 054/185: :#/# is a valid keyword. :::hello is not, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj f34b8460a8 055/185: Correctly parse numbers in scientific notation, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 6fe4ce6095 058/185: Add script to compare speed of edn.el and clj-parse.el, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 7d70ee4c38 059/185: Add support for tagged literals, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 4367db07aa 061/185: Move tests to tests/, bench to benchmark/, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj 46570ce843 060/185: A # can be part of a symbol, ELPA Syncer, 2021/12/28
- [nongnu] elpa/parseclj fcd1a086f6 081/185: Rename clj-ast to parseclj-ast, ELPA Syncer, 2021/12/28