Re: [PATCH] Interpret #r"..." as a raw string

emacs-devel

From:	Daniel Brooks
Subject:	Re: [PATCH] Interpret #r"..." as a raw string
Date:	Tue, 02 Mar 2021 01:56:43 -0800
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)

Matt Armstrong <matt@rfc20.org> writes:

> Alan Mackenzie <acm@muc.de> writes:
>
> C++ has probably the most flexible "gold standard" raw string literals.

With respect, I think that Raku “wins” this
fight. https://docs.raku.org/language/quoting is really worth reading;
it's a work of art. You can think of the quote operator as a function
that takes 13 named boolean arguments plus a choice of opening and
closing delimiters.

> As Alan I think rightly points out, this makes the language and all
> tools that process the language more complex.  This is a high cost, so
> the feature should deliver some real value.

Certainly true. As the ordinary Lisp string syntax already allows
multi-line strings, and interpolation is handled by the format function,
the primary benefit is to turn off escaping. We could also offer a
choice of opening and closing delimiters, though the proposed code
didn't implement that.

I think the benefit will be worth it. If we offered a little more choice
of delimiters, then we could gain more benefit when the string must also
contain double quotes. This need not have a large complexity cost.

> For those that don't know, C++'s raw string literals can be as simple as
> this for the string "raw-content":
>
>    R"(raw-content)"
>
> But if the content itself contains the character sequence )" then the
> programmer can specify any delimiter they want:
>
>    R"DELIMITER(raw-content)"more-raw-content)DELIMITER"
>
> But as you can see above, it isn't always clearer to write a raw string
> literal.

I would say that there are four ways to choose the delimiters.

The simplest way is to accept just one specific delimiter, often
with no way to include that character in the string. For example,
Scala's syntax is raw"foo", but without any form of escaping that will
allow a double quote inside the string. C#'s syntax is @"foo", but you
can include a double-quote by repeating it, so @"foo""bar" is the string
“foo"bar”. Most languages are in this category, and this is how the
proposed code works.

Then there is the sed→Perl→Raku way, where the parser accepts a wide
variety of characters as the opening delimiter, and uses it to compute
which closing delimiter to look for. Raku allows any character not
allowed in identifiers, which is most characters not in the L or N
Unicode categories. Sed and Perl just allow punctuation characters.

There is the Rust way, where the parser looks for a double-quote
preceded by zero or more #'s. The closing delimiter is a double-quote
followed by the same number of #'s.
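
The Rust-style rule can be sketched in a few lines. This is only an
illustration in Go, not the proposed reader change: scanHashRaw is a
hypothetical name, and it assumes the caller has already consumed the
raw-string prefix, so the input starts at the run of #'s.

```go
package main

import (
	"fmt"
	"strings"
)

// scanHashRaw is a hypothetical helper. s must start right after the
// raw-string prefix, at the (possibly empty) run of '#' characters that
// precedes the opening double-quote. It returns the literal's contents
// and whatever text follows the closing delimiter.
func scanHashRaw(s string) (body, rest string, ok bool) {
	// Count the leading '#' characters.
	n := 0
	for n < len(s) && s[n] == '#' {
		n++
	}
	// The opening double-quote must follow immediately.
	if n >= len(s) || s[n] != '"' {
		return "", s, false
	}
	// The closing delimiter is a double-quote plus the same number of '#'.
	closer := `"` + strings.Repeat("#", n)
	start := n + 1
	end := strings.Index(s[start:], closer)
	if end < 0 {
		return "", s, false // unterminated literal
	}
	return s[start : start+end], s[start+end+len(closer):], true
}

func main() {
	body, rest, ok := scanHashRaw(`##"foo"bar"## tail`)
	fmt.Printf("%q %q %v\n", body, rest, ok)
}
```

The only state the scanner carries is the count of #'s, which is why
this scheme adds so little complexity.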

And finally the C++11 way, where it looks for a double-quote followed by
zero to sixteen source characters (with a few minor exceptions) followed
by an opening parenthesis. The closing delimiter is a closing
parenthesis followed by the same zero to sixteen characters in the same
order as in the opening delimiter followed by a double-quote character.
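
For comparison, here is a rough Go sketch of the C++11-style rule.
scanCxxRaw is a hypothetical name, the input is assumed to start at the
opening double-quote (just after the R), and the check on which
characters may appear in the delimiter is deliberately incomplete.

```go
package main

import (
	"fmt"
	"strings"
)

// scanCxxRaw is a hypothetical helper. s must start at the opening
// double-quote, just after the R prefix.
func scanCxxRaw(s string) (body, rest string, ok bool) {
	if !strings.HasPrefix(s, `"`) {
		return "", s, false
	}
	// The delimiter is everything between the quote and the first '('.
	open := strings.IndexByte(s, '(')
	if open < 0 {
		return "", s, false
	}
	delim := s[1:open]
	// At most sixteen characters are allowed. The length is counted in
	// bytes, which assumes an ASCII delimiter. Only some of the excluded
	// characters are rejected here: whitespace, ')' and backslash.
	if len(delim) > 16 || strings.ContainsAny(delim, " )\\\t\n\v\f") {
		return "", s, false
	}
	// The closing delimiter mirrors the opening one: ')' delim '"'.
	closer := ")" + delim + `"`
	start := open + 1
	end := strings.Index(s[start:], closer)
	if end < 0 {
		return "", s, false // unterminated literal
	}
	return s[start : start+end], s[start+end+len(closer):], true
}

func main() {
	body, _, ok := scanCxxRaw(`"DELIMITER(raw-content)"more-raw-content)DELIMITER"`)
	fmt.Printf("%q %v\n", body, ok)
}
```

Compared with the Rust form, the scanner has to remember an arbitrary
delimiter string rather than a single count.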

Of these, I think Raku's way is the most fun because it allows the
widest choice of characters (q🕶awesome!🕶, for example). I'd be fine with
the current proposal, but if others think that it is important to allow
double-quotes inside the raw string, then I think Rust's syntax is the
next logical step. #r##"foo"## would fit in well with the rest of elisp;
it won't look as out of place as the others, and it's only a small
increment in complexity.

Or maybe we want to invent something completely new. As Emacs buffers
may include images which are treated as if they were characters of
unusual size, perhaps we could use gifs. A string bracketed by a GIF of
a dude putting on sunglasses would really show those other languages up.

As it's nicer when delimiters are paired, we could allow the closing GIF
to be horizontally mirrored so that both dudes are either looking
inwards at the string or outwards at the rest of the world.

db48x

PS: if anyone wants to go the Perl/Raku way, I happen to have built a
list of the paired punctuation characters recently:

// Openers in Unicode category Pi (initial quote), closers in Pf (final quote).
var _PiPf = map[rune]rune{
        '«': '»', '‘': '’', '“': '”', '‹': '›', '⸂': '⸃', '⸄': '⸅', '⸉': '⸊',
        '⸌': '⸍', '⸜': '⸝', '⸠': '⸡',
}

// Openers in Unicode category Ps (open punctuation), closers in Pf (final quote).
var _PsPf = map[rune]rune{
        '‚': '’', '„': '”',
}

// Openers in Unicode category Ps (open punctuation), closers in Pe (close punctuation).
var _PsPe = map[rune]rune{
        '(': ')', '[': ']', '{': '}', '༺': '༻', '༼': '༽', '᚛': '᚜', '⁅': '⁆',
        '⁽': '⁾', '₍': '₎', '❨': '❩', '❪': '❫', '❬': '❭', '❮': '❯', '❰': '❱',
        '❲': '❳', '❴': '❵', '⟅': '⟆', '⟦': '⟧', '⟨': '⟩', '⟪': '⟫', '⦃': '⦄',
        '⦅': '⦆', '⦇': '⦈', '⦉': '⦊', '⦋': '⦌', '⦑': '⦒', '⦓': '⦔', '⦕': '⦖',
        '⦗': '⦘', '⧘': '⧙', '⧚': '⧛', '⧼': '⧽', '〈': '〉', '《': '》',
        '「': '」', '『': '』', '【': '】', '〔': '〕', '〖': '〗', '〘': '〙',
        '〚': '〛', '〝': '〞', '︗': '︘', '︵': '︶', '︷': '︸', '︹': '︺',
        '︻': '︼', '︽': '︾', '︿': '﹀', '﹁': '﹂', '﹃': '﹄', '﹇': '﹈',
        '﹙': '﹚', '﹛': '﹜', '﹝': '﹞', '（': '）', '［': '］', '｛': '｝',
        '｟': '｠', '｢': '｣', '⸨': '⸩',
}

// Both characters in Unicode category Sm (math symbol).
var _SmSm = map[rune]rune{
        '<': '>',
}

This is obviously written in Go. My source code is at
https://github.com/db48x/goparsify/blob/master/literals.go#L298-L322.

Feel free to use these tables however you like; I consider them to be a
mere listing of facts and as such they're not copyrightable.

The basic algorithm that Perl uses is that the delimiter may be any
punctuation character, and if the opening delimiter is a key in any of
these tables then the closing delimiter is expected to be the
corresponding value; otherwise the closing delimiter is expected to be
identical to the opening delimiter.
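
A minimal Go sketch of that lookup, under a few assumptions: pairs is a
tiny stand-in for the full tables, closingDelimiter and scanQuoted are
hypothetical names, symbol characters such as | are accepted alongside
punctuation, and nested or escaped delimiters are ignored.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// pairs is a deliberately tiny stand-in for the full tables.
var pairs = map[rune]rune{'(': ')', '[': ']', '{': '}', '<': '>', '«': '»'}

// closingDelimiter returns the character expected to end a string that
// was opened with open, or false if open cannot be a delimiter.
func closingDelimiter(open rune) (rune, bool) {
	// Paired openers close with their partner.
	if c, ok := pairs[open]; ok {
		return c, true
	}
	// Assumption: any other punctuation or symbol character closes itself.
	if unicode.IsPunct(open) || unicode.IsSymbol(open) {
		return open, true
	}
	return 0, false
}

// scanQuoted reads a literal whose first character is the opening
// delimiter. Nesting and escaping are not handled in this sketch.
func scanQuoted(s string) (string, bool) {
	open, size := utf8.DecodeRuneInString(s)
	if open == utf8.RuneError {
		return "", false
	}
	c, ok := closingDelimiter(open)
	if !ok {
		return "", false
	}
	end := strings.IndexRune(s[size:], c)
	if end < 0 {
		return "", false // unterminated literal
	}
	return s[size : size+end], true
}

func main() {
	for _, lit := range []string{"«foo»", "/a b/", "(x)"} {
		body, ok := scanQuoted(lit)
		fmt.Printf("%q %v\n", body, ok)
	}
}
```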

Raku is similar, except that it allows any Unicode character that isn't
designated as belonging to identifiers rather than just punctuation.

For speed you'll obviously prefer to do a single lookup into one hash
table, but for organizational purposes it's nicer to have them grouped
by Unicode category. This will help you update them when new characters
are added in the future.
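
One way to get both, sketched in Go: keep the per-category tables in the
source and merge them once at startup. The tables below are abbreviated
copies, and the names mergePairs and allPairs are made up for this
example.

```go
package main

import "fmt"

// Abbreviated copies of the per-category tables, for illustration only.
var (
	piPf = map[rune]rune{'«': '»', '“': '”'}
	psPf = map[rune]rune{'„': '”'}
	psPe = map[rune]rune{'(': ')', '[': ']', '{': '}'}
	smSm = map[rune]rune{'<': '>'}
)

// mergePairs copies every entry of the given tables into one map, so
// that finding a closing delimiter takes a single lookup.
func mergePairs(tables ...map[rune]rune) map[rune]rune {
	n := 0
	for _, t := range tables {
		n += len(t)
	}
	merged := make(map[rune]rune, n)
	for _, t := range tables {
		for open, c := range t {
			merged[open] = c
		}
	}
	return merged
}

// allPairs is built once, when the package is initialized.
var allPairs = mergePairs(piPf, psPf, psPe, smSm)

func main() {
	fmt.Printf("%c %c %d\n", allPairs['«'], allPairs['<'], len(allPairs))
}
```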
