bug-groff

[bug #62264] string iteration handles escape sequences inconsistently


From: G. Branden Robinson
Subject: [bug #62264] string iteration handles escape sequences inconsistently
Date: Thu, 7 Apr 2022 12:49:47 -0400 (EDT)

URL:
  <https://savannah.gnu.org/bugs/?62264>

                 Summary: string iteration handles escape sequences inconsistently
                 Project: GNU troff
            Submitted by: gbranden
            Submitted on: Thu 07 Apr 2022 04:49:45 PM UTC
                Category: Core
                Severity: 3 - Normal
              Item Group: Incorrect behaviour
                  Status: Postponed
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any
         Planned Release: None

    _______________________________________________________

Details:

So I'm trying to fix Savannah bug #62257 and I ran into good news and bad
news.

The bad news I already knew, which is why the bug got filed--you can't just
treat a groff string as a character array like you would a character string
that used a fixed-width character encoding (like ASCII, ISO 8859, UTF-32).

No, sometimes an index will land you in the middle of an escape sequence,
which means if you edit the string at that point you risk creating syntactical
nonsense.
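
To make the hazard concrete, here is a minimal Python sketch (not groff code; the `naive_truncate` helper is hypothetical and exists only for illustration). It cuts a string by character index, the way one might with a fixed-width encoding, and lands inside a special character escape sequence:

```python
# Hypothetical illustration: treat roff input as a plain character array.
def naive_truncate(text, limit):
    """Keep the first `limit` characters, ignoring roff syntax entirely."""
    return text[:limit]


title = r"foo\[hy]bar"          # contains the special character escape \[hy]
cut = naive_truncate(title, 6)  # index 6 falls inside \[hy]
print(cut)                      # foo\[h
```

The result ends in an unterminated `\[` escape, which is the kind of syntactical nonsense described above.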

The good news is that the strings I'm trying to edit (to abbreviate them to
keep them from overrunning parts of man page headers and footers) are
arguments to the man(7) `TH` macro, which tend to be really well behaved: we
abbreviate the page title, which will almost always be (nearly) pure ASCII due
to keyboard input practicalities, and the "extra2" argument, which by
preponderant tradition is the name and version of the project responsible for
the man page.  This _also_ tends to be a well-behaved string because it needs
to be easy to search for on the Web and the like.

So these strings confine themselves to ASCII except for our damnable old
friend, the hyphen-minus.  And that's not too rare, which seems bad (although
rarer than it should be thanks to the efforts of the ASCII Puritan Reactionary
Underground)...

*BUT*

The even better news is that GNU troff's string iterator _recognizes_ the \-
escape sequence and hands it back to you as an atomic unit!  It also does this
for \`, \', and \_, none of which we really need, but which may aid future
research.

I started writing a macro to scan a string for escape sequences so I could
bail out of the abbreviation process instead of corrupting an escape-happy
string.

Here are my findings.


$ cat EXPERIMENTS/string-contains-escape.man
.an*string-contains-escape foo\-bar
.an*string-contains-escape foo\`bar
.an*string-contains-escape foo\'bar
.an*string-contains-escape foo\_bar
.an*string-contains-escape foo\(hybar
.an*string-contains-escape foo\[hy]bar
.an*string-contains-escape caf\['e] bar
.an*string-contains-escape caf\[e aa] bar
$ ./build/test-groff -b -ww -Tutf8 -man EXPERIMENTS/string-contains-escape.man
GBR: an*string-contains-escape: 'foo\-bar'
GBR: string[0]='f'
GBR: string[1]='o'
GBR: string[2]='o'
GBR: string[3]='\-'
GBR: string[4]='b'
GBR: string[5]='a'
GBR: string[6]='r'
GBR: an*string-contains-escape: 'foo\`bar'
GBR: string[0]='f'
GBR: string[1]='o'
GBR: string[2]='o'
GBR: string[3]='\`'
GBR: string[4]='b'
GBR: string[5]='a'
GBR: string[6]='r'
GBR: an*string-contains-escape: 'foo\'bar'
GBR: string[0]='f'
GBR: string[1]='o'
GBR: string[2]='o'
GBR: string[3]='\''
GBR: string[4]='b'
GBR: string[5]='a'
GBR: string[6]='r'
GBR: an*string-contains-escape: 'foo\_bar'
GBR: string[0]='f'
GBR: string[1]='o'
GBR: string[2]='o'
GBR: string[3]='\_'
GBR: string[4]='b'
GBR: string[5]='a'
GBR: string[6]='r'
GBR: an*string-contains-escape: 'foo\(hybar'
GBR: string[0]='f'
GBR: string[1]='o'
GBR: string[2]='o'
GBR: string[3]='\'
GBR: string[4]='('
GBR: string[5]='h'
GBR: string[6]='y'
GBR: string[7]='b'
GBR: string[8]='a'
GBR: string[9]='r'
GBR: an*string-contains-escape: 'foo\[hy]bar'
GBR: string[0]='f'
GBR: string[1]='o'
GBR: string[2]='o'
GBR: string[3]='\'
GBR: string[4]='['
GBR: string[5]='h'
GBR: string[6]='y'
GBR: string[7]=']'
GBR: string[8]='b'
GBR: string[9]='a'
GBR: string[10]='r'
GBR: an*string-contains-escape: 'caf\['e] bar'
GBR: string[0]='c'
GBR: string[1]='a'
GBR: string[2]='f'
GBR: string[3]='\'
GBR: string[4]='['
GBR: string[5]='''
GBR: string[6]='e'
GBR: string[7]=']'
GBR: string[8]=' '
GBR: string[9]='b'
GBR: string[10]='a'
GBR: string[11]='r'
GBR: an*string-contains-escape: 'caf\[e aa] bar'
GBR: string[0]='c'
GBR: string[1]='a'
GBR: string[2]='f'
GBR: string[3]='\'
GBR: string[4]='['
GBR: string[5]='e'
GBR: string[6]=' '
GBR: string[7]='a'
GBR: string[8]='a'
GBR: string[9]=']'
GBR: string[10]=' '
GBR: string[11]='b'
GBR: string[12]='a'
GBR: string[13]='r'
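
The pattern in this output can be summarized with a small Python model. This is only a sketch of the behavior observed in the experiment, not GNU troff's implementation; the names `ATOMIC_ESCAPES` and `iterate_units` are hypothetical, and the set of atomic escapes is assumed to be just the four seen above.

```python
# Model of the behavior observed in the experiment; NOT groff's actual code.
# Assumption: only the four escapes tested above come back as single units.
ATOMIC_ESCAPES = {"\\-", "\\`", "\\'", "\\_"}


def iterate_units(text):
    """Split `text` into the units the experiment reported."""
    units = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ATOMIC_ESCAPES:
            units.append(pair)     # two-character escape returned whole
            i += 2
        else:
            units.append(text[i])  # everything else, one character at a time
            i += 1
    return units


for sample in (r"foo\-bar", r"foo\(hybar", r"caf\['e] bar"):
    for index, unit in enumerate(iterate_units(sample)):
        print(f"string[{index}]='{unit}'")
```

Under this model `\(hy` and `\[hy]` fall apart into individual characters, beginning with a lone backslash, which matches the inconsistency reported here.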


This will make completion of the fix for bug #62257 straightforward as long as
I can use the output comparison operator.
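
One way to picture the guard this enables: once iteration yields units, an abbreviation routine can refuse to touch any string in which a bare backslash turns up as its own unit. The Python sketch below is a rough illustration under that assumption. The `abbreviate` helper, the hand-written unit lists, and the trailing `...` are all hypothetical and do not reflect how the man macros actually abbreviate.

```python
def abbreviate(units, limit):
    """Shorten a list of string units to at most `limit` units.

    Returns None (bail out) when a lone backslash unit is present, since
    cutting near it could split an escape sequence.
    """
    if "\\" in units:
        return None
    if len(units) <= limit:
        return "".join(units)
    ellipsis = "..."
    keep = max(limit - len(ellipsis), 0)  # count each dot as one unit
    return "".join(units[:keep]) + ellipsis


safe = ["f", "o", "o", "\\-", "b", "a", "r"]  # \- arrived as one unit
risky = ["f", "o", "o", "\\", "(", "h", "y", "b", "a", "r"]
print(abbreviate(safe, 6))   # foo...
print(abbreviate(risky, 6))  # None
```

Because `\-` is a single unit, a cut can fall before or after it but never through it; the string containing `\(hy` is left alone.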

I'm filing this bug because the behavior was so surprising and as far as I
know it's not documented anywhere.  It feels squicky that _some_ escape
sequences get extracted atomically, though I can imagine why that's true (the
simple cases currently handled don't require recursive interpolation).

The inconsistency should either be fixed or documented.  Even to be
documented, it needs to be better understood.

I don't remotely want to tackle this before groff 1.23 is released.




    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?62264>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



