bug-groff
various PuTTY/groff/man/less bugs


From: Mark Whitis
Subject: various PuTTY/groff/man/less bugs
Date: Fri, 18 Jun 2004 09:56:07 -0400 (EDT)

This report discusses problems which were traced to bugs in no fewer
than four separate programs: PuTTY, man, groff, and less.  Version
information and the assignment of bugs to specific programs are near
the bottom of this email.  Some of these bugs went away with a version
upgrade; some did not.

I had the problem discussed here:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&safe=off&threadm=CMwB9.19759%24%25m4.6633%40rwcrnsc52.ops.asp.att.net&rnum=3&prev=/groups%3Fq%3Dman%2BPuTTY%26hl%3Den%26lr%3D%26ie%3DUTF-8%26safe%3Doff%26selm%3DCMwB9.19759%2524%2525m4.6633%2540rwcrnsc52.ops.asp.att.net%26rnum%3D3
And another related problem.

Specifically, when running "man", dashes in the output are displayed as
an accented "a^" (latin1/iso-8859-1 character E2) (or sometimes E2
followed by additional characters).   And running "nroff -man" resulted

Partial Fix (works for most dashes in most man pages):
   - Change PuTTY translation setting to UTF-8
   - Apply
   - rerun the command.
Workaround for remaining issues:
   gunzip </usr/share/man/man1/perlrun.1.gz | nroff -man -Tascii -c | less -isR
   gunzip </usr/share/man/man1/perlrun.1.gz | nroff -man -Tascii | less -isR

One of the confusing things that makes this fix non-obvious is that when 
you change the setting from ISO-8859-1 (latin1) to UTF-8, the display 
changes and the funny characters are replaced with empty boxes.  This gives 
the user the impression that this setting already took effect and didn't help.

perlrun is a man page that illustrates mixed use of dashes.  UTF-8 fixes
the synopsis, but the first paragraph of the description and some code
samples use minus signs instead of dashes.

Also "export CHARSET=latin1" does not help.  Setting "LESSCHARSET"
to "latin1" or "iso-8859-1" changes from a bold E2 character to a
non bold E2 character.
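
For a terminal that stays in latin1 mode, one possible fallback is to fold
the UTF-8 dash sequences back to a plain ASCII "-" before display.  The
following is only a sketch of that idea, not something man, groff, or less
does; the function name ascii_dashes is made up for illustration, and it
handles only the four dash sequences listed later in this message
(E2 80 90, E2 80 93, E2 80 94, E2 88 92).

```c
#include <stddef.h>

/* Hypothetical helper: copy src to dst, replacing the UTF-8 sequences for
 * HYPHEN (E2 80 90), EN DASH (E2 80 93), EM DASH (E2 80 94) and
 * MINUS SIGN (E2 88 92) with a single ASCII '-'.
 * dst must have room for at least strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the terminating NUL. */
size_t ascii_dashes(const char *src, char *dst)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    while (*s) {
        /* s[1] is only read when s[0] is nonzero, and s[2] only when
         * s[1] matched, so we never read past the terminating NUL. */
        if (s[0] == 0xE2 &&
            ((s[1] == 0x80 &&
              (s[2] == 0x90 || s[2] == 0x93 || s[2] == 0x94)) ||
             (s[1] == 0x88 && s[2] == 0x92))) {
            dst[n++] = '-';
            s += 3;
            continue;
        }
        dst[n++] = (char)*s++;
    }
    dst[n] = '\0';
    return n;
}
```

Mapping an em dash to a single "-" is a simplification; a real filter might
prefer "--".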

Note, however, that "nroff -man" is still broken in another way (before
or after changing the PuTTY setting).  Bold words get displayed as
<ESC>[1mbold<ESC>[0m and there are lots of <ESC>[22m and <ESC>[24m
sequences (ANSI color settings).  "nroff -man -a" dies.  "nroff -man -Tascii"
doesn't help (actually, it fixes the dash problem, but you don't notice
that due to the ANSI color problem).  "nroff -man -c" eliminates ANSI
color but not the inappropriate use of UTF-8 for various types of dashes.
The combination "nroff -man -Tascii -c" fixes things.
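
To make the escape sequence problem concrete, here is a rough sketch of a
filter that drops sequences of the form <ESC>[...m from a string.  It is
not how nroff or less handle them; the function name strip_sgr is
hypothetical, and the sketch assumes the parameters are only digits and
semicolons.

```c
#include <stddef.h>

/* Hypothetical helper: copy src to dst, dropping ANSI sequences of the
 * form ESC [ <digits and semicolons> m.  Anything else is copied as-is.
 * dst must have room for at least strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the terminating NUL. */
size_t strip_sgr(const char *src, char *dst)
{
    size_t n = 0;

    while (*src) {
        if (src[0] == '\033' && src[1] == '[') {
            const char *p = src + 2;

            /* Skip the numeric parameters, e.g. "1", "22" or "0;1". */
            while ((*p >= '0' && *p <= '9') || *p == ';')
                p++;
            if (*p == 'm') {        /* complete sequence: skip it */
                src = p + 1;
                continue;
            }
        }
        dst[n++] = *src++;          /* ordinary byte: keep it */
    }
    dst[n] = '\0';
    return n;
}
```

Given the input <ESC>[1mbold<ESC>[0m, this leaves just "bold" in dst.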

What is frustrating is that the man command doesn't
seem to have "ascii flavored ascii, damnit!" options.
And setting "export TERM=putty" and/or "export CHARSET=latin1"
doesn't work.  And of course, you can't read the
man pages for man and groff.  And piping the output to "od -a"
actually shows dashes as "b  bs  dc2" because "od -a" ignores the
high order bit; using "od -t x1 -t a" or better yet "hexdump -C"
shows it as "e2 88 92".  The escape sequence problems with
groff persist even if you set TERM to vt220, vt100, ansi77,
glasstty, or vanilla.  However, it looks like the problem is "less"
and not nroff; if you replace "less" with "more", the problem
goes away.   Using "less -R" fixes this problem by telling less
not to protect the terminal from annoying or malicious escape sequences 
by converting control characters and escape sequences to readable 
equivalents.  Also note that less looks at "JLESSCHARSET" or "LESSCHARSET" 
instead of "CHARSET".
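
The "b  bs  dc2" output can be reproduced by hand: clearing the high order
bit of E2, 88 and 92 gives 62, 08 and 12.  The sketch below mimics that
naming step; it is an illustration, not od's actual source, and the names
od_a_name and ctrl_names are made up.

```c
/* Names used for the ASCII control characters 0x00..0x1f. */
static const char *const ctrl_names[32] = {
    "nul", "soh", "stx", "etx", "eot", "enq", "ack", "bel",
    "bs",  "ht",  "nl",  "vt",  "ff",  "cr",  "so",  "si",
    "dle", "dc1", "dc2", "dc3", "dc4", "nak", "syn", "etb",
    "can", "em",  "sub", "esc", "fs",  "gs",  "rs",  "us"
};

/* Hypothetical helper: return the name shown for one byte after the
 * high order bit has been dropped.  buf (at least 2 bytes) is used to
 * hold the result for printable characters. */
const char *od_a_name(unsigned char byte, char buf[2])
{
    unsigned char c = byte & 0x7f;   /* ignore the high order bit */

    if (c < 0x20)
        return ctrl_names[c];
    if (c == 0x20)
        return "sp";
    if (c == 0x7f)
        return "del";
    buf[0] = (char)c;
    buf[1] = '\0';
    return buf;
}
```

Calling it on 0xE2, 0x88 and 0x92 gives "b", "bs" and "dc2", which is why
the minus sign is unrecognizable in that output.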
 
"man less" has its own problems involving the character sequence
"e2 80 90" displayed as a hollow box instead of a dash.  I think
we have an issue between "endash" and "emdash" and PuTTY displays
one correctly and not the other.  It seems that the manpage nroff
files for some man pages use "-" (endash, like less) for a dash and most
use "\-" (minus sign).  There is also "--" which gets converted to 
"emdash".

http://homepages.comnet.co.nz/~r-mahoney/bca_text/utf8.html gives us
the UTF-8 values of the various forms of dashes:

   Description: HYPHEN     <E2><80><90> [ ‐ ]
   Description: EN DASH    <E2><80><93> [ – ]
   Description: EM DASH    <E2><80><94> [ — ]
   Description: MINUS SIGN <E2><88><92> [ − ]
   Description: LEFT-POINTING ANGLE BRACKET   <E2><8C><A9> [ 〈 ]
   Description: RIGHT-POINTING ANGLE BRACKET  <E2><8C><AA> [ 〉 ]
   Description: LEFT SINGLE QUOTATION MARK    <E2><80><98> [ ‘ ]
   Description: RIGHT SINGLE QUOTATION MARK   <E2><80><99> [ ’ ]
   Description: SINGLE LOW-9 QUOTATION MARK   <E2><80><9A> [ ‚ ]
   Description: LEFT DOUBLE QUOTATION MARK    <E2><80><9C> [ “ ]
   Description: RIGHT DOUBLE QUOTATION MARK   <E2><80><9D> [ ” ]
   Description: DOUBLE LOW-9 QUOTATION MARK   <E2><80><9E> [ „ ]
Download this file to check less or PuTTY:
http://homepages.comnet.co.nz/~r-mahoney/bca_text/utf8.txt
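
The byte values in the table follow from the standard UTF-8 encoding of
the characters' code points.  The code points themselves (U+2010 HYPHEN,
U+2212 MINUS SIGN, U+2329 LEFT-POINTING ANGLE BRACKET) are not given in
the table above; they are added here for illustration.  A minimal sketch
of the encoding for code points up to U+FFFF, with a made-up function
name:

```c
#include <stddef.h>

/* Hypothetical helper: encode code point cp (U+0000..U+FFFF, excluding
 * the surrogate range) as UTF-8 into out, which must hold 3 bytes.
 * Returns the number of bytes written, or 0 if cp is not supported. */
size_t utf8_encode_bmp(unsigned long cp, unsigned char out[3])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp <= 0xFFFF) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;                              /* beyond U+FFFF: not handled */
}
```

Every character in the table lies between U+2000 and U+2FFF, which is why
all of the sequences start with E2.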

Users expect to be able to pipe output of commands to less and have it
work.  While it is good that less filters nasty characters (transmit
25th line being a security hole), UTF-8 and
ANSI sequences are common in the output of programs.  

List of common environment variables: 
http://www.opengroup.org/onlinepubs/007908799/xbd/envvar.html
http://www.wlug.org.nz/EnvironmentVariable

After upgrading to latest redhat 9 versions:
Get:1 http://ayo.freshrpms.net redhat/9/i386/os groff 1.18.1-20 [1891kB]
Get:2 http://ayo.freshrpms.net redhat/9/i386/os less 378-7 [101kB]
Get:3 http://ayo.freshrpms.net redhat/9/i386/os man 1.5k-6 [90.9kB]
The less man page now displays OK.  man and nroff both default to no color
(backspaces used for bold).  The perlrun page still has a problem with E28099
(right single quote) in the synopsis section (no Unicode in the raw man page)
and in code examples if the terminal is in ISO-8859-1 mode, but works OK in
UTF-8 mode.  less still chokes on color sequences without -R (try "ls
--color | less"), but it passes UTF-8 unmangled by default.
The new groff man page has problems with E28CA9 and E28CAA around URLs with
the terminal in UTF-8 mode.
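
"Backspaces used for bold" means each bold character is emitted as the
character, a backspace, and the character again (underlining uses "_",
backspace, character).  As a rough illustration of what a pager or filter
has to undo, the sketch below removes such overstrikes.  The function name
is hypothetical, and it assumes single-byte characters, so it would not
handle overstruck UTF-8 sequences correctly.

```c
#include <stddef.h>

/* Hypothetical helper: copy src to dst, resolving backspace overstrikes
 * by keeping only the last character written at each position.
 * Assumes one byte per character.
 * dst must have room for at least strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the terminating NUL. */
size_t strip_overstrike(const char *src, char *dst)
{
    size_t n = 0;

    for (; *src; src++) {
        if (*src == '\b') {
            if (n > 0)
                n--;            /* drop the character being overstruck */
        } else {
            dst[n++] = *src;
        }
    }
    dst[n] = '\0';
    return n;
}
```

A bold "bold" followed by an underlined "u" comes out as plain "bold u".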

PuTTY emacs bug: PuTTY sometimes gets confused about the cursor position
within a line while editing a file in C mode, which causes one to trash the
line being edited.  Meanwhile, pico seems to be OK.  Try creating the
file below by copying one printf repeatedly and then selectively editing
the values; that should be enough to reproduce the problem.

/* Crude test program: print each UTF-8 byte sequence next to its hex label */
#include <stdio.h>

int main(void)
{
  /* dashes, minus sign, right single quote */
  printf("E28090:%c%c%c\n",0xE2,0x80,0x90);
  printf("E28093:%c%c%c\n",0xE2,0x80,0x93);
  printf("E28094:%c%c%c\n",0xE2,0x80,0x94);
  printf("E28892:%c%c%c\n",0xE2,0x88,0x92);
  printf("E28099:%c%c%c\n",0xE2,0x80,0x99);

  /* angle brackets */
  printf("E28CA9:%c%c%c\n",0xE2,0x8C,0xA9);
  printf("E28CAA:%c%c%c\n",0xE2,0x8C,0xAA);

  printf("E28C98:%c%c%c\n",0xE2,0x8C,0x98);
  printf("E28C99:%c%c%c\n",0xE2,0x8C,0x99);
  printf("E28C9A:%c%c%c\n",0xE2,0x8C,0x9A);
  printf("E28C9C:%c%c%c\n",0xE2,0x8C,0x9C);
  printf("E28C9D:%c%c%c\n",0xE2,0x8C,0x9D);
  printf("E28C9E:%c%c%c\n",0xE2,0x8C,0x9E);

  return 0;
}

Bugs:
   less  - honor CHARSET [still broken]
           man page uses "-" instead of "\-" 
              which isn't consistent with most man pages
           Allow safe UTF-8 characters and ANSI color
              sequences without -R, unless explicitly turned off,
              if terminal is ANSI flavored (TERM) and charset
              is UTF.  [UTF-8 fixed, color still broken]
           possibly convert UTF-8 dash characters if CHARSET or
              LESSCHARSET says latin1
           update FAQ
           E28CA9 kills newline immediately following [Not tested earlier]
            
   groff - 
           add terminal issues to nonexistent documentation
                troubleshooting section
           -Tascii does not suppress ANSI color.
              If this is intentional, need to mention "-c"
              [color now defaults off]
           "-c disable color output" should mention ANSI escape
               sequences, gibberish, etc.
           "nroff -a" dies [still broken]
           [now that color defaults off, there doesn't appear to be a way
           to turn on]   

   PuTTY - Handle E28090, E28099, E28CA9, E28CAA, E28C98, E28C99
           Potential future gotchas:
             E28C9A, E28C9C, E28C9D, E28C9E more quote characters
             The E2808x space characters
             E28096 double vertical line
             E280A5 ..
             E280B4 Triple prime ''', E280B5 reversed prime `
             E28183 Hyphen Bullet -
             E28897 asterisk operator * 
             E28898 ring operator (bullet)
             E288A3 divides |
             E288BC Tilde ~  E288BD reversed tilde ~
             E28994 Colon equals :=  E28995 Equals colon =:
             E289AA Much less than << E289AB Much greater than >>
             E28B86 Star operator *
             E28B98 Very Much less than <<< E28B99 very much greater >>>
             These characters are important because typographic fanatics
             may substitute them for standard ASCII chars.
            Less important
             Some of the CF9x greek characters
             E2839B/E2839C (web browser doesn't get those either)
             E28181 Caret Insertion point (unusual)
             E284xx E285xx E286xx E287xx E288xx E289xx E28Axx various chars
             etc.
             E29980 female sign displays wrong char (down arrow) 
           update FAQ
           emacs bug

   man   - "ASCII flavored ASCII, damnit!" option needed
           "groff -Tascii -c"  [Now defaults -c but still need -Tascii option]

Linux distribution: redhat 8.0 
groff version: 1.18
Less version: 358+iso254
man version 1.5j
Updated versions: 
   Get:1 http://ayo.freshrpms.net redhat/9/i386/os groff 1.18.1-20 [1891kB]
   Get:2 http://ayo.freshrpms.net redhat/9/i386/os less 378-7 [101kB]
   Get:3 http://ayo.freshrpms.net redhat/9/i386/os man 1.5k-6 [90.9kB]
PuTTY version: 0.53b, Windows XP
Fonts:
  Courier New 10pt       E28090 = hollow box
  Courier New 8pt        E28090 = hollow box
  Fixed Sys 10pt         E28090 = solid box
  Lucida Console 10pt    E28090 = thin hollow box
  Terminal 10pt          E28090 = invisible
  WP Multinational       everything is gibberish
  Courier 10 pt          E28090 = thin solid box   
CHARSET environment variable: unset, various values above
TERM environment variable: xterm except as noted above
LANG environment variable: en_US.UTF-8

Text from this message may be incorporated into documentation/FAQ/bug 
reports, etc.


--
Mark Whitis   http://www.freelabs.com/~whitis/       NO SPAM
Author of many open source software packages.  
Coauthor: Linux Programming Unleashed (1st Edition)




