Re: Ispell and unibyte characters

emacs-devel

From:	Agustin Martin
Subject:	Re: Ispell and unibyte characters
Date:	Fri, 13 Apr 2012 17:25:25 +0200
User-agent:	Mutt/1.5.21 (2010-09-15)

On Thu, Apr 12, 2012 at 10:01:30PM +0300, Eli Zaretskii wrote:
> I wrote:
> > I am still dealing with an open issue here. Some languages have non 7bit
> > wordchars, like Catalan middledot, and it should be converted to UTF-8 if
> > default communication language is changed to UTF-8.
> 
> Sorry, I don't understand: do you mean "non 8-bit wordchars"?  I don't
> think 7 bits is assumed anywhere.

I mean wordchars that cannot be represented in a 7-bit encoding, like the
Catalan middledot (available in 8-bit latin1).

> Assuming you did mean 8-bit, then why not use UTF-8 for Catalan from
> the get-go?  Only some languages can use single-byte encodings, and
> evidently Catalan is not one of them.  For that matter, why shouldn't
> aspell and hunspell use UTF-8 by default (something I already asked)?

[...]

> I don't understand what are you trying to accomplish by encoding
> OTHERCHARS in UTF-8.  What exactly is the problem with them being
> encoded in some 8-bit encoding?  Please explain.

Imagine a fake entry in the general list, either in ispell.el or provided
through `ispell-base-dicts-override-alist' (no accented chars for simplicity):

("catala8"
     "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1)

Unless Emacs knows the encoding for \267 (middledot "·") it cannot decode it
properly. I prefer not to use UTF-8 here, because I want the entry to also be
useful for ispell (and also be XEmacs compatible). The best approach here
seems to be to decode the otherchars regexp according to the provided
coding-system.

I have noticed that there seems to be no need to encode the resulting string
in UTF-8; Emacs will know what to do with the decoded string.
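
As a minimal sketch of the decoding step described above: the octal escape
\267 in a Lisp string literal is a raw byte, so the otherchars string is
unibyte until it is decoded with the dictionary's coding system. The helper
name `my-decode-otherchars' is hypothetical and only used for illustration;
`decode-coding-string' and `multibyte-string-p' are standard Emacs functions.

```elisp
;; Hypothetical helper (not part of ispell.el): decode a unibyte
;; otherchars string with the coding system given in the dictionary entry.
(defun my-decode-otherchars (otherchars coding-system)
  "Decode unibyte OTHERCHARS using CODING-SYSTEM."
  (decode-coding-string otherchars coding-system))

(let* ((raw "['\267-]")                 ; unibyte: \267 is a raw byte
       (decoded (my-decode-otherchars raw 'iso-8859-1)))
  (list (multibyte-string-p raw)        ; is the literal multibyte?
        (multibyte-string-p decoded)    ; is the decoded string multibyte?
        (aref decoded 2)))              ; character code at index 2
```

After decoding, index 2 of the string holds the character U+00B7 (MIDDLE DOT),
so no further encoding to UTF-8 should be needed for use as a regexp.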

I tested something like

 (dolist (adict ispell-dictionary-alist)
   (add-to-list 'tmp-dicts-alist
                (list
                 (nth 0 adict)  ; dict name
                 "[[:alpha:]]"  ; casechars
                 "[^[:alpha:]]" ; not-casechars
                 (if ispell-encoding8-command
                     ;; Decode 8bit otherchars if needed
                     (decode-coding-string (nth 3 adict) (nth 7 adict))
                   (nth 3 adict)) ; otherchars
                 (nth 4 adict)  ; many-otherchars-p
                 (nth 5 adict)  ; ispell-args
                 (nth 6 adict)  ; extended-character-mode
                 (if ispell-encoding8-command
                     'utf-8
                   (nth 7 adict)))))

and it seems to work well.
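
To try the transformation on a single entry without touching
`ispell-dictionary-alist', the body of the loop can be sketched as a
standalone function. The name `my-ispell-entry-to-utf8' and its UTF8-P
argument (standing in for `ispell-encoding8-command') are assumptions made for
this example, not existing ispell.el code.

```elisp
;; Hypothetical helper: same field handling as the loop above,
;; applied to one dictionary entry.
(defun my-ispell-entry-to-utf8 (entry utf8-p)
  "Return a copy of dictionary ENTRY with [:alpha:] based casechars.
When UTF8-P is non-nil, decode otherchars with the entry's coding
system and set the coding system to `utf-8'.  When UTF8-P is nil,
otherchars and the coding system are kept as they are."
  (list (nth 0 entry)                   ; dict name
        "[[:alpha:]]"                   ; casechars
        "[^[:alpha:]]"                  ; not-casechars
        (if utf8-p
            (decode-coding-string (nth 3 entry) (nth 7 entry))
          (nth 3 entry))                ; otherchars
        (nth 4 entry)                   ; many-otherchars-p
        (nth 5 entry)                   ; ispell-args
        (nth 6 entry)                   ; extended-character-mode
        (if utf8-p 'utf-8 (nth 7 entry))))

;; Applied to the fake "catala8" entry from this message.
(my-ispell-entry-to-utf8
 '("catala8" "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil
   ("-B" "-d" "catalan") nil iso-8859-1)
 t)
```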

> I wrote:
> > but get a sgml-lexical-context error. Need to look more carefully, so this
> > will take longer.

I have tested further and this seems to be an unrelated problem. Some time
ago I already noticed some problems with flyspell.el and sgml mode (in
particular psgml), namely the sgml-lexical-context error

sgml-lexical-context: Wrong type argument: stringp, nil

sometimes when running flyspell-buffer after enabling flyspell-mode. I am
also seeing something like

Error in post-command-hook (flyspell-post-command-hook):
(wrong-type-argument stringp nil)

when enabling flyspell-mode from the beginning of my sgml buffer. I cannot
reproduce this with emacs -Q and am still trying to find where it comes from.
Both problems were tested with emacs-snapshot_20120410.

For Debian I do not use sgml-lexical-context, but an improved version of the
old regexp, to try to keep things compatible with XEmacs. This seems to work
well and has some advantages over sgml-lexical-context:

1) It is compatible with XEmacs.
2) It is twice as fast as sgml-lexical-context when using flyspell-buffer.
3) It does not trigger the above error.

I am considering using this improved regexp instead of sgml-lexical-context
for the above reasons, but this is another issue.

-- 
Agustin
