Re: [Aspell-user] Aspell removes some Polish characters if language is


From: Przemysław 'Przemoc' Pawełczyk
Subject: Re: [Aspell-user] Aspell removes some Polish characters if language is non-PL
Date: Tue, 20 Jan 2009 01:07:26 +0100

>> So the problem is with the lossy conversion from e.g. UTF8 to some
>> 8-bit character set.
>
>> In my opinion aspell should detect words with
>> unsupported characters (in current language) and store/print them
>> without any conversion.
>> Otherwise some languages might look privileged in some dictionaries
>> (vide German in English dictionaries). Control over "language
>> privileges" is not in the user's hands, and that is not a good solution.
>
> I do not fully understand what you are saying, however to me it makes little
> sense to store foreign words in a dictionary, for example German in an
> English dictionary.   The only exception might be foreign names, but I don't
> want to get into that.

All along I have been talking about national characters being removed
by aspell selectively (and without any user control) when a different
language is set (e.g. Polish characters with an English dictionary).
The English dictionary doesn't contain Polish words, so 'aspell list'
mustn't break them.

> Now the problem you are having is that Aspell is not recognizing foreign
> characters as part of the word.  This is because it assumes any characters
> it does not know about in the current language (i.e. not in the 8-bit
> character set for the language) are not part of a word.  To fix this it will
> be necessary to recreate the dictionaries from source replacing the current
> character set with a special expanded one which includes all characters in
> the Latin script.

OK.
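
To see which characters an 8-bit set cannot hold, here is a rough check
with iconv. It is only an illustration, not what aspell does internally:
in_charset is a name I made up, and I assume an iconv (like the glibc one)
that exits with an error when a character has no counterpart in the
target set.

```shell
#!/bin/sh
# in_charset CHARSET WORD
# Succeeds if every character of WORD (given as UTF-8) exists in CHARSET.
# Hypothetical helper; relies on iconv failing for unrepresentable
# characters (glibc behaviour, other implementations may differ).
in_charset() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" >/dev/null 2>&1
}

# Report, for each sample word, whether it survives each 8-bit set.
for word in 'äöüß' 'ąćęłńóśźż' 'ó'; do
    for cs in ISO-8859-1 ISO-8859-2; do
        if in_charset "$cs" "$word"; then
            echo "$word: fits in $cs"
        else
            echo "$word: does not fit in $cs"
        fi
    done
done
```

Note that ó is part of ISO-8859-1 while the other Polish letters are not,
which would fit with only some of them being affected.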

> For the English language do this.  Download and unpack the English
> dictionary from:
>  ftp://ftp.gnu.org/gnu/aspell/dict/en/aspell6-en-6.0-0.tar.bz2
> and get aspell-lang from cvs using:
>  cvs -z3 -d:pserver:address@hidden:/sources/aspell co aspell-lang
>
> Go into the "aspell-lang" directory and create the expanded character set
> using:
>  ./mkchardata maps/iso-8859-1-u.txt
>
> Now copy some files from aspell-lang to aspell6-en-6.0-0
>  cp aspell-lang/maps/iso-8859-1-u.cset aspell6-en-6.0-0
>  cp aspell-lang/maps/iso-8859-1-u.cmap aspell6-en-6.0-0
>  cp -p aspell-lang/proc aspell6-en-6.0-0
>
> Now go into "aspell6-en-6.0-0".
>
> Edit the file "en.dat" and change "iso8859-1" to "iso8859-1-u".  Also edit
> en_affix.dat and change "ISO8859-1" to "ISO8859-1-U".
>
> In "info" add the lines:
>  data-file iso-8859-1-u.cset
>  data-file iso-8859-1-u.cmap
> (doesn't really matter where)
>
> Now regenerate the other files:
>  ./proc
>
> And finally build the dictionary:
>  ./configure
>  make
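
The three manual edits above (en.dat, en_affix.dat, info) can be scripted
roughly as follows. This is just a sketch: subst_charset and patch_charset
are names I made up, not part of aspell or aspell-lang, and it assumes the
charset names appear in those files exactly as quoted above.

```shell
#!/bin/sh
# subst_charset FILE OLD NEW
# Replace OLD with NEW in FILE. NEW is first folded back to OLD so that
# running the script twice does not produce "...-u-u".
subst_charset() {
    sed -e "s/$3/$2/g" -e "s/$2/$3/g" "$1" > "$1.new" && mv "$1.new" "$1"
}

# patch_charset DIR
# Apply the edits to an unpacked dictionary directory (hypothetical helper).
patch_charset() {
    subst_charset "$1/en.dat" iso8859-1 iso8859-1-u || return 1
    subst_charset "$1/en_affix.dat" ISO8859-1 ISO8859-1-U || return 1
    # Add the data-file lines to "info" unless they are already there.
    for f in iso-8859-1-u.cset iso-8859-1-u.cmap; do
        grep -qxF "data-file $f" "$1/info" || echo "data-file $f" >> "$1/info"
    done
}
```

Usage would be "patch_charset aspell6-en-6.0-0", followed by ./proc,
./configure and make inside that directory as described.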

Yes! Now it works as expected:

$ echo 'äöüß ÄÖÜ ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ' | aspell list --encoding=utf-8 -d ./en
äöüß
ÄÖÜ
ąćęłńóśźż
ĄĆĘŁŃÓŚŹŻ

Polish characters can be found in the ISO-8859-2 character set. It was
widely used in pre-UTF-8 days (and still is in many places).
Would changing ISO-8859-1 to ISO-8859-2 instead of ISO-8859-1-U also
work? Or are there other dependencies behind the scenes?

> And maybe install it:
>  make install

I don't like messing with existing packages, so... I won't install it. :)

> For other languages do a similar thing.
>
> For more info on the expanded character set see "B.1.1 Notes on Latin
> Languages" in the manual (http://aspell.net/man-html/Supported.html) and the
> README in aspell-lang.

Thank you very much for your help. I strongly encourage you to also put
this solution in the bug tracker.

Regards.

-- 
Przemysław 'Przemoc' Pawełczyk
http://przemoc.net/



