[aspell-announce] Official Non-English Word Lists Packages Now Available

From:	Kevin Atkinson
Subject:	[aspell-announce] Official Non-English Word Lists Packages Now Available
Date:	Fri, 15 Jun 2001 06:08:01 -0400 (EDT)
[Please redistribute this announcement as you see fit]

In an effort to make installing word lists for non-English languages
in Aspell straightforward I have decided to release foreign language
dictionaries in a standard format.

A preliminary version of my efforts is currently available at the
Aspell home page (http://aspell.sourceforge.net).  There you will find
support for the following languages: Breton (br), Catalan (ca), Czech
(cs), Danish (da), Dutch (nl), Esperanto (eo), Faroese (fo), French
(fr), German (de), Italian (it), Norwegian (no), Polish (pl), Russian
(ru), Spanish (es), Swedish (sv).

Please check them out and let me know what you think, but please keep
in mind that these are preliminary and subject to change.  These
packages contain everything needed to add support for a given language
to Aspell.  These packages will also install the necessary files so that
the word list will be correctly recognized by Pspell -- something that
is often not handled correctly by word list authors' Aspell packages.

Support for variants in languages (such as American, British,
Canadian, Swiss German, etc.) is not yet available as I am currently
not sure of the best way to do this.  However, support *will* be
available by the next version of Pspell and Aspell as I am also
planning on packaging the English dictionaries this way and
distributing them separately.

Authors of the word lists used to create these packages are encouraged
to check them out to make sure I did things correctly.  In the near
future I will allow you (the word list author) to maintain the
packages yourself, but for right now I want to maintain tight control
over them as the format is not quite finalized yet.

Also, I am especially interested in feedback from Package Maintainers
(RPM, Debian, etc.) as the Makefile is rather primitive and probably
does not install things correctly or support the options maintainers
need to make packaging simple.  Suggestions are more than
welcome, but patches to the proc script (the script which does all the
real work) are even more welcome.  I am also strongly interested in what
you think about the layout of the dictionary files and language names.


                  Technical Notes for Word List authors
                        and Package Maintainers

In order to make things as straightforward, portable, and uniform as
possible, I have decided to enforce the following rules.

1) All language names are now the two letter ISO code (en, da, etc.).
2) The actual dictionary files must all start with the two letter
code and end in .rws, and their names may only contain ASCII characters.
3) Aliases are created using Aspell's multi files and not symbolic links.

These rules are very different from the current way dictionaries are
handled, but I feel they will make life easier for everyone.
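
Rules 1 and 2 lend themselves to a mechanical check.  What follows is
only a rough sketch in Python and is not part of the Aspell dicts
package.  The name is_valid_dict_filename is made up for illustration,
and treating a "two letter code" as two lowercase ASCII letters is an
assumption.

```python
import re

# Assumption: a "two letter code" means two lowercase ASCII letters.
CODE_RE = re.compile(r"[a-z]{2}")


def is_valid_dict_filename(filename, code):
    """Check a dictionary file name against rules 1 and 2 (hypothetical helper)."""
    if CODE_RE.fullmatch(code) is None:
        return False  # rule 1: language names are two letter codes
    if any(ord(ch) > 127 for ch in filename):
        return False  # rule 2: ASCII characters only
    # rule 2: the name starts with the code and ends in .rws
    return filename.startswith(code) and filename.endswith(".rws")
```

A package maintainer could run a check like this over the installed
files before building a package.  Aliases are deliberately not covered,
since they are created through multi files and follow different rules.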

The reason for the first rule is that in the past the Aspell
language names were a mixture of the name spelled out in English and
the name spelled out in the native language, and in some cases involved
non-ASCII characters, which was just asking for trouble on non-Unix
like platforms and probably some older Unix ones.  Some people went as
far as doing it both ways by symbolically linking one language data
file to the other.  This amazingly worked, but it is a complete abuse
of how language names and data files are meant to be used.  Finally,
others thought that language variants (American, Swiss German, etc.)
should be considered separate languages and either attempted to specify
them as a language at the command line, for example trying "aspell
--lang=canadian ...", or created separate data files for them.  All of
this did no good but confuse people, so I wanted to formalize this.
I was originally planning on using the language name spelled out in
ASCII characters but realized that in many cases I didn't know what
this should be, so I decided to go with the universally known language
codes.

The second rule is there so that it is clear which word lists belong
to which languages.  I require them to be all ASCII characters for
maximum portability.  However, the end user is not expected to use
these word lists directly.  Instead they are expected to use one of
the aliases created via the .multi file.  These aliases can be anything
whatsoever and may include non-ASCII characters.  Symbolic links
are not used as they are Unix specific and not supported by Win32.
Non-ASCII characters are okay for aliases as they can simply not be
installed on platforms which don't support them.


                  Draft Documentation on the layout of
                        Aspell dicts packages

The overall goal of Aspell dicts is to provide a uniform method to
distribute dictionaries for Aspell for any language that Aspell
supports.

This documentation is still in an early stage and rather incomplete.
It is meant to give you enough of an overview so you know what is
going on, but probably won't be enough information for you to actually
create a distribution.

Layout of the Distribution:

An Aspell Word List Package contains several types of files, many of
them generated by the proc script.  These must be provided:

info: the main file which contains all of the important information
*.cwl: compressed word list files
Copyright: the copyright notice

And these are automatically generated/provided for you

configure: the configure script which finds the appropriate paths
and generates the actual makefile.
??.dat: the data file for the language.
*.multi: the dictionary files
Makefile.pre: the makefile which configure uses.

And finally several optional ones.

??_phonet.dat: The optional phonet data file
README: A readme file.  If one is not provided a generic one will be
created
COPYING: The actual license agreement.  Automatically provided for some
licenses
doc/*: additional documentation

*** Format of the Info File

(Note: For a better idea of how this file is laid out see some of the
sample info files included)

The info file is the main file which contains most of the information.
It has two types of entries: single value settings and group
settings.  Single value settings have the form:
  <key> <value>
and group settings have the form:
  <group key>:
    <key> <value>
    <key> <value>
    ...
If there is ANY whitespace before a key it is assumed to belong to a
group entry.
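
To make the two entry forms concrete, here is a small Python sketch of
a reader for this layout.  It is not the proc script's actual parser
(that one is written in Perl), and the function name parse_info, the
handling of blank lines, and the sample values are all assumptions made
for illustration.

```python
def parse_info(text):
    """Split info-file text into single value settings and group entries."""
    settings = {}   # single value settings: key -> value
    groups = []     # (group key, [(key, value), ...]) in file order
    current = None  # entries of the group being filled, if any
    for raw in text.splitlines():
        if not raw.strip():
            continue  # assumption: blank lines carry no meaning
        parts = raw.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if raw[0].isspace():
            # ANY leading whitespace means the key belongs to a group entry.
            if current is None:
                raise ValueError("indented line outside a group: %r" % raw)
            current.append((key, value))
        elif key.endswith(":") and not value:
            # "<group key>:" on a line of its own opens a new group.
            current = []
            groups.append((key[:-1], current))
        else:
            settings[key] = value
            current = None
    return settings, groups


# Hypothetical sample, not taken from a real package.
SAMPLE = """\
name_english Danish
code da
author:
  name A. N. Author
  email author@example.org
dict:
  name da
  alias dansk
  add da
"""

settings, groups = parse_info(SAMPLE)
```

Groups are kept as a list instead of a dictionary because a group key
such as author may appear more than once, and keys such as alias and
add may repeat inside a single group.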

The following Single value settings are mandatory:

name_english: The English name of the language
code: The two letter code
copyright: The copyright, one of:
  LGPL
  GPL
  FDL
  Artistic
  Copyrighted (Copyright message must remain)
  Open Source (Meets OSI definition)
  Public Domain (ie none)
  Other
  Unknown
version: A version string
charset: charset to use
soundslike: one of
  none
  generic
  phonet
If it is phonet the file <code>_phonet.dat is expected to be present

In addition there must be at least one of each of the following group
entries:

author:
  name: The name of the author
  email: The email address of the author.

Multiple author groups may be specified.

dict:  The defining entry for a dictionary
  name: The name of this dict
  alias: An alternate name (may be repeated)
  add: A word list to add (may be repeated)

For right now there should only be one dict entry and its name should
be the same as the language code.  The proc script has the ability to
handle more than one but I have not worked out the details yet.
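
The mandatory entries described so far can be summarized as a
checklist.  The Python sketch below is one possible way to express it
and is not how the proc script does its checking.  The names check_info
and required_files are hypothetical, and it assumes that the
parenthetical remarks in the copyright list are explanations and not
part of the value.

```python
MANDATORY_SETTINGS = ("name_english", "code", "copyright", "version",
                      "charset", "soundslike")
# Assumption: "(Copyright message must remain)" and similar remarks are
# explanations, not part of the value itself.
COPYRIGHT_VALUES = {"LGPL", "GPL", "FDL", "Artistic", "Copyrighted",
                    "Open Source", "Public Domain", "Other", "Unknown"}
SOUNDSLIKE_VALUES = {"none", "generic", "phonet"}
MANDATORY_GROUPS = ("author", "dict")


def check_info(settings, groups):
    """Return a list of problems found; an empty list means none were found.

    settings: dict of single value settings
    groups:   list of (group key, [(key, value), ...]) pairs
    """
    problems = []
    for key in MANDATORY_SETTINGS:
        if key not in settings:
            problems.append("missing setting: " + key)
    if "copyright" in settings and settings["copyright"] not in COPYRIGHT_VALUES:
        problems.append("unknown copyright value: " + settings["copyright"])
    if "soundslike" in settings and settings["soundslike"] not in SOUNDSLIKE_VALUES:
        problems.append("unknown soundslike value: " + settings["soundslike"])
    present = [group_key for group_key, _ in groups]
    for group_key in MANDATORY_GROUPS:
        if group_key not in present:
            problems.append("missing group: " + group_key)
    # For right now a dict entry's name should be the language code.
    for group_key, entries in groups:
        if group_key == "dict":
            for key, value in entries:
                if key == "name" and value != settings.get("code"):
                    problems.append("dict name differs from code: " + value)
    return problems


def required_files(settings):
    """Extra files implied by the settings (only the phonet case is covered)."""
    if settings.get("soundslike") == "phonet":
        return [settings["code"] + "_phonet.dat"]
    return []
```

Returning a list of problems, and not stopping at the first one, lets a
word list author fix everything in one pass.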

In addition to the above, the info file can also contain the following
optional entries:

name_ascii: The language name spelled in its own language in all
ASCII characters
name_native: Like above but not limited to ASCII characters.

And a bunch of other entries which I will document later.

*** The *.cwl files

For each add entry in the dict entry there should in general be one
word list.  Each of these word lists will be compiled into a separate
hash file so you should keep the number to a minimum.  Each file is
expected to have a name of the following form:
  <code>[-...].cwl
These files are expected to be compressed with word-list-compress.  To
compress a file do something like the following:
  export LANG=C
  cat <word list> | sort -u | word-list-compress c > <code>...cwl
The LANG=C is important, otherwise the file will not be compressed
optimally.
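
Why the sort order matters is not spelled out here.  A plausible
explanation, assuming the compressor gains from neighbouring words
sharing leading characters, is that plain byte order keeps such words
next to each other while a locale-aware order may separate them.  The
Python sketch below imitates "LANG=C sort -u" and measures that
sharing.  The function names, the default charset, and the toy word
list are all made up for illustration, and the caseless sort is only a
stand-in for a locale-aware one.

```python
def c_locale_sort_unique(words, charset="iso-8859-1"):
    """Unique words ordered by their encoded bytes, as "LANG=C sort -u" would."""
    encoded = {word.encode(charset) for word in words}
    return [raw.decode(charset) for raw in sorted(encoded)]


def shared_prefix_total(words):
    """Sum of the leading characters each word shares with the word before it."""
    total = 0
    for previous, current in zip(words, words[1:]):
        n = 0
        while n < min(len(previous), len(current)) and previous[n] == current[n]:
            n += 1
        total += n
    return total


words = ["abd", "Abc", "Abe", "abd"]
byte_order = c_locale_sort_unique(words)
# Stand-in for a locale-aware order that ignores case on the first pass.
caseless_order = sorted(set(words), key=str.lower)

print(byte_order, shared_prefix_total(byte_order))
print(caseless_order, shared_prefix_total(caseless_order))
```

On this toy input the byte-ordered list keeps "Abc" and "Abe" adjacent,
whereas the caseless order puts "abd" between them.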

*** Copyright file

The copyright file simply states the terms under which this word list is
available.  If the license is a standard one or is more than a
paragraph or so the actual license should be included in a separate
file "COPYING".  If you are using one of the GNU licenses the COPYING
file will automatically be generated for you.

*** Running proc

Once the info and *.cwl files are created you are ready to run the
proc script.  To do so simply type:
  perl proc create
and if there are no errors you should have the generated files listed
above.

To try building a word list run configure with
  ./configure

and then to build and install it
  make
  make install

To create a distribution do a
  make dist

-- 
Kevin Atkinson
kevina at users sourceforge net
http://www.ibiblio.org/kevina/