Re: [aspell-devel] Thoughts on using aspell for Indian language ing
From: Jose Da Silva
Subject: Re: [aspell-devel] Thoughts on using aspell for Indian language ing
Date: Mon, 13 Nov 2006 16:19:14 -0800
User-agent: KMail/1.7.2
For other readers following the Unicode code points mentioned in these
threads, the Devanagari chart is at:
http://www.unicode.org/charts/PDF/U0900.pdf
On November 13, 2006 05:03 am, Kevin Atkinson wrote:
> On Mon, 13 Nov 2006, address@hidden wrote:
> > On 11:42:07 am 11/13/06 Kevin Atkinson <address@hidden> wrote:
> I do believe to truly handle this situation well some modifications
> will need to be made to Aspell. I suggest you start studying
> readonly_ws.cpp and suggest.cpp. A while ago I wrote some docs on
> how Aspell works:
> http://lists.gnu.org/archive/html/aspell-devel/2005-09/msg00007.html
> http://lists.gnu.org/archive/html/aspell-devel/2005-10/msg00000.html
> which may be helpful.
>
> I will get back to you later with some ideas on how to approach this
> issue. If you already thought of some please share them.
Just an idea, but maybe this could be made more modular, using plug-ins.
When spellchecking English you would use the English plug-ins, for Hindi
the Hindi plug-ins, for French the French plug-ins, and likewise for
other languages that may benefit from scoring methods that are not
modeled on English.
Each language probably has preferred methods for suggesting words,
whether based on soundslike, lookslike, swapping, consonants, syllables,
accents, halants, gender, etc.
For example, Aspell may like to strip accents and swap characters for
English but, as Gora indicates, it would perhaps benefit from a plug-in
geared towards Hindi that treats a group of characters as one unit and
then swaps that unit with another likely group of characters.
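To illustrate the plug-in idea, here is a rough Python sketch of a per-language registry. It is not Aspell code: the names SUGGESTERS, register, suggest and suggest_by_swaps are made up for this example, and the English "plug-in" only tries adjacent character swaps against a word list.

```python
# Hypothetical registry mapping a language code to a suggestion function.
# None of these names exist in Aspell; they only illustrate the idea.
SUGGESTERS = {}


def register(lang):
    """Decorator that installs a suggestion function for one language."""
    def wrap(fn):
        SUGGESTERS[lang] = fn
        return fn
    return wrap


@register("en")
def suggest_by_swaps(word, dictionary):
    """Suggest dictionary words reachable by swapping two adjacent characters."""
    swaps = {
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    }
    return sorted(w for w in dictionary if w in swaps)


def suggest(lang, word, dictionary):
    """Dispatch to the plug-in registered for lang."""
    if lang not in SUGGESTERS:
        raise LookupError("no suggestion plug-in registered for " + lang)
    return SUGGESTERS[lang](word, dictionary)
```

A Hindi plug-in would be registered the same way under "hi" and could use a completely different strategy, without the dispatcher having to know about it.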
--------------
Other ideas: from reading the list and viewing other points (for example,
Kevin mentioning Ethan's scoring being considered for maybe 0.61, some
changes I've made being in 0.61 and not in 0.60, etc.), I'm guessing
Kevin is trying to close off 0.60.x as a stable version with no big
changes beyond obvious bug fixes.
Perhaps the big changes, like Ethan's scoring and Gora's fixes that are
incorporated (and working) in the 0.60.3 version, could be diffed to
extract the modifications/fixes, which could then be applied towards the
0.61 version.
-------------------
Comments about the initial questions Gora asked in this thread:
> 1. I would like to volunteer to work on writing a proper C++
> interface to aspell. This would include a public interface that
> exposes only the normal spellchecking facilities in a class, as well
> as a testing interface that provides access to internals like the
> scores, weights, and even costs for computing edit distance. I
> already have something that makes the testing part available, but it
> is rather hacked up. If we can discuss what might be an interface
> that can get accepted into aspell, I would be glad to work on it.
> 2. I have done some more work on making bindings to aspell available
> in other programming languages, and, at present, Python, Perl and C#
> bindings are available, through SWIG. What I would like to do is
> first build a C++ class-based interface, and use that as a basis for
> a consistent interface across all languages. Besides the bindings,
> this would include example programs for using them, as well as GUI
> implementations in at least one language that provide a front-end to
> spellchecking, as well as to the testing framework.
C++ may be easier for some users, but if you want a binary-compatible
library, you should really try to stay with one language such as C, so
that it remains reasonably binary compatible with existing programs.
> 3. I see some major stumbling blocks in making aspell work properly
> with Indian languages. Perhaps the most significant one is that in
> Indian languages it makes sense to deal with syllables (a clump of
> consonants, possibly with vowel modifiers), rather than with
> individual characters. Thus, for example, edit distance operations
> should work on syllables. This is a little difficult, though not
> impossible, to do with the present, non-Unicode, internal functioning
> of aspell. One way would be to have a function inside score_list()
> that reconverts to Unicode, and works on syllables. However, it seems
> silly to do this, rather than having Unicode throughout. I am aware
> of Kevin's arguments for retaining the 128-character space used by
> aspell, but do not see a clean mechanism for handling complex scripts
> within such a framework. Comments on this would be appreciated.
As mentioned earlier, it's about 254 characters (+ linefeed + 0x00), but
on the surface it can look like Unicode U0900...U09ff.
In this case, it may make more sense to create special routines for
readonly_ws.cpp and suggest.cpp that treat each group of characters
forming a syllable as one unit. If this is done in some sort of plug-in
format, it may open the door to other languages and their special needs.
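To make the "treat a syllable as one unit" idea concrete, here is a rough Python sketch. It is not how Aspell works internally, and the function names split_syllables and edit_distance are made up for this example. It assumes a simplified clustering rule: a character joins the previous cluster if it is a combining mark (vowel sign, nukta, anusvara, halant, etc.) or if the previous cluster ends in a halant. Real Indic syllable rules are more involved.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari halant


def split_syllables(word):
    """Split a Devanagari word into approximate syllable clusters.

    Simplified assumption: combining marks attach to the previous
    cluster, and a consonant following a halant forms a conjunct.
    """
    clusters = []
    for ch in word:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if clusters and (is_mark or clusters[-1].endswith(VIRAMA)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def edit_distance(a, b):
    """Plain Levenshtein distance over two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]
```

Calling edit_distance(split_syllables(w1), split_syllables(w2)) then counts edits in whole syllables, while edit_distance(w1, w2) on the raw strings counts edits in individual code points.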
> 4. There are other niceties that would improve spellchecking in
> Indian languages, such as the use of a morphological analyser to
> identify the type of the word, and also its gender if it is a noun.
> This can however, probably be handled by a pre or post filters to
> normal aspell checking.
I would suggest you go ahead with the idea. Someone has to be the first
to try and build something, so if you have the energy and determination
to do it, please go ahead, and the other languages will eventually
follow. I would also suggest trying to keep your ideas universal so
that they can be modified or adapted for other languages.
Some of us do run out of steam, so hearing someone volunteer to do
something goes a long way. I suggest it is better to take Aspell as far
as you can, and possibly improve it, than to create yet another fork
(you can see various spell checkers out there in various states of
maintenance or disrepair; some have good ideas, while others ran out of
steam while being built).
Cheers!