silpa-discuss

Re: [silpa-discuss] spell checker for Tamil


From: Santhosh Thottingal
Subject: Re: [silpa-discuss] spell checker for Tamil
Date: Thu, 4 Oct 2012 23:30:15 +0530

Our spellchecker module does not use hunspell. The algorithm is an
optimized approach based on Levenshtein distance.
1. It can detect the language (SILPA has a common language detection
module), so a document with mixed scripts (say, Tamil with some English
words) can be spell checked.
2. For the suggestions, instead of returning every word within the edit
distance, we rank the candidates by phonetic similarity. This is based
on the assumption that in Indian languages, unlike English, spelling
errors are mostly phonetic variations.
3. It uses the same wordlists as the hunspell dictionaries for many
languages.
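
To make the three points above concrete, here is a rough sketch in
Python. It is not the SILPA code: the function names, the toy sound
classes and the romanized sample words are all assumptions made up for
illustration, and a real module would work on the native script with a
proper phonetic mapping.

```python
import unicodedata

def script_of(word):
    """Guess the script from the Unicode name of the first letter (assumed heuristic)."""
    for ch in word:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"

def levenshtein(a, b):
    """Plain dynamic-programming edit distance, kept row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical sound classes for romanized text: voiced/unvoiced pairs
# are folded together and 'h' (aspiration) is dropped.
SOUND_CLASSES = str.maketrans({"g": "k", "d": "t", "b": "p", "h": None})

def phonetic_key(word):
    """Fold similar sounds together and collapse repeated letters."""
    folded = word.lower().translate(SOUND_CLASSES)
    out = []
    for ch in folded:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def suggest(word, wordlist, max_distance=2):
    """Candidates within max_distance edits, ranked by phonetic closeness first."""
    key = phonetic_key(word)
    scored = []
    for cand in wordlist:
        d = levenshtein(word, cand)
        if d <= max_distance:
            scored.append((levenshtein(key, phonetic_key(cand)), d, cand))
    return [cand for _, _, cand in sorted(scored)]

words = ["padam", "pattam", "palam", "maram"]
print(script_of("தமிழ்"), script_of("hello"))
print(suggest("badam", words))
```

The edit distance only decides which words are candidates at all; the
order of the suggestions comes from the distance between the phonetic
keys, with the raw edit distance as a tie-breaker.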

I did many experiments with hunspell's agglutination and inflection
features. First, it was buggy for compound word formation. Second, it
was very difficult for me to figure out how to configure it to handle,
say, 4-level agglutination.

I still don't have a clear idea of how this challenge can be solved.
We need to build a lot of other language processing tools before we
approach this issue. For example, we need a lemmatizer/stemmer to find
root words and their inflections, and then to figure out agglutination
patterns. I checked with my friends from other (non-Indic) languages
about how they solved this. Essentially all of them had spent years
preparing root words and inflection/agglutination patterns in
semi-automatic ways. Sadly I did not get enough time to continue this
research.
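
Just to illustrate what "levels" of agglutination mean for a stemmer,
here is a toy suffix stripper in Python. Everything in it is an
assumption for the example: the romanized root and suffix lists are
hand-picked, and it ignores the sound changes that happen at morpheme
boundaries, which is exactly the hard part in practice.

```python
# Hypothetical, hand-picked romanized suffixes (plural, locative, "also").
SUFFIXES = ["um", "il", "kal"]

def strip_suffixes(word, roots, suffixes=SUFFIXES, max_depth=4):
    """Peel stacked suffixes off a word until a known root remains.

    Returns (root, [suffixes in surface order]), or None when no
    analysis is found within max_depth levels.
    """
    if word in roots:
        return word, []
    if max_depth == 0:
        return None
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf):
            result = strip_suffixes(word[:-len(suf)], roots,
                                    suffixes, max_depth - 1)
            if result is not None:
                root, chain = result
                return root, chain + [suf]
    return None

roots = {"veedu"}  # toy root list with a single entry
print(strip_suffixes("veedukalilum", roots))
print(strip_suffixes("veedukalilum", roots, max_depth=2))
```

With the default depth the word is split into a root plus three stacked
suffixes; with max_depth=2 no analysis is found, which shows why the
number of supported levels matters.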

I heard that for Tamil there are some proprietary spellcheckers with
reasonably good results. I don't know whether the algorithms they use
have been published.

The hunspell Hindi spellchecker dictionary has some word formation
rules in its suffix file. Did you check that? Tamil is more complex,
since spellchecking involves the context as well (as in a word gaining
a final letter duplicated from the next word, e.g. Tamilp-padam).


A small example of SILPA spellchecker
http://thottingal.in/projects/spellchecker/

A document I prepared based on some study of Malayalam grammar:
http://thottingal.in/documents/MalayalamComputingChallenges.pdf . It
is quite outdated and focuses on Malayalam, but you can find some ideas
in it.


Thanks
Santhosh

2012/10/4 Shrinivasan T <address@hidden>:
> I am exploring hunspell for adding more words and rules.
>
> But people have different opinions on using it.
>
> What do you guys think of it?
>
> What does SILPA use for spell checking?


