
[silpa-discuss] Dictionary Index Generator script. v2


From: Vasudev Kamath
Subject: [silpa-discuss] Dictionary Index Generator script. v2
Date: Tue, 27 Apr 2010 18:59:13 +0530
User-agent: KMail/1.12.4 (Linux/2.6.33-2.slh.10-sidux-686; KDE/4.3.4; i686; ; )

Hi all,
As Santhosh mentioned, the new script now saves the index file as a 
human-readable file in the following format:
A=1
B=2000
...
I've tested it with multiple languages and it works fine. I'm attaching the new 
script; please test it and let me know of any bugs.
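
For anyone who wants to see the idea without opening the attachment, here is 
a minimal sketch of writing and reading an index in the KEY=VALUE format shown 
above. It is not the attached indexer_v2.py. It assumes that each value is the 
1-based line number of the first word starting with that character, and the 
function names build_index, dump_index and load_index are made up for this 
illustration.

```python
def build_index(words):
    """Map each first character to the 1-based line of its first word.

    Assumption: the value stored in the index is a line number. The real
    script may store something else, such as a byte offset.
    """
    index = {}
    for line_number, word in enumerate(words, start=1):
        word = word.strip()
        if word:
            # setdefault keeps only the first line seen for each character.
            index.setdefault(word[0], line_number)
    return index


def dump_index(index):
    """Serialise the index as one KEY=VALUE pair per line."""
    return "".join(f"{key}={value}\n" for key, value in index.items())


def load_index(text):
    """Parse KEY=VALUE lines back into a dict, skipping anything else."""
    index = {}
    for line in text.splitlines():
        # rpartition splits on the last '=', so a key of '=' still parses.
        key, sep, value = line.rpartition("=")
        if sep and key:
            index[key] = int(value)
    return index


words = ["Apple", "Avocado", "Banana", "Cherry"]
index = build_index(words)
text = dump_index(index)
```

Here text holds the three lines A=1, B=3 and C=4, and load_index(text) gives 
back the same mapping.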

The English dictionary file can now be converted to UTF-8 with the 
following command:

iconv -f ISO-8859-1 -t UTF-8 en_US.dic > en_US_utf-8.dic 

The new file can then be renamed and moved to the proper English directory. 
After the UTF-8 conversion, codecs.open with utf-8 works fine and all words 
are read without any issues. 
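
The same conversion can be sketched in Python for systems without iconv. This 
is only an illustration: it uses the built-in open with an encoding argument 
(which behaves like codecs.open for this purpose), the helper names 
convert_latin1_to_utf8 and read_words are hypothetical, and the sample file is 
a two-word stand-in, not the real en_US.dic.

```python
import os
import tempfile


def convert_latin1_to_utf8(src_path, dst_path):
    """Re-encode a text file from ISO-8859-1 to UTF-8, line by line."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="iso-8859-1", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


def read_words(path):
    """Read a UTF-8 word list, one word per line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo on a throwaway file containing one non-ASCII word.
with tempfile.TemporaryDirectory() as workdir:
    src_path = os.path.join(workdir, "sample_latin1.dic")
    dst_path = os.path.join(workdir, "sample_utf-8.dic")
    with open(src_path, "wb") as handle:
        handle.write(b"caf\xe9\nhello\n")  # "café" encoded as ISO-8859-1
    convert_latin1_to_utf8(src_path, dst_path)
    words = read_words(dst_path)
    with open(dst_path, "rb") as handle:
        raw = handle.read()
```
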

Santhosh mentioned that after converting en_US to UTF-8, the spell checker 
module was flagging words that are present in the dictionary as misspelled. 
I found that the issue is not caused by converting en_US to UTF-8. 
Here is what is actually going on:

We changed the spell checker module to convert input words to lower case. 
For example, the input Hello was flagged as wrong because the dictionary 
contains only hello, so to deal with this we convert input words to lower 
case. This gave rise to a new issue.

New issue: the input AOL gets converted to aol before it is checked against 
the dictionary, but the dictionary contains only AOL and not aol, so AOL is 
flagged as a wrong spelling.
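
The behaviour can be reproduced with a tiny stand-in for the dictionary. This 
is a sketch and not the Silpa spell checker code; the two-word dictionary and 
the function names check_exact and check_lowercased are made up for the 
example.

```python
# Stand-in dictionary: one ordinary word, one all-caps entry.
dictionary = {"hello", "AOL"}


def check_exact(word, dictionary):
    """Original behaviour: look the input up exactly as typed."""
    return word in dictionary


def check_lowercased(word, dictionary):
    """Current behaviour: lower-case the input before the lookup."""
    return word.lower() in dictionary


# Exact lookup rejects "Hello" because only "hello" is stored.
hello_exact = check_exact("Hello", dictionary)
# Lower-casing fixes "Hello" ...
hello_lower = check_lowercased("Hello", dictionary)
# ... but turns "AOL" into "aol", which is not in the dictionary.
aol_lower = check_lowercased("AOL", dictionary)
```
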

We need to come up with a new strategy to deal with this. Note that this 
issue affects only the en_US dictionary.
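
One possible direction, offered only as a starting point for discussion and 
not as something already implemented in Silpa, is to try the word exactly as 
typed first and fall back to the lower-cased form only when that fails. The 
function name is_known and the small dictionary below are hypothetical.

```python
def is_known(word, dictionary):
    """Accept a word if it matches as typed, or after lower-casing.

    The exact lookup keeps entries such as "AOL" working, and the
    lower-case fallback keeps "Hello" matching the stored "hello".
    """
    return word in dictionary or word.lower() in dictionary


dictionary = {"hello", "AOL"}
results = {word: is_known(word, dictionary)
           for word in ["hello", "Hello", "AOL", "aol"]}
```

With this sketch hello, Hello and AOL are accepted while aol is still rejected, 
which seems reasonable for an all-caps entry. A known limitation is that the 
fallback also accepts odd casing such as HeLLo, so a stricter rule may be 
needed.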

I'm going to work on integrating this new indexing approach with Silpa. Once 
I have working code, I'll share it here.

Thanks and Regards
Vasudev Kamath

Attachment: indexer_v2.py
Description: Text Data

