[silpa-discuss] Dictionary Index Generator script. v2
From: Vasudev Kamath
Subject: [silpa-discuss] Dictionary Index Generator script. v2
Date: Tue, 27 Apr 2010 18:59:13 +0530
User-agent: KMail/1.12.4 (Linux/2.6.33-2.slh.10-sidux-686; KDE/4.3.4; i686; ; )
Hi all,
As Santhosh mentioned, the new script now saves the index file as a
human-readable file in the following format:
A=1
B=2000
...
I've tested it with multiple languages and it works fine. I'm attaching the new
script; please test it and let me know of any bugs.
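For readers without the attachment, here is a minimal sketch of how an index in this format could be produced. It is not the attached indexer_v2.py. It assumes each value is the 1-based line number of the first word that starts with the given character, and the names build_index and format_index are hypothetical.

```python
def build_index(lines):
    """Map each initial character to the 1-based line number where it first appears.

    Assumption: the index value is a line number, not a byte offset.
    """
    index = {}
    for line_number, line in enumerate(lines, start=1):
        word = line.strip()
        if not word:
            continue  # skip blank lines
        # setdefault keeps the first line number seen for each character
        index.setdefault(word[0], line_number)
    return index


def format_index(index):
    """Render the index as KEY=VALUE lines, one entry per line."""
    return "\n".join(f"{key}={value}" for key, value in sorted(index.items()))


sample_words = ["AOL", "Abel", "Bach", "apple"]
index_text = format_index(build_index(sample_words))
print(index_text)
```

Note that this sketch keeps upper-case and lower-case initials as separate keys, which may or may not match what the real script does.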
The English dictionary file can now be converted to UTF-8 with the
following command:
iconv -f ISO-8859-1 -t UTF-8 en_US.dic > en_US_utf-8.dic
The new file can then be renamed and moved to the proper English directory.
After the UTF-8 conversion, codecs.open with utf-8 works fine and all words
are read without any issues.
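The same conversion and read-back can be sketched in Python. This is only an illustration: it uses the built-in open with an encoding argument in place of codecs.open, and the helper names convert_to_utf8, read_words and demo are hypothetical, as is the tiny sample file.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="iso-8859-1"):
    """Re-encode a text file to UTF-8, line by line."""
    # newline="" leaves the original line endings untouched
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


def read_words(path):
    """Read the non-empty lines of a UTF-8 dictionary file as words."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def demo():
    """Convert a small ISO-8859-1 sample file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.dic")
        dst = os.path.join(tmp, "sample_utf-8.dic")
        with open(src, "wb") as handle:
            handle.write("café\nhello\n".encode("iso-8859-1"))
        convert_to_utf8(src, dst)
        return read_words(dst)


print(demo())
```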
Santhosh mentioned that after converting en_US to UTF-8, the spell checker
module was flagging words that are present in the dictionary as misspelled.
I found that the issue is not caused by converting en_US to UTF-8.
Here is what is actually going on:
We changed the spell checker module to convert input words to lower case.
For example, the input Hello was flagged as wrong because the dictionary
contains only hello, so we started converting input words to lower case.
This gave rise to a new issue.
New issue: the input AOL gets converted to aol before being checked against
the dictionary, but the dictionary contains only AOL and not aol, so aol is
flagged as a misspelling.
We need to come up with a new strategy to deal with this. Note that this
issue affects only the en_US dictionary.
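One possible strategy, offered only as a starting point for discussion and not as what Silpa does: check the word exactly as typed first, and fall back to the lower-case form only when the input looks like ordinary capitalization (Hello or HELLO). The function name is_known and the in-memory set standing in for the dictionary are hypothetical.

```python
def is_known(word, dictionary):
    """Return True if the word should be accepted as correctly spelled.

    Assumption: `dictionary` is a set of words exactly as they appear
    in the .dic file, with their original case.
    """
    # Exact match first, so entries like AOL are found as written
    if word in dictionary:
        return True
    # Fall back to lower case only for Capitalized or ALL-CAPS input
    if word == word.capitalize() or word.isupper():
        return word.lower() in dictionary
    return False


dictionary = {"hello", "AOL"}
for candidate in ["Hello", "hello", "HELLO", "AOL", "aol"]:
    print(candidate, is_known(candidate, dictionary))
```

With this rule the input AOL matches the dictionary entry directly, Hello still matches hello, and a lower-case aol is still reported as misspelled because only AOL is listed.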
I'm going to work on integrating this new indexing approach with Silpa. Once
I have working code, I'll share it here.
Thanks and Regards
Vasudev Kamath
indexer_v2.py
Description: Text Data